Module 3 Notes

This document discusses C compilers and optimization techniques specific to ARM architecture, focusing on data types, local variable types, function argument types, and the implications of signed versus unsigned operations. It emphasizes efficient use of data types, loop structures, register allocation, function calls, pointer aliasing, and structure arrangement for optimal performance. The document provides guidelines for coding practices to enhance performance and reduce overhead in ARM-based systems.

MODULE 3

C Compilers and Optimization


3.1 Basic C Data Types
 ARM processors have 32-bit registers and 32-bit data processing operations.
 The ARM architecture is a RISC load/store architecture. In other words you must load values from
memory into registers before acting on them. There are no arithmetic or logical instructions that
manipulate values in memory directly.

 Early versions of the ARM architecture (ARMv1 to ARMv3) provided hardware support for loading
and storing unsigned 8-bit and unsigned or signed 32-bit values.

 The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores directly, through
new instructions.

Finally, ARMv5 adds instruction support for 64-bit loads and stores. This is available in
ARM9E and later cores.
 Prior to ARMv4, ARM processors were not good at handling signed 8-bit or any 16-bit values.
Therefore ARM C compilers define char to be an unsigned 8-bit value, rather than a signed 8-bit value
as is typical in many other compilers.
3.1.1 LOCAL VARIABLE TYPES

 ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data. However, most ARM
data processing operations are 32-bit only. For this reason, you should use a 32-bit datatype, int or long,
for local variables wherever possible.

Avoid using char and short as local variable types, even if you are manipulating an 8- or 16-bit value. If you
require modulo arithmetic of the form 255 + 1 = 0, then use the char type.

 The following code checksums a data packet containing 64 words. It shows why you should avoid using
char for local variables.
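The original listing is not reproduced in these notes; a sketch of the kind of routine described (the name checksum_v1 follows the numbering used later in the text) is:

```c
/* Checksum of a 64-word packet. Declaring the counter i as char is
   the mistake under discussion: because i must wrap at 8 bits, the
   compiler emits an extra instruction to truncate i to 0..255 on
   every loop iteration. */
int checksum_v1(int *data)
{
    char i;        /* 8-bit loop counter: forces extra truncation */
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}
```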

Now compare this to the compiler output where instead we declare i as an unsigned int.
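In C source terms (the assembly output itself is not reproduced here), the improved version might look like:

```c
/* Same checksum with a 32-bit counter: the compiler keeps i in a
   register with no truncation code, so the loop body is shorter. */
int checksum_v2(int *data)
{
    unsigned int i;    /* 32-bit counter: no narrowing required */
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}
```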

Next, suppose the data packet contains 16-bit values and we need a 16-bit checksum. It is
tempting to write the following C code:
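A sketch of that tempting (but slower) 16-bit version, reconstructed from the description below:

```c
/* 16-bit checksum over 64 halfwords. The cast back to short in the
   accumulation forces the compiler to sign-extend sum on every
   iteration (two shift instructions on ARM). */
short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;

    for (i = 0; i < 64; i++) {
        sum = (short)(sum + data[i]);
    }
    return sum;
}
```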

The loop is now three instructions longer than the loop in the checksum_v2 example earlier! There are
two reasons for the extra instructions:


The LDRH instruction does not allow for a shifted address offset as the LDR instruction did in
checksum_v2. Therefore the first ADD in the loop calculates the address of item i in the array. The
LDRH loads from an address with no offset. LDRH has fewer addressing modes than LDR as it
was a later addition to the ARM instruction set. (See Table 5.1.)

The cast reducing total + array[i] to a short requires two MOV instructions. The compiler shifts
left by 16 and then right by 16 to implement a 16-bit sign extend. The shift right is a sign-
extending shift so it replicates the sign bit to fill the upper 16 bits.
3.1.2 FUNCTION ARGUMENT TYPES

 Consider the following simple function, which adds two 16-bit values, halving the second,
and returns a 16-bit sum:
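The function under discussion is along these lines (the name add_v1 matches the one used below; the exact listing is not reproduced here):

```c
/* Adds two 16-bit values, halving the second, and returns a
   16-bit sum. */
short add_v1(short a, short b)
{
    return a + (b >> 1);
}
```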

The input values a, b, and the return value will be passed in 32-bit ARM registers. Should the
compiler assume that these 32-bit values are in the range of a short type, that is, −32,768 to +32,767?
Or should the compiler force values to be in this range by sign-extending the lowest 16 bits to fill the 32-bit
register?
 The compiler must make compatible decisions for the function caller and callee. Either the caller or callee
must perform the cast to a short type.
Function arguments are passed wide if they are not reduced to the range of the type, and narrow if they
are reduced to the range of the type.
We can tell which decision the compiler has made by looking at the assembly output for add_v1.

If the compiler passes arguments wide, then the callee must reduce function arguments to
the correct range.

If the compiler passes arguments narrow, then the caller must reduce the range.

If the compiler returns values wide, then the caller must reduce the return value to the
correct range.

If the compiler returns values narrow, then the callee must reduce the range before returning
the value.

The gcc compiler we used is more cautious and makes no assumptions about the range of
argument values. This version of the compiler reduces the input arguments to the range of
a short in both the caller and the callee. It also casts the return value to a short type. Here
is the compiled code for add_v1:
3.1.3 SIGNED VERSUS UNSIGNED TYPES
If your code uses addition, subtraction, and multiplication, then there is no performance
difference between signed and unsigned operations. However, there is a difference when it
comes to division. Consider the following short example that averages two integers:

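A minimal sketch of such an averaging routine:

```c
/* Average of two signed integers. Because the operands are signed,
   the compiler cannot replace the divide by a plain right shift. */
int average_v1(int a, int b)
{
    return (a + b) / 2;
}
```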

Notice that the compiler adds one to the sum before shifting right if the sum is
negative. In other words, it replaces x/2 by the statement:

(x<0)? ((x+1) >> 1): (x >> 1)

It must do this because x is signed. In C on an ARM target, a divide by two is not a right shift if x is
negative. For example, −3 >>1 = −2 but −3/2 = −1. Division rounds towards zero, but arithmetic right
shift rounds towards −∞.
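The rounding difference can be checked directly. (Right-shifting a negative value is implementation-defined in C; ARM compilers, like gcc, use an arithmetic shift.)

```c
/* Division rounds towards zero; arithmetic right shift rounds
   towards minus infinity, so the two disagree for negative x. */
int div_by_two(int x) { return x / 2;  }
int asr_by_one(int x) { return x >> 1; }   /* arithmetic shift assumed */
```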

3.2 C Looping Structures

Loops with a Fixed Number of Iterations


Here is the last version of the 64-word packet checksum routine we studied earlier. This shows
how the compiler treats a loop with an incrementing count i++.
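A sketch of that incrementing version (the name checksum_v5 is an assumption, continuing the earlier numbering):

```c
/* Incrementing loop counter: the compiler needs an ADD, a CMP
   against 64, and a conditional branch on every iteration. */
int checksum_v5(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += *(data++);
    }
    return sum;
}
```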

It takes three instructions to implement the for loop structure:


■ An ADD to increment i
■ A compare to check if i is less than 64
■ A conditional branch to continue the loop if i < 64

This is not efficient.

On the ARM, a loop should only use two instructions:


■ A subtract to decrement the loop counter, which also sets the condition code flags on
the result
■ A conditional branch instruction

The key point is that the loop counter should count down to zero rather than counting up to
some arbitrary limit. Then the comparison with zero is free since the result is stored in the
condition flags. Since we are no longer using i as an array index, there is no problem in
counting down rather than up.
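Rewritten with a decrementing counter (the name checksum_v6 is assumed, continuing the numbering):

```c
/* Decrementing loop counter: a single SUBS both decrements i and
   sets the condition flags, so no separate compare is needed. */
int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--) {
        sum += *(data++);
    }
    return sum;
}
```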
Signed and Unsigned Loop Counter

• For an unsigned loop counter i we can use either of the loop continuation conditions i != 0 or
i > 0. As i can't be negative, they are the same condition.
• For a signed loop counter, it is tempting to use the condition i > 0 to continue the loop, but
this costs an extra instruction compared to i != 0.

The compiler is not being inefficient. It must be careful about the case when i = -0x80000000 because
the two sections of code generate different answers in this case. For the first piece of code the SUBS
instruction compares i with 1 and then decrements i. Since -0x80000000 < 1, the loop terminates. For
the second piece of code, we decrement i and then compare with 0. Modulo arithmetic means that i
now has the value +0x7fffffff, which is greater than zero. Thus the loop continues for many iterations.
Of course, in practice, i rarely takes the value -0x80000000.

Therefore you should use the termination condition i!=0 for signed or unsigned loop counters. It saves
one instruction over the condition i>0 for signed i.
Loops Using a Variable Number of Iterations

Now suppose we want our checksum routine to handle packets of arbitrary size. We pass in a variable
N giving the number of words in the data packet.
The checksum_v7 example shows how the compiler handles a for loop with a variable number of
iterations N.
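A sketch of checksum_v7 (the exact listing is not reproduced here):

```c
/* Variable-length checksum. With a for loop the compiler must test
   N on entry, in case the loop body should not run at all. */
int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;

    for (; N != 0; N--) {
        sum += *(data++);
    }
    return sum;
}
```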

Notice that the compiler checks that N is nonzero on entry to the function. Often this check is unnecessary
since you know that the array won't be empty. In this case a do-while loop gives better performance
and code density than a for loop.

The following example shows how to use a do-while loop to remove the test for N being zero that
occurs in a for loop.
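One possible do-while form (the name checksum_v8 is assumed; the caller must guarantee N >= 1):

```c
/* do-while version: the body runs before the test, so the initial
   check of N against zero disappears. Only valid for N >= 1. */
int checksum_v8(int *data, unsigned int N)
{
    int sum = 0;

    do {
        sum += *(data++);
    } while (--N != 0);
    return sum;
}
```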

3.2.1 Loop Unrolling

In a decrement loop, each loop iteration costs two instructions in addition to the body of the loop:
• a subtract to decrement the loop count, and
• a conditional branch.

We call these instructions the loop overhead.

• On ARM7 or ARM9 processors, the subtract takes one cycle and the branch takes three cycles,
giving an overhead of four cycles per loop.
• You can save some of these cycles by unrolling a loop—repeating the loop body several times,
and reducing the number of loop iterations by the same proportion. For example, let’s unroll
our packet checksum example four times.
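One way to write the four-fold unrolled checksum (this sketch assumes N is a nonzero multiple of four):

```c
/* Loop unrolled four times: one subtract and one branch now cover
   four accumulations, quartering the loop overhead. */
int checksum_v9(int *data, unsigned int N)
{
    int sum = 0;

    do {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;            /* N must be a nonzero multiple of 4 */
    } while (N != 0);
    return sum;
}
```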

There are two questions you need to ask when unrolling a loop:

■ How many times should I unroll the loop?


■ What if the number of loop iterations is not a multiple of the unroll amount? For example, what if N
is not a multiple of four in checksum_v9?

To start with the first question, only unroll loops that are important for the overall performance of the
application. Otherwise unrolling will increase the code size with little performance benefit. Unrolling
may even reduce performance by evicting more important code from the cache.

For the second question, try to arrange it so that array sizes are multiples of your unroll amount. If this
isn't possible, then you must add extra code to take care of the leftover cases. This increases the code
size a little but keeps the performance high.
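A sketch of that leftover-case handling (the name checksum_v10 is hypothetical):

```c
/* Unrolled main loop plus a short cleanup loop for the remaining
   0..3 words when N is not a multiple of four. */
int checksum_v10(int *data, unsigned int N)
{
    int sum = 0;

    while (N >= 4) {       /* four words per iteration */
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    }
    while (N != 0) {       /* at most three leftover words */
        sum += *(data++);
        N--;
    }
    return sum;
}
```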
3.3 Register Allocation

The compiler attempts to allocate a processor register to each local variable you use in a C
function. It will try to use the same register for different local variables if the uses of the
variables do not overlap.

When there are more local variables than available registers, the compiler stores the excess
variables on the processor stack. These variables are called spilled or swapped-out variables
since they are written out to memory (in a similar way to how virtual memory is swapped out to disk).

■ Try to limit the number of local variables in the internal loop of functions to 12. The compiler
should be able to allocate these to ARM registers.

3.4 Function Calls

• The ARM Procedure Call Standard (APCS) defines how to pass function arguments and return
values in ARM registers. The more recent ARM-Thumb Procedure Call Standard (ATPCS)
covers ARM and Thumb interworking as well.
• The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and r3.
Subsequent integer arguments are placed on the full descending stack, ascending in memory as
in Figure 5.1. Function return integer values are passed in r0.
• Two-word arguments such as long long or double are passed in a pair of consecutive argument
registers and returned in r0, r1.

Four-register rule: functions taking four or fewer integer arguments are considerably more efficient to
call, since all of their arguments can be passed in registers r0 to r3.
3.5 POINTER ALIASING

• Two pointers are said to alias when they point to the same address.
• If you write to one pointer, it will affect the value you read from the other pointer.
• In a function, the compiler often doesn’t know which pointers can alias and which pointers
can’t.
• The compiler must be very pessimistic and assume that any write to a pointer may affect the
value read from any other pointer, which can significantly reduce code efficiency.
Note that the compiler loads from *step twice. Usually a compiler optimization called common
subexpression elimination would kick in so that *step was only evaluated once, and the value reused
for the second occurrence. However, the compiler can't use this optimization here: the pointers timer1
and step might alias one another. In other words, the compiler cannot be sure that the write to *timer1
doesn't affect the read from *step.
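The situation described can be sketched like this (timers_v1 follows the text's variable names; timers_v2 shows one common fix):

```c
/* timer1 and step might point to the same word, so the compiler
   must reload *step after the first store. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;      /* may alias *step ...          */
    *timer2 += *step;      /* ... so *step is loaded again */
}

/* Fix: read the step value into a local variable once. Locals
   cannot alias, so only one load is generated. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int delta = *step;
    *timer1 += delta;
    *timer2 += delta;
}
```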

3.6 Structure Arrangement


The way you lay out a frequently used structure can have a significant impact on its performance and
code density. There are two issues concerning structures on the ARM:
 alignment of the structure entries and
 the overall size of the structure
 ARM compilers will automatically align the start address of a structure to a multiple of the
largest access width used within the structure (usually four or eight bytes) and align entries
within structures to their access width by inserting padding.
For example, consider the structure
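The original structure listing is not reproduced here; a representative example with poor member ordering (field names are illustrative) is:

```c
/* Under typical 32-bit alignment rules the compiler pads this
   structure: 3 bytes after a (to align the int) and 1 byte after c
   (to align the short), giving 12 bytes instead of 8. */
struct badly_ordered {
    char  a;    /* 1 byte + 3 bytes padding */
    int   b;    /* 4 bytes                  */
    char  c;    /* 1 byte + 1 byte padding  */
    short d;    /* 2 bytes                  */
};
```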

The __packed keyword removes this padding; however, packed structures are slow and inefficient to access.


 The compiler emulates unaligned load and store operations by using several aligned accesses with data
operations to merge the results.
Only use the __packed keyword where space is far more important than speed and you can't reduce padding
by rearrangement.
The following rules generate a structure with the elements packed for maximum efficiency:
 Place all 8-bit elements at the start of the structure.
 Place all 16-bit elements next, then 32-bit, then 64-bit.
 Place all arrays and larger elements at the end of the structure.
If the structure is too big for a single instruction to access all the elements, then group the elements into
substructures. The compiler can maintain pointers to the individual substructures.
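As an illustration of these rules, a structure with 8-, 16-, and 32-bit members ordered smallest-first packs with no padding (field names are hypothetical):

```c
/* Ordered 8-bit, then 16-bit, then 32-bit: every member falls on
   its natural alignment, so no padding is inserted and the total
   size is just the sum of the member sizes (8 bytes). */
struct well_ordered {
    char  a;    /* 8-bit members first */
    char  c;
    short d;    /* then 16-bit         */
    int   b;    /* then 32-bit         */
};
```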
