Module 3 Notes

This document discusses C compilers and optimization techniques specific to ARM architecture, focusing on data types, local variable types, function argument types, and the implications of signed versus unsigned operations. It emphasizes efficient use of data types, loop structures, register allocation, function calls, pointer aliasing, and structure arrangement for optimal performance. The document provides guidelines for coding practices to enhance performance and reduce overhead in ARM-based systems.

MODULE 3

C Compilers and Optimization


3.1 Basic C Data Types
 ARM processors have 32-bit registers and 32-bit data processing operations.
 The ARM architecture is a RISC load/store architecture. In other words you must load values from
memory into registers before acting on them. There are no arithmetic or logical instructions that
manipulate values in memory directly.

 Early versions of the ARM architecture (ARMv1 to ARMv3) provided hardware support for loading
and storing unsigned 8-bit and unsigned or signed 32-bit values.

 The ARMv4 architecture and above support signed 8-bit and 16-bit loads and stores directly, through
new instructions.

Finally, ARMv5 adds instruction support for 64-bit loads and stores. This is available in
ARM9E and later cores.
 Prior to ARMv4, ARM processors were not good at handling signed 8-bit or any 16-bit values.
Therefore ARM C compilers define char to be an unsigned 8-bit value, rather than a signed 8-bit value
as is typical in many other compilers.
3.1.1 LOCAL VARIABLE TYPES

 ARMv4-based processors can efficiently load and store 8-, 16-, and 32-bit data. However, most ARM
data processing operations are 32-bit only. For this reason, you should use a 32-bit datatype, int or long,
for local variables wherever possible.

Avoid using char and short as local variable types, even if you are manipulating an 8- or 16-bit value. If you
require modulo arithmetic of the form 255 + 1 = 0, then use the char type.

 The following code checksums a data packet containing 64 words. It shows why you should avoid using
char for local variables.
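The original listing is not reproduced in these notes; a sketch of the kind of routine described (the name checksum_v1 follows the numbering used later in the text) is:

```c
/* Checksum of a 64-word packet. Declaring the counter i as char is
   the mistake under discussion: because i must wrap at 8 bits, the
   compiler emits an extra instruction to truncate i to 0..255 on
   every loop iteration. */
int checksum_v1(int *data)
{
    char i;        /* 8-bit loop counter: forces extra truncation */
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}
```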

Now compare this to the compiler output where instead we declare i as an unsigned int.
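In C source terms (the assembly output itself is not reproduced here), the improved version might look like:

```c
/* Same checksum with a 32-bit counter: the compiler keeps i in a
   register with no truncation code, so the loop body is shorter. */
int checksum_v2(int *data)
{
    unsigned int i;    /* 32-bit counter: no narrowing required */
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += data[i];
    }
    return sum;
}
```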

Next, suppose the data packet contains 16-bit values and we need a 16-bit checksum. It is
tempting to write the following C code:
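A sketch of that tempting (but slower) 16-bit version, reconstructed from the description below:

```c
/* 16-bit checksum over 64 halfwords. The cast back to short in the
   accumulation forces the compiler to sign-extend sum on every
   iteration (two shift instructions on ARM). */
short checksum_v3(short *data)
{
    unsigned int i;
    short sum = 0;

    for (i = 0; i < 64; i++) {
        sum = (short)(sum + data[i]);
    }
    return sum;
}
```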

The loop is now three instructions longer than the loop in the checksum_v2 example earlier! There are
two reasons for the extra instructions:


The LDRH instruction does not allow for a shifted address offset as the LDR instruction did in
checksum_v2. Therefore the first ADD in the loop calculates the address of item i in the array. The
LDRH loads from an address with no offset. LDRH has fewer addressing modes than LDR as it
was a later addition to the ARM instruction set. (See Table 5.1.)

The cast reducing total + array[i] to a short requires two MOV instructions. The compiler shifts
left by 16 and then right by 16 to implement a 16-bit sign extend. The shift right is a sign-
extending shift so it replicates the sign bit to fill the upper 16 bits.
3.1.2 FUNCTION ARGUMENT TYPES

 Consider the following simple function, which adds two 16-bit values, halving the second,
and returns a 16-bit sum:
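The function under discussion is along these lines (the name add_v1 matches the one used below; the exact listing is not reproduced here):

```c
/* Adds two 16-bit values, halving the second, and returns a
   16-bit sum. */
short add_v1(short a, short b)
{
    return a + (b >> 1);
}
```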

The input values a, b, and the return value will be passed in 32-bit ARM registers. Should the
compiler assume that these 32-bit values are in the range of a short type, that is, −32,768 to +32,767?
Or should the compiler force values to be in this range by sign-extending the lowest 16 bits to fill the 32-bit
register?
 The compiler must make compatible decisions for the function caller and callee. Either the caller or callee
must perform the cast to a short type.
Function arguments are passed wide if they are not reduced to the range of the type, and narrow if they
are reduced to the range of the type.
We can tell which decision the compiler has made by looking at the assembly output for add_v1.

If the compiler passes arguments wide, then the callee must reduce function arguments to
the correct range.

If the compiler passes arguments narrow, then the caller must reduce the range.

If the compiler returns values wide, then the caller must reduce the return value to the
correct range.

If the compiler returns values narrow, then the callee must reduce the range before returning
the value.

The gcc compiler we used is more cautious and makes no assumptions about the range of
argument values. This version of the compiler reduces the input arguments to the range of
a short in both the caller and the callee. It also casts the return value to a short type. Here
is the compiled code for add_v1:
3.1.3 SIGNED VERSUS UNSIGNED TYPES
If your code uses addition, subtraction, and multiplication, then there is no performance
difference between signed and unsigned operations. However, there is a difference when it
comes to division. Consider the following short example that averages two integers:

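A minimal sketch of such an averaging routine:

```c
/* Average of two signed integers. Because the operands are signed,
   the compiler cannot replace the divide by a plain right shift. */
int average_v1(int a, int b)
{
    return (a + b) / 2;
}
```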

Notice that the compiler adds one to the sum before shifting right if the sum is
negative. In other words, it replaces x/2 by the statement:

(x<0)? ((x+1) >> 1): (x >> 1)

It must do this because x is signed. In C on an ARM target, a divide by two is not a right shift if x is
negative. For example, −3 >>1 = −2 but −3/2 = −1. Division rounds towards zero, but arithmetic right
shift rounds towards −∞.
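The rounding difference can be checked directly. (Right-shifting a negative value is implementation-defined in C; ARM compilers, like gcc, use an arithmetic shift.)

```c
/* Division rounds towards zero; arithmetic right shift rounds
   towards minus infinity, so the two disagree for negative x. */
int div_by_two(int x) { return x / 2;  }
int asr_by_one(int x) { return x >> 1; }   /* arithmetic shift assumed */
```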

3.2 C Looping Structures

Loops with a Fixed Number of Iterations


Here is the last version of the 64-word packet checksum routine we studied earlier. This shows
how the compiler treats a loop with an incrementing count i++.
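A sketch of that incrementing version (the name checksum_v5 is an assumption, continuing the earlier numbering):

```c
/* Incrementing loop counter: the compiler needs an ADD, a CMP
   against 64, and a conditional branch on every iteration. */
int checksum_v5(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += *(data++);
    }
    return sum;
}
```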

It takes three instructions to implement the for loop structure:


■ An ADD to increment i
■ A compare to check if i is less than 64
■ A conditional branch to continue the loop if i < 64

This is not efficient.

On the ARM, a loop should only use two instructions:


■ A subtract to decrement the loop counter, which also sets the condition code flags on
the result
■ A conditional branch instruction

The key point is that the loop counter should count down to zero rather than counting up to
some arbitrary limit. Then the comparison with zero is free since the result is stored in the
condition flags. Since we are no longer using i as an array index, there is no problem in
counting down rather than up.
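Rewritten with a decrementing counter (the name checksum_v6 is assumed, continuing the numbering):

```c
/* Decrementing loop counter: a single SUBS both decrements i and
   sets the condition flags, so no separate compare is needed. */
int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--) {
        sum += *(data++);
    }
    return sum;
}
```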
Signed and Unsigned Loop Counter

• For an unsigned loop counter i we can use either of the loop continuation conditions i != 0 or
i > 0. As i can't be negative, they are the same condition.
• For a signed loop counter, it is tempting to use the condition i > 0 to continue the loop, but
this costs an extra instruction compared to i != 0.

The compiler is not being inefficient. It must be careful about the case when i = -0x80000000 because
the two sections of code generate different answers in this case. For the first piece of code the SUBS
instruction compares i with 1 and then decrements i. Since -0x80000000 < 1, the loop terminates. For
the second piece of code, we decrement i and then compare with 0. Modulo arithmetic means that i
now has the value +0x7fffffff, which is greater than zero. Thus the loop continues for many iterations.
Of course, in practice, i rarely takes the value -0x80000000.

Therefore you should use the termination condition i!=0 for signed or unsigned loop counters. It saves
one instruction over the condition i>0 for signed i.
Loops Using a Variable Number of Iterations

Now suppose we want our checksum routine to handle packets of arbitrary size. We pass in a variable
N giving the number of words in the data packet.
The checksum_v7 example shows how the compiler handles a for loop with a variable number of
iterations N.
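A sketch of checksum_v7 (the exact listing is not reproduced here):

```c
/* Variable-length checksum. With a for loop the compiler must test
   N on entry, in case the loop body should not run at all. */
int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;

    for (; N != 0; N--) {
        sum += *(data++);
    }
    return sum;
}
```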

Notice that the compiler checks that N is nonzero on entry to the function. Often this check is unnecessary
since you know that the array won't be empty. In this case a do-while loop gives better performance
and code density than a for loop.

The following example shows how to use a do-while loop to remove the test for N being zero that
occurs in a for loop.
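One possible do-while form (the name checksum_v8 is assumed; the caller must guarantee N >= 1):

```c
/* do-while version: the body runs before the test, so the initial
   check of N against zero disappears. Only valid for N >= 1. */
int checksum_v8(int *data, unsigned int N)
{
    int sum = 0;

    do {
        sum += *(data++);
    } while (--N != 0);
    return sum;
}
```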

3.2.1 Loop Unrolling

In a decrement loop, each loop iteration costs two instructions in addition to the body of the loop:
• a subtract to decrement the loop count, and
• a conditional branch.

We call these instructions the loop overhead.

• On ARM7 or ARM9 processors, the subtract takes one cycle and the branch takes three cycles,
giving an overhead of four cycles per loop.
• You can save some of these cycles by unrolling a loop—repeating the loop body several times,
and reducing the number of loop iterations by the same proportion. For example, let’s unroll
our packet checksum example four times.
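One way to write the four-fold unrolled checksum (this sketch assumes N is a nonzero multiple of four):

```c
/* Loop unrolled four times: one subtract and one branch now cover
   four accumulations, quartering the loop overhead. */
int checksum_v9(int *data, unsigned int N)
{
    int sum = 0;

    do {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;            /* N must be a nonzero multiple of 4 */
    } while (N != 0);
    return sum;
}
```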

There are two questions you need to ask when unrolling a loop:

■ How many times should I unroll the loop?


■ What if the number of loop iterations is not a multiple of the unroll amount? For example, what if N
is not a multiple of four in checksum_v9?

To start with the first question, only unroll loops that are important for the overall performance of the
application. Otherwise unrolling will increase the code size with little performance benefit. Unrolling
may even reduce performance by evicting more important code from the cache.

For the second question, try to arrange it so that array sizes are multiples of your unroll amount. If this
isn't possible, then you must add extra code to take care of the leftover cases. This increases the code
size a little but keeps the performance high.
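A sketch of that leftover-case handling (the name checksum_v10 is hypothetical):

```c
/* Unrolled main loop plus a short cleanup loop for the remaining
   0..3 words when N is not a multiple of four. */
int checksum_v10(int *data, unsigned int N)
{
    int sum = 0;

    while (N >= 4) {       /* four words per iteration */
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    }
    while (N != 0) {       /* at most three leftover words */
        sum += *(data++);
        N--;
    }
    return sum;
}
```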
3.3 Register Allocation

The compiler attempts to allocate a processor register to each local variable you use in a C
function. It will try to use the same register for different local variables if the uses of the
variables do not overlap.

When there are more local variables than available registers, the compiler stores the excess
variables on the processor stack. These variables are called spilled or swapped-out variables
since they are written out to memory (in a similar way to how virtual memory is swapped out to disk).

■ Try to limit the number of local variables in the internal loop of functions to 12. The compiler
should be able to allocate these to ARM registers.

3.4 Function Calls

• The ARM Procedure Call Standard (APCS) defines how to pass function arguments and return
values in ARM registers. The more recent ARM-Thumb Procedure Call Standard (ATPCS)
covers ARM and Thumb interworking as well.
• The first four integer arguments are passed in the first four ARM registers: r0, r1, r2, and r3.
Subsequent integer arguments are placed on the full descending stack, ascending in memory as
in Figure 5.1. Function return integer values are passed in r0.
• Two-word arguments such as long long or double are passed in a pair of consecutive argument
registers and returned in r0, r1.

Four-register rule: functions taking four or fewer integer arguments are considerably more efficient to
call, since all of their arguments can be passed in registers r0 to r3.
3.5 POINTER ALIASING

• Two pointers are said to alias when they point to the same address.
• If you write to one pointer, it will affect the value you read from the other pointer.
• In a function, the compiler often doesn’t know which pointers can alias and which pointers
can’t.
• The compiler must be very pessimistic and assume that any write to a pointer may affect the
value read from any other pointer, which can significantly reduce code efficiency.
Note that the compiler loads from *step twice. Usually a compiler optimization called common
subexpression elimination would kick in so that *step was only evaluated once, and the value reused
for the second occurrence. However, the compiler can't use this optimization here: the pointers timer1
and step might alias one another. In other words, the compiler cannot be sure that the write to *timer1
doesn't affect the read from *step.
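The situation described can be sketched like this (timers_v1 follows the text's variable names; timers_v2 shows one common fix):

```c
/* timer1 and step might point to the same word, so the compiler
   must reload *step after the first store. */
void timers_v1(int *timer1, int *timer2, int *step)
{
    *timer1 += *step;      /* may alias *step ...          */
    *timer2 += *step;      /* ... so *step is loaded again */
}

/* Fix: read the step value into a local variable once. Locals
   cannot alias, so only one load is generated. */
void timers_v2(int *timer1, int *timer2, int *step)
{
    int delta = *step;
    *timer1 += delta;
    *timer2 += delta;
}
```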

3.6 Structure Arrangement


The way you lay out a frequently used structure can have a significant impact on its performance and
code density. There are two issues concerning structures on the ARM:
 alignment of the structure entries and
 the overall size of the structure
 ARM compilers will automatically align the start address of a structure to a multiple of the
largest access width used within the structure (usually four or eight bytes) and align entries
within structures to their access width by inserting padding.
For example, consider the structure
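The original structure listing is not reproduced here; a representative example with poor member ordering (field names are illustrative) is:

```c
/* Under typical 32-bit alignment rules the compiler pads this
   structure: 3 bytes after a (to align the int) and 1 byte after c
   (to align the short), giving 12 bytes instead of 8. */
struct badly_ordered {
    char  a;    /* 1 byte + 3 bytes padding */
    int   b;    /* 4 bytes                  */
    char  c;    /* 1 byte + 1 byte padding  */
    short d;    /* 2 bytes                  */
};
```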

The __packed keyword removes this padding; however, packed structures are slow and inefficient to access.


 The compiler emulates unaligned load and store operations by using several aligned accesses with data
operations to merge the results.
Only use the __packed keyword where space is far more important than speed and you can't reduce padding
by rearrangement.
The following rules generate a structure with the elements packed for maximum efficiency:
 Place all 8-bit elements at the start of the structure.
 Place all 16-bit elements next, then 32-bit, then 64-bit.
 Place all arrays and larger elements at the end of the structure.
If the structure is too big for a single instruction to access all the elements, then group the elements into
substructures. The compiler can maintain pointers to the individual substructures.
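As an illustration of these rules, a structure with 8-, 16-, and 32-bit members ordered smallest-first packs with no padding (field names are hypothetical):

```c
/* Ordered 8-bit, then 16-bit, then 32-bit: every member falls on
   its natural alignment, so no padding is inserted and the total
   size is just the sum of the member sizes (8 bytes). */
struct well_ordered {
    char  a;    /* 8-bit members first */
    char  c;
    short d;    /* then 16-bit         */
    int   b;    /* then 32-bit         */
};
```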
