Unit – II – Embedded Computing Platform Design
Syllabus:
The CPU Bus-Memory devices and systems–Designing with computing
platforms – consumer electronics architecture – platform-level performance analysis -
Components for embedded programs- Models of programs- Assembly, linking and
loading – compilation techniques- Program level performance analysis – Software
performance optimization – Program level energy and power analysis and
optimization – Analysis and optimization of program size- Program validation and
testing.
Introduction:
In this chapter, we concentrate on bus-based computer systems created using
microprocessors, I/O devices, and memory components.
The microprocessor is an important element of the embedded computing
system, but it cannot perform any useful operation without memory and I/O devices.
Hardware platforms for embedded systems are therefore built around a
microprocessor together with memory and I/O devices.
CPU BUS:
The bus is the mechanism by which the CPU communicates with memory and
devices.
A bus is, at a minimum, a collection of wires, but the bus also defines a
protocol by which the CPU, memory, and devices communicate.
One of the major roles of the bus is to provide an interface to memory and I/O
devices.
Types of Buses:
1. Data Bus 2. Address Bus
3. Control Bus 4. System Bus
Bus Protocols:
A protocol is a set of rules and conditions that govern data
communication.
The basic building block of most bus protocols is the four-cycle handshake.
The handshake ensures that when two devices want to communicate,
one is ready to transmit and the other is ready to receive.
EC6703 – ERTS Class Notes – Prepared by R.SARAVANAN – AP / ECE - PSNACET Page 1
The handshake uses a pair of wires dedicated to the handshake:
Enq (meaning enquiry)
Ack (meaning acknowledge).
Extra wires are used for the data transmitted during the handshake
Four Cycles of Handshake:
1. Device 1 raises its output to signal an enquiry, which tells device 2 that it
should get ready to listen for data.
2. When device 2 is ready to receive, it raises its output to signal an
acknowledgment. At this point, devices 1 and 2 can transmit or receive.
3. Once the data transfer is complete, device 2 lowers its output, signalling that it
has received the data.
4. After seeing that ack has been released, device 1 lowers its output.
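The four cycles above can be sketched in C as a small software model. This is only an illustration: the `Link` structure and the `handshake_transfer` function are hypothetical names, and real hardware would drive these signals with separate devices rather than one function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two handshake wires plus the data wires. */
typedef struct {
    bool enq;            /* enquiry, driven by device 1 */
    bool ack;            /* acknowledge, driven by device 2 */
    unsigned char data;  /* extra wires carrying the payload */
} Link;

/* Runs one complete four-cycle handshake and returns the byte
   that device 2 latched from the data wires. */
unsigned char handshake_transfer(Link *link, unsigned char value)
{
    unsigned char received;

    /* Both handshake wires must be idle (low) before a transfer starts. */
    assert(!link->enq && !link->ack);

    link->enq = true;       /* cycle 1: device 1 raises enq */
    link->ack = true;       /* cycle 2: device 2 raises ack, ready to receive */

    link->data = value;     /* device 1 drives the data wires */
    received = link->data;  /* device 2 latches the data */

    link->ack = false;      /* cycle 3: device 2 lowers ack, data received */
    link->enq = false;      /* cycle 4: device 1 lowers enq */

    return received;
}
```

After a call such as `handshake_transfer(&link, 0x5A)`, both wires are low again, so the next transfer can start from the same idle state.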
Timing Diagram:
Microprocessor Buses:
Microprocessor buses build on the handshake for communication between
the CPU and other system components.
The term bus is used in two ways.
The most basic use is as a set of related wires.
It also means a protocol for communicating between components.
The fundamental bus operations are reading and writing.
Major Components:
Clock provides synchronization to the bus components,
R/W is true when the bus is reading and false when the bus is writing,
Address is an a-bit bundle of signals that transmits the address for an access,
Data is an n-bit bundle of signals that can carry data to or from the CPU, and
Data ready signals when the values on the data bundle are valid.
Timing Diagram:
The behavior of a bus is most often specified as a timing diagram. A timing
diagram shows how the signals on a bus vary over time.
A's value is known at all times, so it is shown as a standard waveform that
changes between zero and one.
B and C alternate between changing and stable states.
A stable signal has a stable value that could be measured by an oscilloscope,
but its exact value does not matter for the purposes of the timing diagram.
For example, address and data lines can take many possible values, so the
diagram shows only when they are stable, not which value they carry.
State Diagram:
A state diagram for the bus transaction is a helpful complement to the timing diagram.
DMA (Direct Memory Access):
Direct memory access (DMA) is a bus operation that allows reads and writes
not controlled by the CPU.
A DMA transfer is controlled by a DMA controller, which requests control of
the bus from the CPU.
After gaining control, the DMA controller performs read and write operations
directly between devices and memory.
The DMA requires the CPU to provide two additional bus signals:
The bus request is an input to the CPU through which DMA controllers ask for
ownership of the bus.
The bus grant signals that the bus has been granted to the DMA controller.
The DMA controller uses these two signals to gain control of the bus using a
classic four-cycle handshake.
The bus request is asserted by the DMA controller when it wants to control the
bus, and the bus grant is asserted by the CPU when the bus is ready.
The CPU will finish all pending bus transactions before granting control of the
bus to the DMA controller. When it does grant control, it stops driving the
other bus signals: R/W, addresses, and so on.
Once the DMA controller is bus master, it can perform reads and writes using
the same bus protocol as with any CPU-driven bus transaction
After the transaction is finished, the DMA controller returns the bus to the CPU
by deasserting the bus request
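The bus request / bus grant sequence described above can be sketched as a simple software model. This is an illustrative sketch only: `BusArbiter` and `dma_transfer` are hypothetical names, and the CPU's side of the handshake is folded into the same function for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the two extra bus signals used for DMA. */
typedef struct {
    bool bus_request;  /* asserted by the DMA controller */
    bool bus_grant;    /* asserted by the CPU */
} BusArbiter;

/* Copies n bytes from src to dst the way a DMA controller would:
   request the bus, wait for the grant, transfer, release the bus.
   Returns the number of bytes transferred. */
size_t dma_transfer(BusArbiter *bus, unsigned char *dst,
                    const unsigned char *src, size_t n)
{
    size_t i;

    /* The bus must be owned by the CPU before the request. */
    assert(!bus->bus_request && !bus->bus_grant);

    bus->bus_request = true;   /* DMA controller asks for the bus */
    bus->bus_grant = true;     /* CPU finishes pending transactions,
                                  stops driving the bus, and grants it */

    for (i = 0; i < n; i++) {  /* DMA controller is now bus master */
        dst[i] = src[i];       /* ordinary bus read followed by bus write */
    }

    bus->bus_request = false;  /* DMA controller returns the bus */
    bus->bus_grant = false;    /* CPU sees the release and resumes control */

    return i;
}
```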
System Bus Configuration:
A microprocessor system often has more than one bus. High-speed devices
may be connected to a high-performance bus, while lower-speed devices are
connected to a different bus. A small block of logic known as a bridge allows the
buses to connect to each other.
There are several good reasons to use multiple buses and bridges.
Higher-speed buses may provide wider data connections.
A high-speed bus usually requires more expensive circuits and connectors.
The cost of low-speed devices can be held down by using a lower-speed,
lower-cost bus.
The bridge may allow the buses to operate independently, thereby providing
some parallelism in I/O operations
AMBA Bus (Advanced Microcontroller Bus Architecture):
Since the ARM CPU is manufactured by many different vendors, the bus
provided off-chip can vary from chip to chip. ARM has created a separate bus
specification for single-chip systems. The AMBA bus [ARM99A] supports CPUs,
memories, and peripherals integrated in a system-on-silicon.
The AMBA high-performance bus (AHB) is optimized for high-speed
transfers and is directly connected to the CPU. It supports several high-
performance features: pipelining, burst transfers, split transactions and
multiple bus masters.
A bridge can be used to connect the AHB to an AMBA peripherals bus
(APB). This bus is designed to be simple and easy to implement; it also
consumes relatively little power.
The APB assumes that all peripherals act as slaves, simplifying the logic
required in both the peripherals and the bus controller. It also does not perform
pipelined operations, which simplifies the bus logic.
Memory Device Organization:
The most basic way to characterize a memory is by its capacity, such as 256
Mb (megabits). However, manufacturers usually make several versions of a memory of a given
size, each with a different data width.
For example, a 256-Mb memory may be available in two versions:
As a 64M x 4-bit array, a single memory access obtains a 4-bit data item.
As a 32M x 8-bit array, a single memory access obtains an 8-bit data item.
The height/width ratio of a memory is known as its aspect ratio. The best
aspect ratio depends on the amount of memory required.
Internally, the data are stored in a two-dimensional array of memory cells. The
n-bit address received by the chip is split into a row and a column address
(with n =r+ c). The row and column select a particular memory cell.
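The row/column split can be illustrated with a short C sketch. It assumes, purely for illustration, that the row is taken from the upper r bits of the address and the column from the lower c bits; actual memory chips may assign the bits differently. `CellAddress` and `split_address` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Row and column that together select one memory cell. */
typedef struct {
    uint32_t row;
    uint32_t col;
} CellAddress;

/* Splits an n-bit address into an r-bit row (upper bits) and a
   c-bit column (lower bits), where n = r + c. */
CellAddress split_address(uint32_t addr, unsigned row_bits, unsigned col_bits)
{
    CellAddress cell;

    assert(row_bits > 0 && col_bits > 0 && row_bits + col_bits <= 32);

    cell.col = addr & ((UINT32_C(1) << col_bits) - 1u);
    cell.row = (addr >> col_bits) & ((UINT32_C(1) << row_bits) - 1u);
    return cell;
}
```

For example, with r = 5 and c = 5, the 10-bit address 0x2A5 (binary 10101 00101) selects row 21 and column 5.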
Random-Access Memories:
Random-access memories can be both read and written. They are called
random access because, unlike magnetic disks, addresses can be read in any
order
Most bulk memory in modern systems is dynamic RAM (DRAM).
DRAM is very dense; it does, however, require that its values be refreshed
periodically since the values inside the memory cells decay over time
SDRAM Operation
The dominant form of dynamic RAM today is synchronous DRAM
(SDRAM), which uses clocks to improve DRAM performance.
SDRAMs use Row Address Select (RAS) and Column Address Select (CAS)
signals to break the address into two parts, which select the proper row and
column in the RAM array.
SDRAMs use a separate refresh signal to control refreshing
SDRAMs include registers that control the mode in which the SDRAM
operates.
SDRAMs support burst modes that allow several sequential addresses to be
accessed by sending only one address
SIMMs and DIMMs
Memory for PCs is generally purchased as single in-line memory modules
(SIMMs) or dual in-line memory modules (DIMMs).
A SIMM or DIMM is a small circuit board that fits into a standard memory
socket.
Read Only Memory:
Read-only memories (ROMs) are preprogrammed with fixed data. They are also less
sensitive to radiation-induced errors.
Types of ROM:
Flash is the dominant form of field-programmable ROM.
It is electrically erasable but must be erased in blocks.
It is random access, but write and erase are much slower than read.
NOR flash is more flexible.
NAND flash is denser.
Flash memory uses standard system voltages for erasing and programming,
which allows it to be reprogrammed inside a typical system.
Most flash memories today allow certain blocks to be protected.
A common application is to keep the boot-up code in a protected block but
allow updates to other memory blocks on the device. This form of flash is
commonly known as boot-block flash.
Designing With Computing Platforms:
The computing platform of the embedded system application is mainly designed with
System Architecture
Hardware Design
PC as a Platform
Development Environment
Debugging
System Architecture:
Architecture is a set of elements and the relationships between them that
together form a single unit. The architecture of an embedded computing system
is the blueprint for implementing that system.
The architecture of an embedded computing system includes both hardware
and software elements. Some software is very hardware-dependent.
Hardware platform architecture
It contains several elements:
CPU: An embedded computing system clearly contains a microprocessor.
Bus: The bus is an integral part of the microprocessor, so the choice of bus is closely tied to the choice of CPU.
Memory: RAM and ROM are used in the hardware.
I/O devices: Timers, Counters, ADC, DAC, RTC, networking, sensors,
actuators, etc.
Evaluation boards:
Designed by CPU manufacturer or others.
Includes CPU, memory, some I/O devices.
May include prototyping section.
The CPU manufacturer often provides the evaluation board netlist, which can be used as
a starting point for your custom board design.
Hardware and software architectures
Hardware and software are intimately related:
Software doesn’t run without hardware;
How much hardware you need is determined by the software requirements:
Speed;
Memory.
Adding logic to a board:
Programmable logic devices (PLDs) provide low/medium density logic.
Field-programmable gate arrays (FPGAs) provide more logic and multi-level
logic.
Application-specific integrated circuits (ASICs) are manufactured for a single
purpose.
The PC as a platform:
Advantages:
Cheap and easy to get;
Rich and familiar software environment.
Disadvantages:
Requires a lot of hardware resources;
Not well-adapted to real-time.
Typical PC hardware platform
Typical busses:
• PCI (Peripheral Component Interconnect): standard for high-speed interfacing
33 or 66 MHz.
PCI Express.
• USB (Universal Serial Bus) : relatively low-cost serial interface with high
speed.
Software elements
• The IBM PC uses the BIOS (Basic Input/Output System) to implement low-level functions:
Boot-up;
Minimal device drivers.
• BIOS has become a generic term for the lowest-level system software.
Development Environment
Part of the software development is done on a PC or workstation known as the host.
The hardware on which the code will finally run is known as the target.
The host and target are frequently connected by a USB link, but a higher-
speed link such as Ethernet can also be used.
• The host should be able to do the following:
load programs into the target,
start and stop program execution on the target, and
examine memory and CPU registers
Host-based tools:
1. Cross compiler:
Compiles code on the host for the target system.
It runs on one type of machine and generates code for another type of
machine.
After compilation, the code is downloaded to the target system over a serial
line.
2. Cross debugger:
Displays target state, allows target system to be controlled.
Debugging:
Debugging is the process of finding errors in the embedded code and
correcting them, typically under the control of the host system.
Debugging Techniques:
Debugging can be performed on two sides: the software side and the hardware
side.
Many debugging tools are available for both sides.
Types of Software Debugging Tools
Two types of software debugging tools are available:
Serial port tool
Break Point tool
Serial Port Tool:
It is the most important debugging tool.
It can be used for debugging from the earliest stages of the embedded system
design.
This port can be used not only for debugging but also for diagnosing problems
in the field.
Break point Tool:
Another important debugging tool is the breakpoint.
The simplest form of a breakpoint is for the user to specify an address at which
the program’s execution is to break.
Once the PC reaches that address, control is returned to the monitor program.
From the monitor program, the user can examine and/or modify CPU registers,
after which execution can be continued.
Advantage:
Implementing breakpoints does not require using exceptions or external devices.
Types of Hardware Debugging Tools:
When the software tools are insufficient to debug the system, hardware tools
are used.
Microprocessor In circuit Emulators
Logic Analyzer
Microprocessor In-circuit emulators
A microprocessor in-circuit emulator (ICE) is a specialized hardware tool that
helps debug software in a working embedded system.
It allows you to stop execution, examine the CPU state, and modify registers.
It provides as much debugging functionality as a debugger in a monitor
program but does not take up any memory.
Drawbacks:
Specific to a particular microprocessor or microcontroller
Very Expensive
Logic analyzer architecture:
• It can sample many different signals simultaneously but can display only 0, 1, or
changing values for each.
• It records the values of the signals into an internal memory and displays the
results once the memory is full or the run is aborted.
Modes of Logic Analyzer:
The two modes represent different ways of sampling the signal values.
1. State Mode:
It uses the system's own clock to control the sampling, so each signal is
sampled once per clock cycle.
2. Timing Mode:
It uses an internal clock that is fast enough to take several samples per
clock period in a typical system.
Consumer Electronics Architecture
Consumer electronics devices are examples of complex embedded systems and of
the platforms that support them.
Not all devices have all features, depending on the way the device is to
be used, but most devices select features from a common menu.
Similarly, there is no single platform for consumer electronics devices, but
the architectures in use are organized around some common themes.
Consumer Use cases:
1. Multimedia:
The media may be audio, still images or video.
The media are stored in compressed form and decompressed for viewing or playback.
A large and growing number of standards has been developed for multimedia
compression
Eg. MP3, Dolby Digital for audio , JPEG for Images, MPEG-2, MPEG – 4,
H.264 for video
2. Data storage and management
The device stores the user's multimedia objects and keeps track of them.
3. Communication:
Communication may be relatively simple, such as a USB interface to a host, or
more sophisticated, such as an Ethernet port or a cellular telephone link.
Use case for Playing Multimedia
Non-functional requirements for CE
Often battery-operated, with a strict power budget.
E.g., a typical battery for a portable device provides only about 75 mW, which must
support all the processors, the display, and the radio.
Must be very inexpensive yet provide very high performance.
User interface must be capable but inexpensive.
CE devices and hosts
The use case shows a CE device connecting to a host. The connection may be over
USB or over the Internet.
Many devices talk to a host system.
The PC host does things that are hard to do on the device.
Platforms and operating systems:
Many CE devices use a DSP for signal processing and a RISC CPU for other
tasks.
I/O devices include buttons, screen, USB.
Platform-Level Performance Analysis
Bus-based systems add another layer of complication to performance analysis.
Platform-level performance involves much more than the CPU.
The CPU, bus, and memory or I/O devices all act as independent elements that
operate in parallel.
We often focus on the CPU because it processes instructions, but any part of
the system can affect total system performance.
More precisely, the CPU provides an upper bound on performance, but any
other part of the system can slow down the CPU.
Performance depends on all the elements of the system:
CPU.
Cache.
Bus.
Main memory.
I/O device.
Simple System
Consider the simple system as shown in Figure. We want to move data from
memory to the CPU to process it. To get the data from memory to the CPU we must:
read from the memory;
transfer over the bus to the cache; and
transfer from the cache to the CPU
Bandwidth as performance
Bandwidth applies to several components:
Memory.
Bus.
CPU fetches.
Different parts of the system run at different clock rates. Different components
may have different widths (bus, memory).
T: number of bus cycles; P: time per bus cycle.
Total time for transfer: t = TP.
D: data payload length.
O = O1 + O2: overhead.
Bus burst transfer bandwidth
The same definitions of T, P, D, and O apply to a burst transfer.
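The notes give only the symbol definitions, so the following C sketch fills in the cycle counts using the usual textbook formulation. The symbols N (number of data items to move), W (bus width in items per transfer), and B (burst length) and both formulas are assumptions not stated in these notes: T_basic = (D + O)N / W and T_burst = (B*D + O)N / (B*W), with total time t = TP in both cases.

```c
#include <assert.h>

/* Bus cycles for a basic (non-burst) transfer of n_items data items
   over a bus that moves `width` items per transfer, where each
   transfer costs d data cycles plus o overhead cycles. */
double basic_transfer_cycles(double d, double o, double n_items, double width)
{
    assert(width > 0.0);
    return (d + o) * n_items / width;
}

/* Bus cycles when `burst` transfers share one overhead of o cycles. */
double burst_transfer_cycles(double d, double o, double n_items,
                             double width, double burst)
{
    assert(width > 0.0 && burst > 0.0);
    return (burst * d + o) * n_items / (burst * width);
}

/* Total transfer time t = T * P for T bus cycles of period P seconds. */
double transfer_time(double cycles, double period)
{
    return cycles * period;
}
```

With D = 1, O = 3, N = 1024, and W = 1, the basic transfer takes 4096 bus cycles, while a burst of B = 4 takes 1792 cycles because the overhead is paid once per four data cycles.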
Parallelism:
Computer systems have multiple components.
When the hardware and software are properly designed, those systems can
operate independently for at least part of the time.
When different components of the system operate in parallel, we can get more
work done in a given amount of time.
DMA:
Direct memory access is a prime example of parallelism.
DMA was designed to off-load memory transfers from the CPU. The CPU can
do other useful work while the DMA transfer is running
Parallelism speeds things up by running several units at once.
DMA provides parallelism if the CPU doesn't need the bus:
the DMA controller and the bus carry out the transfer,
while the CPU continues computing.
Sequential and parallel schedules in a bus-based system
Components for Embedded Programs:
• In this section, we consider code for three structures or components that are
commonly used in embedded software:
the state machine,
the circular buffer, and
the queue.
State machines are well suited to reactive systems such as user interfaces;
circular buffers and queues are useful in digital signal processing
State Machines:
When inputs appear intermittently rather than as periodic samples, it is often
convenient to think of the system as reacting to those inputs.
The reaction of most systems can be characterized in terms of the input
received and the current state of the system.
This leads naturally to a finite-state machine style of describing the reactive
system’s behavior.
The state machine style of programming is also an efficient implementation of
such computations.
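As an illustration of the state machine style, here is a minimal C sketch of a seat-belt warning controller. The states, inputs, and transitions are assumptions chosen for illustration; they are not specified in these notes. The function computes the next state from the current state and the current inputs, which is the essence of the style.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states of a seat-belt warning controller. */
typedef enum { IDLE, SEATED, BELTED, BUZZER } State;

/* Returns the next state given the current state and the inputs:
   seat          - someone is sitting in the seat
   belt          - the belt is fastened
   timer_expired - the grace period after sitting down has run out */
State next_state(State s, bool seat, bool belt, bool timer_expired)
{
    assert(s == IDLE || s == SEATED || s == BELTED || s == BUZZER);

    switch (s) {
    case IDLE:
        if (seat) return SEATED;          /* person sat down */
        break;
    case SEATED:
        if (belt) return BELTED;          /* belt fastened in time */
        if (timer_expired) return BUZZER; /* too slow: sound the buzzer */
        break;
    case BELTED:
        if (!seat) return IDLE;           /* person left */
        if (!belt) return SEATED;         /* belt released while seated */
        break;
    case BUZZER:
        if (belt) return BELTED;          /* belt fastened: buzzer off */
        if (!seat) return IDLE;           /* person left */
        break;
    }
    return s;                             /* no transition taken */
}
```

A caller would typically hold the current state in a variable and call `next_state` each time an input changes, so the code reacts to inputs rather than polling on a fixed schedule.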
Circular Buffers:
The data stream style makes sense for data that comes in regularly and must be
processed.
For each sample, the filter must emit one output that depends on the values of
the last n inputs.
In a typical workstation application, we would process the samples over a
given interval by reading them all in from a file and then computing the results
all at once in a batch process
The circular buffer is a data structure that lets us handle streaming data in
an efficient way.
At each point in time, the algorithm needs a subset of the data stream that
forms a window into the stream
The window slides with time as we throw out old values no longer needed and
add new values.
Since the size of the window does not change, we can use a fixed-size buffer to
hold the current data
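The sliding window can be sketched in C as follows. This is a minimal illustration, assuming a window of four samples and integer filter coefficients; `CircBuf`, `circ_put`, `circ_get`, and `fir_step` are hypothetical names, not taken from the notes.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW 4  /* assumed window size n */

typedef struct {
    int samples[WINDOW];
    size_t head;  /* slot holding the oldest sample, overwritten next */
} CircBuf;

/* Clears the buffer so that the window initially holds zeros. */
void circ_init(CircBuf *cb)
{
    size_t i;
    for (i = 0; i < WINDOW; i++) cb->samples[i] = 0;
    cb->head = 0;
}

/* Adds a new sample, throwing out the oldest one. */
void circ_put(CircBuf *cb, int sample)
{
    cb->samples[cb->head] = sample;
    cb->head = (cb->head + 1) % WINDOW;
}

/* Returns the sample that arrived k steps ago (k = 0 is the newest). */
int circ_get(const CircBuf *cb, size_t k)
{
    assert(k < WINDOW);
    return cb->samples[(cb->head + WINDOW - 1 - k) % WINDOW];
}

/* One filter step: accepts one input sample and emits one output
   that depends on the last WINDOW inputs. */
int fir_step(CircBuf *cb, const int coeff[WINDOW], int sample)
{
    int sum = 0;
    size_t k;

    circ_put(cb, sample);
    for (k = 0; k < WINDOW; k++) {
        sum += coeff[k] * circ_get(cb, k);
    }
    return sum;
}
```

Because the buffer has a fixed size and only the `head` index moves, no data are copied when the window slides.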
Queues:
Queues are also used in signal processing and event processing.
Queues are used whenever data may arrive and depart at somewhat
unpredictable times or when variable amounts of data may arrive.
A queue is often referred to as an elastic buffer.
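A queue can be built on a fixed-size array, as in the sketch below. The capacity of 8 and the names `Queue`, `queue_enqueue`, and `queue_dequeue` are assumptions made for illustration. Unlike the circular buffer used for a filter window, the number of stored items varies as data arrive and depart.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8  /* assumed maximum number of queued items */

typedef struct {
    int items[QUEUE_CAPACITY];
    size_t head;   /* index of the oldest item */
    size_t count;  /* number of items currently stored */
} Queue;

void queue_init(Queue *q)
{
    q->head = 0;
    q->count = 0;
}

/* Adds a value at the tail; returns false if the queue is full. */
bool queue_enqueue(Queue *q, int value)
{
    if (q->count == QUEUE_CAPACITY) return false;
    q->items[(q->head + q->count) % QUEUE_CAPACITY] = value;
    q->count++;
    return true;
}

/* Removes the oldest value into *value; returns false if empty. */
bool queue_dequeue(Queue *q, int *value)
{
    assert(value != NULL);
    if (q->count == 0) return false;
    *value = q->items[q->head];
    q->head = (q->head + 1) % QUEUE_CAPACITY;
    q->count--;
    return true;
}
```

Returning a status flag instead of blocking lets the caller decide what to do when data arrive faster than they are consumed, or when no data are available yet.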
Models of Programs:
In this section, we develop models for programs that are more general than
source code.
Once we have such a model, we can perform many useful analyses on the
model more easily than we could on the source code. Two such models are:
Data Flow Graph
Control / Data Flow Graph
Data Flow Graph:
A data flow graph is a model of a program with no conditionals.
In a high-level programming language, a code segment with no conditionals
(more precisely, with only one entry and one exit point) is known as a basic block.
The data flow graph describes the minimal ordering requirements on operations.
Single Assignment Form:
Original basic block in C     Single assignment form
w = a + b;                    w = a + b;
x = a - c;                    x1 = a - c;
y = x + d;                    y = x1 + d;
x = a + c;                    x2 = a + c;
z = y + e;                    z = y + e;
Control-data flow graph:
• CDFG: represents control and data. Uses data flow graphs as components.
• Two types of nodes:
Decision;
Data flow.
Data flow node
Encapsulates a data flow graph:
Write operations in basic block form for simplicity.
Control Node:
CDFG Example:
if (cond1) bb1();
else bb2();
bb3();
switch (test1) {
case c1: bb4(); break;
case c2: bb5(); break;
case c3: bb6(); break;
}
Assembly and Linking:
Assembly and linking are the last steps in the compilation process. They turn a
list of instructions into an image of the program’s bits in memory.
Compilers do not directly generate machine code, but instead create the
instruction-level program in the form of human-readable assembly language
The assembler’s job is to translate symbolic assembly language statements into
bit-level representations of instructions known as object code
The assembler takes care of instruction formats and does part of the job of
translating labels into addresses.
The final steps in determining the addresses of instructions and data are
performed by the linker, which produces an executable binary file.
That file may not necessarily be located in the CPU’s memory, however, unless
the linker happens to create the executable directly in RAM.
The program that brings the program into memory for execution is called a
loader
Programs may be composed from several files.
Addresses become more specific during processing:
Relative addresses are measured relative to the start of a module;
Absolute addresses are measured relative to the start of the CPU address
space.
Assemblers:
When translating assembly code into object code,
the assembler must translate opcodes, format the bits in
each instruction, and translate labels into addresses.
Labels make the assembly process more complex, but they are the most
important abstraction provided by the assembler.
Labels:
Label processing requires making two passes through the assembly source code
as follows:
The first pass scans the code to determine the address of each label.
The second pass assembles the instructions using the label values computed in
the first pass
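The two passes can be sketched in C for a toy assembly language. The sketch makes several simplifying assumptions that are not from the notes: every instruction occupies 4 bytes, a label is a line ending in a colon, and the only instruction that uses a label is a branch written as `B <label>`. All function and type names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_LABELS 16
#define INSTR_SIZE 4  /* assumed fixed instruction size in bytes */

typedef struct { char name[32]; int address; } Symbol;
typedef struct { Symbol entries[MAX_LABELS]; int count; } SymbolTable;

/* A line is treated as a label if it ends with ':'. */
static int is_label(const char *line)
{
    size_t len = strlen(line);
    return len > 0 && line[len - 1] == ':';
}

/* Pass 1: scan the code and record the address of each label. */
void pass1(const char *lines[], int n, SymbolTable *table)
{
    int pc = 0;  /* address of the next instruction */
    table->count = 0;
    for (int i = 0; i < n; i++) {
        if (is_label(lines[i])) {
            size_t len = strlen(lines[i]) - 1;  /* drop the ':' */
            assert(table->count < MAX_LABELS);
            assert(len < sizeof table->entries[0].name);
            Symbol *s = &table->entries[table->count++];
            memcpy(s->name, lines[i], len);
            s->name[len] = '\0';
            s->address = pc;
        } else {
            pc += INSTR_SIZE;
        }
    }
}

/* Returns the address of a label, or -1 if it is not defined. */
int lookup(const SymbolTable *table, const char *name)
{
    for (int i = 0; i < table->count; i++) {
        if (strcmp(table->entries[i].name, name) == 0)
            return table->entries[i].address;
    }
    return -1;
}

/* Pass 2: "assemble" each instruction using the label values from
   pass 1. Here that only means resolving branch targets: targets[k]
   gets the target address of instruction k, or -1 if it is not a
   branch. Returns the number of instructions. */
int pass2(const char *lines[], int n, const SymbolTable *table, int targets[])
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        if (is_label(lines[i])) continue;
        if (strncmp(lines[i], "B ", 2) == 0)
            targets[count] = lookup(table, lines[i] + 2);
        else
            targets[count] = -1;
        count++;
    }
    return count;
}
```

The first pass is needed because a branch may refer to a label defined later in the file; its address is not known when the branch is first seen.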
Basic Compilation Techniques:
• It is useful to understand how a high-level language program is translated into
instructions.
• Implementing an embedded computing system often requires
controlling the instruction sequences used to handle interrupts and the
placement of data and instructions in memory, so understanding how the
compiler works helps in deciding when the compiler can be relied on.
Compilation:
• Compilation strategy (Wirth):
Compilation = translation + optimization
• Compiler determines quality of code:
use of CPU resources;
memory access scheduling;
code size.
Compilation begins with high-level language code such as C and generally
produces assembly code.
The high-level language program is parsed to break it into statements and
expressions.
In addition, a symbol table is generated, which includes all the named objects
in the program.
Some compilers may then perform higher-level optimizations that can be
viewed as modifying the high-level language program input without reference
to instructions.
Simplifying arithmetic expressions is one example of a machine-independent
optimization.
Not all compilers do such optimizations, and compilers can vary widely
regarding which combinations of machine-independent optimizations they do
perform.
Instruction-level optimizations are aimed at generating code.
They may work directly on real instructions or on a pseudo-instruction format
that is later mapped onto the instructions of the target CPU.
This level of optimization also helps modularize the compiler by allowing code
generation to create simpler code that is later optimized
Example 1: Arithmetic expressions:
Expression: a*b + 5*(c-d)
Data Flow Graph:
Assembly Language Program
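ADR r4, a         ; get the address of a
LDR r1, [r4]      ; load a
ADR r4, b
LDR r2, [r4]      ; load b
MUL r3, r1, r2    ; r3 = a*b
ADR r4, c
LDR r1, [r4]      ; load c
ADR r4, d
LDR r5, [r4]      ; load d
SUB r6, r1, r5    ; r6 = c-d
MOV r9, #5        ; MUL takes register operands only
MUL r7, r6, r9    ; r7 = 5*(c-d)
ADD r8, r7, r3    ; r8 = a*b + 5*(c-d)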
Example 2: Control code generation:
if (a+b > 0)
x = 5;
else x = 7;
Data Flow Graph:
Assembly Language Program:
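ADR r5,a
LDR r1,[r5]        ; load a
ADR r5,b
LDR r2,[r5]        ; load b
ADD r3,r1,r2       ; r3 = a+b
CMP r3,#0          ; compare a+b with 0
BLE label3         ; if a+b <= 0, take the else branch
MOV r3,#5          ; true block: x = 5
ADR r5,x
STR r3,[r5]
B stmtent          ; skip the else block
label3 MOV r3,#7   ; false block: x = 7
ADR r5,x
STR r3,[r5]
stmtent ...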
Procedure linkage:
Another major code generation problem is the creation of procedures. It needs
the code to:
call and return;
Pass parameters and results.
Procedure stacks are typically built to grow down from high addresses.
A stack pointer (sp) defines the end of the current frame, while a frame pointer
(fp) defines the end of the last frame.
Procedure Stack:
ARM procedure linkage:
• APCS (ARM Procedure Call Standard):
r0-r3 passes parameters into procedure. Extra parameters are put on
stack frame.
r0 holds return value.
r4-r7 hold register variables.
r11 is frame pointer, r13 is stack pointer.
r10 holds limiting address on stack size to check for stack overflows.
Program-Level Performance Analysis:
• Need to understand performance in detail:
Real-time behavior, not just typical.
On complex platforms.
• Program performance ≠ CPU performance:
Pipeline, cache are windows into program.
We must analyze the entire program.
Execution Time:
Execution time is a global property of a program.
The execution time of a program often varies with the input data values.
The cache has a major effect on program performance.
Execution times may vary even at the instruction level.
E.g., floating-point operations are the most sensitive to data values, although the
normal integer execution pipeline can also introduce data-dependent variations.
Program Performance:
Some microprocessor manufacturers supply simulators for their CPUs. The simulator takes as
input an executable for the microprocessor along with input data, and simulates
the execution of that program.
A timer connected to the microprocessor bus can be used to measure
performance of executing sections of code
A logic analyzer can be connected to the microprocessor bus to measure the
start and stop times of a code segment
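The timer-based approach can be approximated on a host machine with the standard C `clock()` function, as in the sketch below. This is only a stand-in: on an embedded target the start and stop values would be read from a hardware timer instead. The workload `sum_to` and the helper `measure_seconds` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical workload: sums the integers 1..n. */
unsigned long sum_to(unsigned long n)
{
    unsigned long total = 0;
    for (unsigned long i = 1; i <= n; i++) {
        total += i;
    }
    return total;
}

/* Measures the processor time used by fn(arg), in seconds, by
   reading the clock before and after the code section. The value
   computed by fn is stored in *result. */
double measure_seconds(unsigned long (*fn)(unsigned long),
                       unsigned long arg, unsigned long *result)
{
    clock_t start, stop;

    assert(fn != NULL && result != NULL);

    start = clock();      /* "start timer" */
    *result = fn(arg);    /* code section being measured */
    stop = clock();       /* "stop timer" */

    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

A single measurement reflects only one input and one cache state, so measurements are normally repeated over several inputs before drawing conclusions about average-case or worst-case behavior.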
Program performance metrics:
Average-case execution time.
Typically used in application programming.
Worst-case execution time.
Clearly important for systems that must meet deadlines, because a component
that takes too long may cause a deadline to be missed.
Best-case execution time.
This measure can be important in multirate real-time systems.
Elements of program performance:
Basic program execution time formula:
execution time = program path + instruction timing
The path is the sequence of instructions executed by the program
The instruction timing is determined based on the sequence of instructions
traced by the program path
Solving these problems independently helps simplify analysis.
Easier to separate on simpler CPUs.
Accurate performance analysis requires:
Assembly/binary code.
Execution platform.
Instruction timing:
Not all instructions take the same amount of time.
Multi-cycle instructions.
Fetches.
Execution times of instructions are not independent.
Pipeline interlocks.
Cache effects.
Execution times may vary with operand value.
Floating-point operations.
Some multi-cycle integer operations.
Example: Data-dependent paths in an if statement
Truth Table (first three columns: the three input values; then the test
outcomes and the assignments executed):
0 0 0 T1=F, T3=F: no assignments
0 0 1 T1=F, T3=T: A4
0 1 0 T1=T, T2=F: A2, A3
0 1 1 T1=T, T2=T: A1, A3
1 0 0 T1=T, T2=F: A2, A3
1 0 1 T1=T, T2=T: A1, A3
1 1 0 T1=T, T2=F: A2, A3
1 1 1 T1=T, T2=T: A1, A3
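The code for this example is not reproduced in these notes. The sketch below
is one if statement that is consistent with the truth table, assuming the three
inputs are named a, b and c, that T1 tests (a || b), and that T2 and T3 both
test c. The function name and the bit flags are hypothetical; the bodies of the
assignments are omitted and only the path is recorded.

```c
#include <assert.h>

/* Bit flags naming the four assignments in the truth table. */
enum { A1 = 1, A2 = 2, A3 = 4, A4 = 8 };

/* Returns the set of assignments executed for inputs a, b, c. */
unsigned assignments_executed(int a, int b, int c)
{
    unsigned path = 0;
    if (a || b) {            /* T1 */
        if (c)               /* T2 */
            path |= A1;      /* assignment A1 would go here */
        else
            path |= A2;      /* assignment A2 would go here */
        path |= A3;          /* A3 runs whenever T1 is true */
    } else {
        if (c)               /* T3 */
            path |= A4;      /* assignment A4 would go here */
    }
    /* A1 and A2 sit on opposite branches of T2, so never both. */
    assert((path & (A1 | A2)) != (A1 | A2));
    return path;
}
```

Different input values select different paths, so the execution time of the
statement depends on the data.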
Measurement-driven performance analysis:
The most direct way to determine the execution time of a program is by
measuring it.
Not as easy as it sounds:
Must actually have access to the CPU.
Must know data inputs that give worst/best case performance.
Must make state visible
Feeding the program:
Need to know the desired input values.
May need to write software scaffolding to generate the input values.
Software scaffolding may also need to examine outputs to generate feedback-
driven inputs.
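Software scaffolding of this kind can be sketched as a driver that generates
inputs, runs the code under test, and records which inputs were cheapest and
most expensive. Everything named below is hypothetical, and the iteration count
returned by the code under test stands in for a measured execution time.

```c
#include <assert.h>

/* Hypothetical code under test: returns how many loop iterations it ran,
   which stands in for execution time in this sketch. */
unsigned shift_loop_iterations(unsigned value)
{
    unsigned iterations = 0;
    while (value != 0) {
        value >>= 1;
        iterations++;
    }
    return iterations;
}

typedef struct {
    unsigned best_input,  best_cost;
    unsigned worst_input, worst_cost;
} InputSearch;

/* Scaffolding: feeds every input in [lo, hi] to the code under test and
   records the first input that gives the best and the worst cost. */
InputSearch search_inputs(unsigned lo, unsigned hi)
{
    assert(lo <= hi);
    unsigned first = shift_loop_iterations(lo);
    InputSearch r = { lo, first, lo, first };
    unsigned v = lo;
    while (v != hi) {
        v++;
        unsigned cost = shift_loop_iterations(v);
        if (cost < r.best_cost)  { r.best_cost = cost;  r.best_input = v; }
        if (cost > r.worst_cost) { r.worst_cost = cost; r.worst_input = v; }
    }
    return r;
}
```

Exhaustive search is feasible only for a small input range; for larger input
spaces the scaffolding has to choose inputs more selectively.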
Trace-driven measurement:
Trace-driven:
Instrument (monitor) the program.
Save information about the path.
Requires modifying the program.
Trace files are large.
Widely used for cache analysis.
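Instrumenting a program can be sketched as inserting calls that save a block
identifier each time a point on the path is reached. The trace buffer, the
trace_point function and the clamp_abs example are all hypothetical names used
for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TRACE_CAPACITY 64

static int    trace_buf[TRACE_CAPACITY];  /* recorded block IDs, in order */
static size_t trace_len = 0;

/* Instrumentation call inserted into the program: saves one path event. */
void trace_point(int block_id)
{
    if (trace_len < TRACE_CAPACITY)
        trace_buf[trace_len++] = block_id;
}

/* Hypothetical program, modified to record the path it follows. */
int clamp_abs(int x, int limit)
{
    assert(limit >= 0);
    trace_point(0);          /* entry */
    if (x < 0) {
        trace_point(1);      /* negative input: take absolute value */
        x = -x;
    }
    if (x > limit) {
        trace_point(2);      /* value clamped to the limit */
        x = limit;
    }
    trace_point(3);          /* exit */
    return x;
}
```

The calls to trace_point are extra instructions, so the instrumented program
does not have exactly the same timing as the original.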
Physical measurement:
In-circuit emulator allows tracing.
Affects execution timing.
Logic analyzer can measure behavior at pins.
Address bus can be analyzed to look for events.
Code can be modified to make events visible.
Particularly important for real-world input streams.
Software Performance Optimization
1. Loop Optimizations:
Loops are important targets for optimization because programs with loops tend
to spend a lot of time executing those loops.
There are three important techniques in optimizing loops:
code motion,
induction variable elimination, and
strength reduction.
Code motion:
Code motion lets us move unnecessary code out of a loop.
If a computation’s result does not depend on operations performed in the loop
body, then we can safely move it out of the loop
Example:
for (i=0; i<N*M; i++)
    z[i] = a[i] + b[i];
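In this loop the product N*M does not depend on anything computed in the loop
body, so it is a candidate for code motion. The sketch below shows the loop
before and after; the function names and the limit variable are hypothetical.

```c
#include <assert.h>

/* Original form: N*M appears in the loop test, which runs every iteration. */
void add_arrays(int *z, const int *a, const int *b, int N, int M)
{
    for (int i = 0; i < N * M; i++)
        z[i] = a[i] + b[i];
}

/* After code motion: the loop-invariant product is computed once. */
void add_arrays_hoisted(int *z, const int *a, const int *b, int N, int M)
{
    assert(N >= 0 && M >= 0);
    const int limit = N * M;      /* moved out of the loop */
    for (int i = 0; i < limit; i++)
        z[i] = a[i] + b[i];
}
```

Many optimizing compilers perform this transformation themselves, so the
hand-written version matters mostly when optimization is limited.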
Induction variable elimination:
An induction variable is a variable whose value is derived from the loop
iteration variable’s value.
The compiler often introduces induction variables to help it implement the loop
Consider the loop:
for (i=0; i<N; i++)
    for (j=0; j<M; j++)
        z[i][j] = b[i][j];
Rather than recomputing i*M+j for each array in each iteration, share one
induction variable between the arrays and increment it at the end of the loop
body.
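A minimal sketch of the shared induction variable, assuming both arrays are
stored as flat row-major buffers of N*M elements. The dimensions, function
names and the idx variable are hypothetical.

```c
#include <assert.h>

enum { N = 3, M = 4 };   /* hypothetical array dimensions */

/* Direct form: i*M + j is recomputed for both arrays in every iteration. */
void copy_indexed(int *z, const int *b)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            z[i * M + j] = b[i * M + j];
}

/* One induction variable, shared by both arrays, replaces i*M + j
   and is incremented at the end of the loop body. */
void copy_shared_index(int *z, const int *b)
{
    assert(z != 0 && b != 0);
    int idx = 0;
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            z[idx] = b[idx];
            idx++;
        }
    }
}
```

Replacing the multiplication in i*M + j with a simple increment is also an
instance of strength reduction.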
2. Cache Optimizations:
Loop nest: a set of loops, one inside another.
Perfect loop nest: no conditionals in nest.
Because loops use large quantities of data, cache conflicts are common.
Example:
for (j = 0; j < M; j++)
    for (i = 0; i < N; i++)
        a[j][i] = b[j][i] * c;
Performance optimization hints:
Use registers efficiently.
Use page mode memory accesses.
Analyze cache behavior:
instruction conflicts can be handled by rewriting code, rescheduling;
conflicting scalar data can easily be moved;
conflicting array data can be moved, padded.
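The effect of padding array data can be illustrated with the address-to-line
mapping of a direct-mapped cache. The cache geometry, the base addresses used
in the usage note and all function names below are assumptions chosen for
illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed direct-mapped cache geometry (hypothetical values). */
#define LINE_BYTES 16u
#define NUM_LINES  256u

/* Cache line that a byte address maps to in a direct-mapped cache. */
unsigned cache_line_index(uint32_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_LINES);
}

/* True when two addresses compete for the same cache line. */
int addresses_conflict(uint32_t addr1, uint32_t addr2)
{
    return cache_line_index(addr1) == cache_line_index(addr2);
}

/* Base address of an array after pad_lines lines of padding precede it. */
uint32_t padded_base(uint32_t base, unsigned pad_lines)
{
    assert(pad_lines < NUM_LINES);
    return base + pad_lines * LINE_BYTES;
}
```

With this geometry, arrays based at 0x1000 and 0x2000 map their first elements
to the same line, so a loop that reads one and writes the other can evict a
line on every iteration. Padding the second array by one line moves it to the
next cache line and removes that conflict.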
Energy/power optimization
Energy: ability to do work.
Most important in battery-powered systems.
Power: energy per unit time.
Important even in wall-plug systems, because power becomes heat.
Opportunities for saving power:
We may be able to replace the algorithms with others that do things in clever
ways that consume less power.
Memory accesses are a major component of power consumption in many
applications.
By optimizing memory accesses we may be able to significantly reduce power.
We may be able to save power by turning off parts of the system, such as
subsystems of the CPU or chips in the system, when we do not need them.
Measuring energy consumption for a piece of code:
Factors that contribute to the energy consumption of a program:
Energy consumption varies somewhat from instruction to instruction.
The sequence of instructions has some influence.
The opcode and the locations of the operands also matter
Cache Behaviour:
Caches are an important factor in energy consumption.
On the one hand, a cache hit saves a costly main memory access,
on the other, the cache itself is relatively power hungry because it is built
from SRAM, not DRAM.
Energy consumption has a sweet spot as cache size changes:
cache too small: program thrashes, burning energy on external memory
accesses;
cache too large: cache itself burns too much power.
Li and Henkel [Li98] measured the influence of caches on energy consumption.
Their results break down the energy consumption of a computer running MPEG (a
video encoder) into several components:
software running on the CPU,
main memory,
data cache and instruction cache
[Figure: Cache sweet spot]
Optimizing for energy:
First-order optimization:
high performance = low energy
Use registers efficiently.
Identify and eliminate cache conflicts.
Moderate loop unrolling eliminates some loop overhead instructions.
Eliminate pipeline stalls.
Inlining procedures may help: it reduces linkage overhead, but may increase
cache thrashing.
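Moderate loop unrolling can be sketched as follows. The function names and the
unrolling factor of 4 are assumptions chosen for illustration; the best factor
depends on the CPU and on code size limits.

```c
#include <assert.h>
#include <stddef.h>

/* Straightforward loop: one compare-and-branch per element. */
long sum_rolled(const int *data, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];
    return sum;
}

/* Unrolled by 4: one compare-and-branch per four elements,
   plus a short cleanup loop for the leftover elements. */
long sum_unrolled4(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        sum += data[i];
        sum += data[i + 1];
        sum += data[i + 2];
        sum += data[i + 3];
    }
    for (; i < n; i++)          /* remaining 0 to 3 elements */
        sum += data[i];
    return sum;
}
```

Unrolling makes the loop body larger, so heavy unrolling can work against the
instruction cache; this is why the hint says moderate.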
Program Validation & Testing:
Complex systems need testing to ensure that they work as they are intended.
But bugs can be subtle, particularly in embedded systems, where specialized
hardware and real-time responsiveness make programming more challenging.
Fortunately, there are many available techniques for software testing that can
help us generate a comprehensive set of tests to ensure that our system works
properly
The two major types of testing strategies:
Black-box Testing: It generates tests without looking at the internal structure
of the program.
Clear-box (also known as white-box) Testing: It generates tests based on the
program structure.
Clear Box Testing:
The control/data flow graph extracted from a program’s source code is an
important tool in developing clear-box tests for the program.
To test the program, we must exercise both its control and data operations.
In order to execute and evaluate these tests, we must be able to control
variables in the program and observe the results of computations
In general, we may need to modify the program to make it more testable.
By adding new inputs and outputs, we can usually substantially reduce the
effort required to find and execute the test.
We must accomplish the following three things in a test:
Provide the program with inputs that exercise the test we are interested in.
Execute the program to perform the test.
Examine the outputs to determine whether the test was successful
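The three steps can be sketched as a small table-driven test driver. The
function under test (sat_add8) and the TestCase type are hypothetical; the
test inputs would be chosen to exercise both outcomes of the comparison inside
the function, which is the clear-box part.

```c
#include <assert.h>

/* Hypothetical function under test: saturating 8-bit addition. */
unsigned sat_add8(unsigned a, unsigned b)
{
    assert(a <= 255u && b <= 255u);
    unsigned sum = a + b;
    return sum > 255u ? 255u : sum;
}

typedef struct {
    unsigned a, b;       /* step 1: inputs that exercise the path of interest */
    unsigned expected;   /* step 3: the output we expect to observe           */
} TestCase;

/* Runs every case and returns the number of failures. */
int run_tests(const TestCase *cases, int count)
{
    int failures = 0;
    for (int i = 0; i < count; i++) {
        unsigned actual = sat_add8(cases[i].a, cases[i].b);  /* step 2 */
        if (actual != cases[i].expected)                     /* step 3 */
            failures++;
    }
    return failures;
}
```

A case such as (1, 2) covers the non-saturating path and a case such as
(200, 100) covers the saturating path.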
Black Box Testing:
Complements clear-box testing.
May require a large number of tests.
Tests software in different ways.
Black-box tests are generated without knowledge of the code being tested.
Tests should be created that provide specified inputs and evaluate whether the
outputs satisfy the specification.
Black-box test vectors:
Random tests.
May weight distribution based on software specification.
Regression tests.
Tests of previous versions, bugs, etc.
May be clear-box tests of previous versions.
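Random and regression test vectors can be sketched as below. The function
under test (average_fast), the specification function used as the oracle, and
the regression vectors are all hypothetical; the vectors are illustrative and
do not come from a real bug history.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical implementation under test: average of two unsigned values
   computed without overflowing. */
unsigned average_fast(unsigned a, unsigned b)
{
    return (a & b) + ((a ^ b) >> 1);
}

/* Specification used as the oracle: (a + b) / 2 in wider arithmetic. */
unsigned average_spec(unsigned a, unsigned b)
{
    return (unsigned)(((unsigned long long)a + b) / 2);
}

/* Random tests: returns the number of mismatches against the specification. */
int run_random_tests(unsigned seed, int count)
{
    assert(count >= 0);
    int mismatches = 0;
    srand(seed);
    for (int i = 0; i < count; i++) {
        unsigned a = (unsigned)rand();
        unsigned b = (unsigned)rand();
        if (average_fast(a, b) != average_spec(a, b))
            mismatches++;
    }
    return mismatches;
}

/* Regression tests: fixed vectors kept from earlier versions and past bugs. */
int run_regression_tests(void)
{
    static const unsigned vectors[][2] = {
        { 0u, 0u }, { 1u, 2u }, { UINT_MAX, UINT_MAX }
    };
    int mismatches = 0;
    for (size_t i = 0; i < sizeof vectors / sizeof vectors[0]; i++)
        if (average_fast(vectors[i][0], vectors[i][1]) !=
            average_spec(vectors[i][0], vectors[i][1]))
            mismatches++;
    return mismatches;
}
```

The random tests here draw inputs uniformly from the range of rand(); a
weighted distribution based on the software specification would replace the
two rand() calls.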