Embedded Computer: Memory System, Input/Output
Ingo Sander ingo@[Link]
September 4, 2007
IL2206 Embedded Systems
Outline
Memory System
  Types of memory
  Caches
Input/Output
  Memory Mapped I/O
  Polling
  Interrupt
Memory System
The memory bottleneck
Most instructions in a RISC processor can execute in a single clock cycle, BUT access to the main memory (typically SDRAM) is slow.
If the memory access time could be shortened, the system would perform considerably better.
Memory Performance
Memory Bandwidth
  Rate at which information can be transferred from the memory system
Latency
  Time between the following two instants:
    the instant at which the processor issues a request to the memory
    the instant at which the requested data arrives and is available for use by the processor
Memory Bandwidth
If R is the number of requests (in the example: bits) that the memory can serve simultaneously and L is the latency, then BW = R/L
Example:
  A 32-bit memory with a latency of 20 ns has a bandwidth BW = 32 bit / 20 ns = 1.6 Gbit/s = 200 MByte/s
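The bandwidth example can be checked with a few lines of C. This is only a sketch: the function names are hypothetical, and integer units (Mbit/s and MByte/s) are chosen to avoid floating point.

```c
#include <assert.h>
#include <stdint.h>

/* BW = R / L for a memory that delivers width_bits per access with a
   latency of latency_ns. Bits per nanosecond equal Gbit/s, so the
   factor 1000 yields Mbit/s. */
uint32_t bandwidth_mbit_per_s(uint32_t width_bits, uint32_t latency_ns)
{
    assert(latency_ns > 0);              /* a latency of zero is meaningless */
    return (width_bits * 1000u) / latency_ns;
}

/* The same bandwidth expressed in MByte/s (8 bits per byte). */
uint32_t bandwidth_mbyte_per_s(uint32_t width_bits, uint32_t latency_ns)
{
    return bandwidth_mbit_per_s(width_bits, latency_ns) / 8u;
}
```

Calling both functions with 32 bits and 20 ns reproduces the two figures of the example, 1.6 Gbit/s and 200 MByte/s.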
Types of memory
ROM (Read Only Memory)
  Mask-programmable
  Flash programmable (can be reprogrammed, but has long access times)
RAM (Random Access Memory)
  DRAM
  SRAM
SRAM vs. DRAM
SRAM (Static RAM)
  Faster
  Easier to integrate with logic
  Higher power consumption
DRAM (Dynamic RAM)
  Denser
  Must be refreshed
Synchronous DRAM
Clock signal is used internally to pipeline accesses
Memory must be fast enough to respond to a request
Request takes multiple clock cycles
Provides burst mode access:
  1, 2, 4, 8 locations
Flash issues
Flash is programmed at system voltages
Erasure time is long
Must be erased in blocks
Limited number of erasures
A Flash memory is very useful in combination with SRAM or SDRAM devices, since it can load these devices at power-on
Memory Access Times and Costs
Memory Technology | Typical Access Time          | $ per GB in 2004
SRAM              | 0.5 ns - 5 ns                | $4000 - $10000
DRAM              | 50 ns - 70 ns                | $100 - $200
Magnetic disk     | 5,000,000 ns - 20,000,000 ns | $0.5 - $2
Source: Patterson and Hennessy, 2004
Embedded system memories
Large fast memories are very expensive
Embedded systems have to be produced at a low cost
  a single SRAM main memory is in general too expensive
  a combination of fast and slow memories is often still feasible
Caches
Large fast memories are too expensive, but small fast memories are feasible.
A cache memory is a small but fast memory that is located near the CPU to reduce memory access times.
Ideally, the processor only needs to access the cache and not the main memory.
Memory is a bottleneck
While the CPU is fast, each memory access takes a long time and slows down the system.
Caches can increase the performance if most memory requests do not need to access the main memory.
[Figure: two configurations: a fast CPU and a very slow memory connected by a slow bus; the same system with a fast cache added]
Caches and CPUs
[Figure: CPU, cache controller, cache and main memory, connected by address and data lines]
Cache operation
Many main memory locations are mapped onto one cache entry.
May have caches for:
  instructions;
  data;
  data + instructions (unified).
Memory access time is no longer deterministic!
Source: Wolf, 2000 (Morgan Kaufmann)
Terms
Cache hit: required location is in cache.
Cache miss: required location is not in cache.
Working set: set of locations used by program in a time interval.
Types of misses
Compulsory (cold): location has never been accessed.
Capacity: working set is too large.
Conflict: multiple locations in working set map to same cache entry.
Memory system performance
h = cache hit rate.
t_cache = cache access time, t_main = main memory access time.
Average memory access time:
  t_av = h * t_cache + (1 - h) * t_main
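A minimal sketch of the formula in C. The function name is hypothetical, and the sample values in the note below (5 ns cache, 70 ns main memory) are assumptions picked from the access-time ranges of the table above.

```c
#include <assert.h>

/* Average memory access time: t_av = h * t_cache + (1 - h) * t_main.
   Times are in nanoseconds; hit_rate is a fraction between 0 and 1. */
double average_access_time_ns(double hit_rate, double t_cache_ns, double t_main_ns)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * t_cache_ns + (1.0 - hit_rate) * t_main_ns;
}
```

With a hit rate of 0.9 the result is about 11.5 ns, with a hit rate of 0.99 about 5.65 ns: the average access time is very sensitive to the hit rate.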
Write operations
Write-through: immediately copy write to main memory
  Causes unnecessary memory communication
  Memory always has a valid copy of the cache block
Write-back: write to main memory only when location is removed from cache
  Tries to minimize communication with memory
  Memory may have an invalid copy of the cache block, which must be updated when the cache block is replaced
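The difference in memory traffic can be illustrated with a toy model of a single cache line. Everything here (type and function names, the idea of counting writes) is a hypothetical sketch, not a model of a real cache controller.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one cache line. */
typedef struct {
    bool     dirty;        /* write-back only: line differs from memory */
    unsigned mem_writes;   /* writes that reached main memory           */
} line_model;

/* Write-through: every store is copied to main memory at once. */
void store_write_through(line_model *l) { l->mem_writes++; }

/* Write-back: a store only marks the line as modified (dirty). */
void store_write_back(line_model *l) { l->dirty = true; }

/* Replacement: a dirty line must be written back to memory first. */
void evict(line_model *l)
{
    assert(l);
    if (l->dirty) {
        l->mem_writes++;
        l->dirty = false;
    }
}

/* Main-memory writes caused by n stores to one cached block,
   followed by the replacement of that block. */
unsigned writes_for_n_stores(unsigned n, bool write_back)
{
    line_model l = { false, 0 };
    for (unsigned i = 0; i < n; i++) {
        if (write_back)
            store_write_back(&l);
        else
            store_write_through(&l);
    }
    evict(&l);
    return l.mem_writes;
}
```

Ten stores to the same block cost ten memory writes with write-through, but only one with write-back, paid at replacement time.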
Replacement
Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location.
Two popular strategies:
  Random.
  Least-recently used (LRU).
In a write-back cache, replacing a modified (dirty) cache entry also means writing its contents back to memory. Thus a cache miss can be expensive!
Cache performance benefits
Keep frequently-accessed locations in fast cache.
Cache retrieves more than one word at a time.
  Sequential accesses are faster after first access.
Data Transfer to Cache
Words are transferred between cache and processor.
Blocks (of multiple words, given by the block size) are transferred between cache and memory.
[Figure: CPU and cache exchange words (word transfer); cache and main memory exchange blocks (block transfer)]
Cache organizations
Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented).
Direct-mapped: each memory location maps onto exactly one cache entry.
N-way set-associative: each memory location can go into one of N entries.
Direct-mapped cache
A direct-mapped cache consists of several cache lines, where each cache line has a status bit, a tag and data (cache block).
There is a given mapping for each memory location!
[Figure: cache lines 0 to 7, each with a status bit, a tag and a cache block of four words (Wd 0 to Wd 3), next to a main memory divided into four-word blocks (Block 1, Block 2, ...) at memory addresses 0, 10, 20, ..., FF0]
Example Direct Mapped Cache
Cache has 2 KBytes (512 words), organized as 64 cache lines with a block size of 8 words.
Memory has 64 KBytes (16 KWords), which can be seen as 2048 blocks of 8 words.
Address size is 16 bits.
The direct-mapped technique uses the modulo (remainder) operation to map a memory block onto a cache block:
  Blocks 0, 64, 128, ... are mapped onto block 0 in the cache
  Blocks 1, 65, 129, ... are mapped onto block 1 in the cache
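Example Direct Mapped Cache
Memory address (16 bits): Tag (5 bits) | Block (6 bits) | Word (3 bits) | Byte Offset (2 bits)
[Figure: main memory blocks 0 (address 0x0000), 1 (0x0020), ..., 63, 64, 65, ..., 127, ..., 2047 (0xFFE0) are mapped onto cache lines 0 to 63. Each cache line holds 1 valid bit, a 5-bit tag and 32 bytes of data (8 words). A block has 8 words.]
Direct-mapped cache
[Figure: a cache line holds a valid bit, a tag (e.g. 0xabcd) and data bytes; the address is split into tag, index and offset; the stored tag is compared (=) with the address tag to produce the hit signal, and the offset selects the value (byte, or halfword/word)]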
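The address split of the direct-mapped example (5 tag bits, 6 block bits, 3 word bits, 2 byte-offset bits) can be sketched in C as follows. The type and function names are hypothetical; the field widths are the ones given above.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a 16-bit address in the direct-mapped example. */
typedef struct {
    unsigned tag;    /* compared with the tag stored in the cache line */
    unsigned line;   /* cache line 0..63                               */
    unsigned word;   /* word 0..7 within the block                     */
    unsigned byte;   /* byte 0..3 within the word                      */
} dm_fields;

/* Number (0..2047) of the 32-byte memory block that holds the address. */
unsigned block_number(uint16_t addr)
{
    return (unsigned)addr >> 5;
}

dm_fields split_address(uint16_t addr)
{
    dm_fields f;
    f.byte = addr & 0x3u;
    f.word = (addr >> 2) & 0x7u;
    f.line = (addr >> 5) & 0x3Fu;
    f.tag  = (addr >> 11) & 0x1Fu;
    /* The 6-bit line field is the block number modulo 64. */
    assert(f.line == block_number(addr) % 64u);
    return f;
}
```

For address 0x0820 this gives block 65, line 1 and tag 1, so block 65 competes with block 1 (address 0x0020, tag 0) for the same cache line.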
Direct-mapped cache locations
Many locations map onto the same cache block.
Conflict misses are easy to generate:
  Array a[] uses locations 0, 1, 2, ...
  Array b[] uses locations 1024, 1025, 1026, ...
  Operation a[i] + b[i] generates conflict misses.
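The conflict misses can be reproduced with a small simulation. This is a sketch under assumptions: it uses the geometry of the example cache (64 lines, 8 words per block), treats the "locations" as word addresses, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_LINES       64u   /* cache lines of the example cache */
#define WORDS_PER_BLOCK  8u   /* block size in words              */

typedef struct {
    bool     valid[NUM_LINES];
    unsigned tag[NUM_LINES];
} dm_cache;

/* Access one word location; returns true on a miss and loads the block. */
bool dm_access(dm_cache *c, unsigned word_addr)
{
    unsigned block = word_addr / WORDS_PER_BLOCK;
    unsigned line  = block % NUM_LINES;    /* modulo mapping */
    unsigned tag   = block / NUM_LINES;

    if (c->valid[line] && c->tag[line] == tag)
        return false;                      /* hit */
    c->valid[line] = true;                 /* miss: replace the line */
    c->tag[line]   = tag;
    return true;
}

/* Count the misses of a[i] + b[i] for i = 0..n-1, with the arrays
   starting at word locations a_base and b_base. */
unsigned count_misses(unsigned a_base, unsigned b_base, unsigned n)
{
    dm_cache c = { { false }, { 0 } };
    unsigned misses = 0;

    assert(n <= NUM_LINES * WORDS_PER_BLOCK);   /* arrays fit into the cache */
    for (unsigned i = 0; i < n; i++) {
        if (dm_access(&c, a_base + i)) misses++;
        if (dm_access(&c, b_base + i)) misses++;
    }
    return misses;
}
```

With a[] at location 0 and b[] at location 1024, both arrays always use the same cache line and evict each other, so every single access misses. Moving b[] by one block (to location 1032) leaves only the two compulsory misses per block.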
Example 2-way set-associative cache
Memory address (16 bits): Tag (6 bits) | Set (5 bits) | Offset (5 bits)
[Figure: main memory blocks 0, 1, ..., 31, 32, 33, ..., 127, ... are mapped onto sets 0 to 31 of the cache. Each set has two ways (Way 0 and Way 1); each way holds 1 valid bit, a 6-bit tag and 32 bytes of data (8 words). A block has 8 words.]
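A sketch of the 2-way organization with LRU replacement, using the geometry of the example (32 sets, 2 ways, 8 words per block). The names are hypothetical, addresses are word locations, and the single LRU index per set only works because there are exactly two ways.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_SETS        32u
#define NUM_WAYS         2u
#define WORDS_PER_BLOCK  8u

typedef struct {
    bool     valid[NUM_SETS][NUM_WAYS];
    unsigned tag[NUM_SETS][NUM_WAYS];
    unsigned lru[NUM_SETS];    /* way that was least recently used */
} sa_cache;

/* Access one word location; returns true on a miss and loads the block. */
bool sa_access(sa_cache *c, unsigned word_addr)
{
    unsigned block = word_addr / WORDS_PER_BLOCK;
    unsigned set   = block % NUM_SETS;
    unsigned tag   = block / NUM_SETS;

    assert(NUM_WAYS == 2u);                /* "1 - way" relies on two ways */
    for (unsigned way = 0; way < NUM_WAYS; way++) {
        if (c->valid[set][way] && c->tag[set][way] == tag) {
            c->lru[set] = 1u - way;        /* the other way is now LRU */
            return false;                  /* hit */
        }
    }
    unsigned victim = c->lru[set];         /* miss: replace the LRU way */
    c->valid[set][victim] = true;
    c->tag[set][victim]   = tag;
    c->lru[set]           = 1u - victim;
    return true;
}

/* Count the misses of a[i] + b[i] for i = 0..n-1. */
unsigned count_misses_2way(unsigned a_base, unsigned b_base, unsigned n)
{
    sa_cache c = { 0 };
    unsigned misses = 0;

    for (unsigned i = 0; i < n; i++) {
        if (sa_access(&c, a_base + i)) misses++;
        if (sa_access(&c, b_base + i)) misses++;
    }
    return misses;
}
```

Arrays at locations 0 and 1024 still map to the same set, but each now occupies its own way, so only the compulsory misses remain.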
Set-Associative Caches
[Figure: organizations of a cache with eight blocks, each block holding a tag and data:
  One-way set associative (direct-mapped): blocks (sets) 0 to 7, 1 element per set
  Two-way set associative: sets 0 to 3, 2 elements per set
  Eight-way set associative (fully associative): 8 elements per set]
Fully associative cache
There is complete freedom where to place a block in the cache.
But all blocks have to be searched for the correct tag pattern.
In order to achieve acceptable performance, the tags must be searched in parallel.
Example caches
StrongARM:
  16 Kbyte, 32-way, 32-byte block instruction cache.
  16 Kbyte, 32-way, 32-byte block data cache (write-back).
Nios II:
  512 Bytes to 64 KBytes direct-mapped I- and D-cache with a cache block size of 4 (D), 16 (D) or 32 (I & D) Bytes
Summary Memory Systems
Memory is a bottleneck in the system
Different memories exist
  Cost increases with memory performance
A cache memory can significantly decrease execution time at low cost
  Execution time is very hard to predict
    Problem for design of real-time systems
  Locality is important to utilize caches efficiently
  There can be several levels of different caches
    Embedded systems usually have only one cache level
Input/Output
Input and Output Devices
Input/output devices are used to communicate with the environment.
An example is a UART (Universal Asynchronous Receiver/Transmitter).
These devices (like other peripheral devices) are controlled by reading and writing to registers.
[Figure: an I/O device with a status register, a mode register and a data register; register select, data bus and control signals on one side, input and output on the other]
Serial communication
Characters are transmitted separately
[Figure: signal over time: no char | start | bit 0 | bit 1 | ... | bit n-1 | stop]
Universal Asynchronous Receiver/Transmitter (UART)
Component for serial-to-parallel conversion
Has a serial receiver/transmitter
Many parameters can be configured:
  Baud rate
  Number of bits per character
  Parity bits
  Length of stop bit
Memory-Mapped I/O
Peripheral components can be connected to the processor by memory-mapped I/O.
The components are reached through a dedicated range of addresses in the processor's address space.
Memory-mapped I/O requires extra hardware for address decoding.
[Figure: the CPU address bus feeds a decoder that drives the chip enable of the peripheral; other address lines drive register select; read/write and the data bus connect CPU and peripheral; the peripheral provides the interface to the environment]
Memory-Mapped I/O
The chip-enable output of the decoder has to be active when the input of the decoder is a correct address.
Other address bits are used for register select.
The decoder can be implemented with a small block of programmable logic or custom hardware (VHDL).
Example Memory-Mapped I/O
A device with eight 8-bit registers (R0 to R7) shall be connected at address 0x1000.
[Figure: address bus (ADR31-ADR0) and data bus (D31-D0, of which D7-D0 are connected to the device). ADR31-ADR3 feed the decoder, which drives chip enable (CE); CE is active when ADR12 = 1 and all other decoder inputs are 0! ADR2, ADR1 and ADR0 drive register select RS2, RS1 and RS0. For address 0x00001002: CE = 1 and RS2 RS1 RS0 = 0 1 0.]
The registers can now be accessed in the address range 0x1000 (R0) to 0x1007 (R7):
  movia r1, 0x1002
  movi  r3, 0x08
  stb   r3, 0(r1)    # sets bit 3 and clears all other bits of device register R2
Accessing Memory Locations in C
Symbolic names can be defined for memory locations:
  #define MEM_LOCATION 0x18
Functions can be defined to access memory:
  peek can be used to read a memory location (byte)
    char peek(char *location) { return *location; }
  poke can be used to write to a memory location (byte)
    void poke(char *location, char newval) { *location = newval; }
Don't do this!
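The address decoding of the memory-mapped I/O example (device at 0x1000, decoder on ADR31-ADR3, register select on ADR2-ADR0) can be modelled on a host computer as follows. This is a sketch with hypothetical names; on the real system the decoder is hardware and the store is done by the processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEVICE_BASE 0x00001000u   /* R0..R7 at 0x1000..0x1007 */

/* Decoder: compares ADR31-ADR3 with the device base address,
   i.e. chip enable is active when only ADR12 is set. */
bool chip_enable(uint32_t addr)
{
    return (addr >> 3) == (DEVICE_BASE >> 3);
}

/* Register select: ADR2-ADR0 drive RS2-RS0. */
unsigned register_select(uint32_t addr)
{
    return addr & 0x7u;
}

/* Host-side model of the device: eight 8-bit registers. */
typedef struct {
    uint8_t reg[8];
} device_model;

/* Model of a byte store on the bus; returns false if the device
   is not addressed and therefore ignores the access. */
bool bus_store_byte(device_model *dev, uint32_t addr, uint8_t value)
{
    assert(dev);
    if (!chip_enable(addr))
        return false;
    dev->reg[register_select(addr)] = value;
    return true;
}
```

A store of 0x08 to address 0x1002 ends up in register R2 of the model, while a store to 0x1008 or 0x2002 is ignored because the decoder does not assert chip enable.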
Memory locations shouldn't be accessed directly!
Software shall be flexible
  Hardware could change
Programmers may make mistakes that the compiler would not make (e.g. memory alignment)
HAL (Hardware Abstraction Layer) offers optimized device drivers to access peripheral devices and memory
Busy Wait I/O
Busy-wait I/O is the most basic way to communicate with an I/O device.
The processor waits until the I/O device has completed its current task.
Disadvantage: the processor cannot be used for other tasks during the waiting period!
This method is also often called polling!
Example: Sending a string via a serial link
Busy Wait I/O Pseudo Code:
  Characters = String;
  While not all characters sent
    Send next character;
    While Sender = Busy
      Wait;
  Done!
C-Programming Testing of Bits
In order to test specific bits, the other bits need to be masked out.
Example: Busy Flag (BF): Busy = 1; Non-Busy = 0
[Figure: 8-bit Sender Status register at address 0x1000 with the busy flag BF in bit 5; Sender Buffer register at address 0x1001]
  #define Status  ((char *) 0x1000)
  #define SendBuf ((char *) 0x1001)
  char *myString = "Hello World";
  char *current_char;

  current_char = myString;
  while (*current_char != '\0') {
    poke(SendBuf, *current_char++);
    /* Mask needed, since other bits in status register may not be zero */
    while ((peek(Status) & 0x20) != 0)
      ;
  }
Here you should use HAL functions!
Simultaneous busy/wait input and output
Example: Copying characters from input to output
Busy Wait I/O Pseudo Code:
  Loop
    While inBuffer busy
      Wait;
    Read Character
    Copy Character to Output Buffer
    Send Character
    While outBuffer busy
      Wait;
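The bit test and the copy loop can be tried out on a host with simulated registers. This is only a sketch: the register layout io_regs and the function names are hypothetical, and on the target the HAL functions should be used instead. Since nothing changes the simulated status registers, the loops only terminate when the busy flags are clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BUSY_FLAG 0x20u   /* BF is bit 5 of the status register */

/* True if the busy flag is set, whatever the other status bits are. */
bool is_busy(uint8_t status)
{
    return (status & BUSY_FLAG) != 0;
}

/* Simulated registers of one device; on real hardware these would be
   memory-mapped registers, here they are ordinary variables. */
typedef struct {
    volatile uint8_t status;
    volatile uint8_t data;
} io_regs;

/* Copy n characters from the input device to the output device. */
size_t copy_chars(io_regs *in, io_regs *out, size_t n)
{
    size_t copied = 0;

    assert(in && out);
    while (copied < n) {
        while (is_busy(in->status))
            ;                      /* wait for the input buffer */
        uint8_t c = in->data;      /* read character */
        out->data = c;             /* copy it to the output buffer */
        while (is_busy(out->status))
            ;                      /* wait until the sender is done */
        copied++;
    }
    return copied;
}
```

A status value of 0xDF (all bits set except BF) counts as not busy, which a plain comparison with zero would get wrong.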
Interrupt I/O
Busy/wait is very inefficient.
  CPU can't do other work while testing device.
  Hard to do simultaneous I/O.
Interrupts allow a device to change the flow of control in the CPU.
  Causes subroutine call to handle device.
Interrupt Scheme
[Figure: the device sends an interrupt request to the CPU; the CPU answers with an interrupt acknowledge; data/address lines connect CPU and device]
Interrupt physical interface
CPU and device are connected by the CPU bus.
CPU and device handshake:
  device asserts interrupt request;
  CPU asserts interrupt acknowledge when it can handle the interrupt.
Interrupt behavior
Based on subroutine call mechanism.
Interrupt forces next instruction to be a subroutine call to a predetermined location.
  Return address is saved to resume executing foreground program.
Programming Interrupt
[Figure: the foreground program does something until an interrupt event occurs; the interrupt vector branches to the interrupt handler; the interrupt handler saves registers, handles the interrupt, restores registers, restores the PC and clears the interrupt disable flag]
Receive-Send with Polling
Assume a program that, as part of its duties, receives characters and sends them on to another device.
Solution with polling:
  loop
    Wait for new character;
    Do something;
    Send character;
  end loop;
System cannot do anything else while it waits for a new character or until the sender is ready.
System resources are utilized very inefficiently!
Better Receive-Send Implementation with Interrupt
Parallelization of duties:
  Wait for new character (interrupt)
    If a character is received, it is stored in a buffer
  Do something (foreground program)
    Work with the stored buffer elements
  Send character if transmitter ready (interrupt)
    Check if transmitter is ready and send the first character of the buffer
Better Receive-Send Implementation with Interrupt
System can do other things while waiting for receiver or sender.
Buffer is needed to store elements.
Size of buffer must be chosen carefully:
  too small => buffer overflow
  too large => too expensive design
Typical Embedded Design Problems
Embedded systems are inherently parallel (concurrent), since they interact with a heterogeneous environment.
  Parallelization allows for faster processing, since work can be done in parallel
  Waiting times can be avoided
The need for buffers is a logical consequence of parallelization.
  System designer needs to find the right amount of parallelization and the right buffer size!
Send-Receive with Circular Buffer (Wolf)
Independent receive and send, realized by two interrupt routines:
  Receive-interrupt routine
    Puts a character into the queue
  Send-interrupt routine
    Sends a character when the sender is ready
[Figure: queue with head and tail pointers]
Send-Receive with Circular Buffer (Wolf)
A circular buffer can be realised in memory with one pointer for the head and one for the tail.
If a pointer is at the end of the buffer, its next position is the start of the buffer.
[Figure: circular buffer containing the elements i, f, g, h, with head and tail pointers]
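A circular buffer with head and tail indices can be sketched in C as below. The names, the buffer size and the convention (head = next slot to write, tail = next slot to read, one slot kept free to tell "full" from "empty") are assumptions for illustration. In a real system, where the routines run in interrupt context, access to head and tail also has to be protected, which the sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>

#define QUEUE_SIZE 8u   /* hypothetical size; one slot always stays free */

typedef struct {
    char     buf[QUEUE_SIZE];
    unsigned head;   /* next slot to write (receive routine) */
    unsigned tail;   /* next slot to read (send routine)     */
} char_queue;

bool queue_empty(const char_queue *q)
{
    return q->head == q->tail;
}

bool queue_full(const char_queue *q)
{
    return (q->head + 1u) % QUEUE_SIZE == q->tail;
}

/* For the receive-interrupt routine; returns false on buffer overflow. */
bool queue_put(char_queue *q, char c)
{
    assert(q);
    if (queue_full(q))
        return false;
    q->buf[q->head] = c;
    q->head = (q->head + 1u) % QUEUE_SIZE;   /* wrap around to the start */
    return true;
}

/* For the send-interrupt routine; returns false if nothing is queued. */
bool queue_get(char_queue *q, char *c)
{
    assert(q && c);
    if (queue_empty(q))
        return false;
    *c = q->buf[q->tail];
    q->tail = (q->tail + 1u) % QUEUE_SIZE;   /* wrap around to the start */
    return true;
}
```

The modulo operation implements the wrap-around: a pointer at the end of the buffer continues at the start. A failing queue_put corresponds to the buffer overflow of a buffer that was chosen too small.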
Send-Receive sequence diagram (Wolf)
[Figure: sequence diagram with the lifelines :foreground, :input, :output and :queue; the queue contents change over time: empty, a, empty, b, bc, c]
Debugging interrupt code
What if you forget to save and restore registers?
  Foreground program can exhibit mysterious bugs.
  Bugs will be hard to repeat: they depend on interrupt timing.
It is difficult to debug an interrupt routine!
Prioritized Interrupts
Some CPUs (such as Nios II) support several interrupt levels in hardware.
Otherwise, extra hardware (a priority decoder) can be used to create several levels of interrupt.
Interrupt prioritization
Masking: an interrupt with priority lower than the current priority is not recognized until the pending interrupt is complete.
Non-maskable interrupt (NMI): highest priority, never masked.
  Often used for power-down.
Example: Prioritized I/O
[Figure: sequence diagram with the lifelines :interrupts, :foreground, :A, :B and :C; the interrupts arrive in the order B, C, A, and then A and B together]
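The masking rule can be sketched as a small selection function. The encoding is an assumption made for illustration: eight levels, a higher number means a higher priority, and a request of the same priority as the running handler is treated as masked. The NMI is not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_LEVELS 8   /* hypothetical: levels 0 (lowest) to 7 (highest) */

/* pending: bit n is set if an interrupt of priority n is requested.
   current: priority of the handler that is running, or -1 if the
            foreground program is running.
   Returns the priority to service next, or -1 if every pending
   request is masked by the current priority. */
int next_interrupt(uint8_t pending, int current)
{
    assert(current >= -1 && current < NUM_LEVELS);
    for (int level = NUM_LEVELS - 1; level > current; level--) {
        if (pending & (1u << level))
            return level;
    }
    return -1;
}
```

With requests of priority 0 and 2 pending, the foreground program is interrupted by priority 2; while that handler runs, the priority 0 request stays masked until the handler completes.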
Sources of interrupt overhead
Handler execution time
Interrupt mechanism overhead
Register save/restore
Pipeline-related penalties
Cache-related penalties
Summary
Peripherals can be made accessible to software by memory-mapped I/O.
Two basic approaches for communication with an I/O device:
  polling: processor checks whether data has arrived
  interrupt: processor is notified when data has arrived
Interrupt is not always better than polling!