Republic of the Philippines
President Ramon Magsaysay State University
(Formerly Ramon Magsaysay Technological University)
Iba, Zambales, Philippines
Tel/Fax No.: (047) 811-1683
College of Communication and Information Technology
BACHELOR OF SCIENCE IN COMPUTER SCIENCE
Memory Models
Members:
Aireen Grace L. Sion
Maria Althea Montejo
Camile Reyes
King Augustine Sarmiento
Jhon Romer Torres
Submitted to:
Mr. Isarael M. Cabasug
Table of Contents
0.0. Memory Models Introduction
1.0. Memory Hardware Technology
1.1. Memory System Architecture
1.2. High-level Memory Architecture
1.3. Basic Issues in Cache Design
1.4. The Cache Coherence Problem
2.0. User-Level Memory Models
2.1. Shared Memory vs. Message Passing
2.2. Implementation of Shared-Memory Models
2.3. Performance of Shared-Memory Models
3.0. Memory Consistency Models
4.0. Implementation and Performance of Memory Consistency Models
4.1. Hardware Implementations
4.2. Software Implementations
5.0. Conclusions and Trends
Introduction
Memory models are the rule-books for computer memory operations,
particularly in concurrent or multi-threaded setups. They ensure consistency and
order, allowing multiple processors to work together without errors. These models
help keep data accurate and accessible, making sure everything in the system runs
smoothly.
1.0. Memory Hardware Technology
Programmers ideally want memory systems that are both infinitely large and fast.
However, larger memory tends to be slower. Memory systems are measured by size
(storage capacity), latency (speed of data access), and bandwidth (data transfer rate).
Latency includes access time (time to retrieve data) and cycle time (time between requests). Bandwidth measures data supply per unit time. It is linked to cycle time but also depends on memory width and the number of memory banks.
There are two main types of memory chips: DRAM and SRAM. DRAM needs
periodic refreshing and has longer cycle times because reading data also requires
writing it back. SRAM does not need refreshing or write-back and has equal cycle and
access times but is less dense and more expensive. SRAM is about 16 times faster
than DRAM but holds about 16 times less data.
Besides SRAM and DRAM, hybrid memory options like cached DRAMs combine the
best features of both. They integrate a small SRAM buffer with a larger DRAM
memory, caching recently-accessed data to speed up average cycle times. Though
slower than processor-local caches, cached DRAMs benefit all processors and don’t
require coherence maintenance. DRAMs use a two-step addressing process, first
specifying a row and then a column. If accesses are to the same row, the first step can
be skipped, speeding up access time. Bandwidth relates to cycle time, but it can be
increased by parallel access or interleaving memory references across different
modules. Interleaving allows faster aggregate access rates and gives each DRAM bank time to rewrite its data before it is accessed again.
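To see the arithmetic, consider an illustrative configuration (the numbers here are assumptions, not measurements). Suppose each DRAM bank has a cycle time of 80 ns and delivers 8 bytes per access. A single bank can then sustain at most 8 bytes / 80 ns = 100 Mbytes/sec. With four-way interleaving, successive references that fall in different banks can begin every 80 ns / 4 = 20 ns, raising the peak rate to roughly 400 Mbytes/sec, and each bank gains time to complete its internal write-back between its own accesses.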
Most memory chips are 1 to 8 bits wide, but systems need larger data chunks, so
multiple chips are used in parallel, forming a memory bank. High-density chips
minimize cost and physical demands but clash with interleaving goals. For instance,
16 Mbit chips, each supplying 4 bits, require 8 chips to deliver a 32-bit word, making
each memory bank 16 Mbytes. Increasing interleaving requires lower-density DRAM chips, which erodes DRAM's cost advantage over SRAM. Most microprocessor memory systems use
DRAM with processor caches. Vector supercomputers, like the Cray Research C-90
with up to 1024 memory banks, use highly-interleaved SRAM without caches, as their
workloads don't benefit much from caches. Supercomputer customers pay extra for
SRAM's performance.
1.1. Memory System Architecture
Designing a multiprocessor memory system involves key architectural
decisions beyond chip technology. This includes organizing memory into modules,
determining their physical location relative to processors, structuring the address
space, and defining the role of caches. It also involves lower-level cache design
considerations, relevant to both uniprocessors and multiprocessors. Additionally,
addressing coherence is critical in shared-memory systems where data copies may
exist in multiple locations.
1.2. High-level Memory Architecture
Multiprocessor memory system designs rely heavily on the concept of locality. A program with high locality uses only a small portion of its address space during any given interval of time. Architects leverage this by caching frequently accessed information near the processors, creating the illusion of a memory that is both large and fast.
There are two dimensions of locality of importance on a uniprocessor, and a third of importance on parallel machines (a short code fragment after the list illustrates the first two):
1. Temporal Locality. If a data item is referenced once, then it is likely to be
referenced again in the near future.
2. Spatial Locality. If a data item is referenced, then items with nearby
addresses are likely to be referenced in the near future.
3. Processor Locality. If a data item is referenced by a given processor, then that data item (and others at nearby addresses) are likely to be referenced by that same processor in the near future.
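A minimal C fragment illustrating the first two dimensions (the array size is an arbitrary assumption):

    #include <stdio.h>

    #define N 1024   /* arbitrary size for illustration */

    int main(void) {
        static double a[N][N];
        double sum = 0.0;
        /* Spatial locality: the inner loop visits consecutive
           addresses (row-major order), so each cache line fetched
           is fully used before the next one is needed. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        /* Temporal locality: sum and the loop indices are reused on
           every iteration and stay in registers or the cache. */
        printf("%f\n", sum);
        return 0;
    }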
The performance of a memory hierarchy relies on how well caches handle
requests, minimizing the need to access slow main memory. Effective caches reduce
average memory latency and the bandwidth required, making cheaper parts usable.
However, if cache hit rates are low, performance drops, so flat memory systems might
be better for programs with low locality. Supercomputers, which often use flat
memory due to high bandwidth needs, rely on highly-interleaved SRAM and
aggressive prefetching. Trends indicate that large, multi-bank caches could improve
future supercomputer workloads by addressing bandwidth issues.
Designers must decide where to place main memory, either co-locating it with
processors (distributed memory architecture) or separating it across an
interconnection network (dance-hall architecture). Distributed memory improves
scalability for applications with high processor locality but complicates addressing
schemes, requiring node controllers to manage non-local memory requests.
1.3. Basic Issues in Cache Design
To judge cache success in a hierarchical memory system, we use metrics like
hit rate, miss rate, and mean cost per reference (MCPR). The hit rate is the percentage
of memory requests satisfied by the cache, while the miss rate is the percentage that
isn't. Combined, they total 100%. A cache hit occurs when the desired data is found in
the cache and returned; a miss occurs when the data must be fetched from the main
memory, potentially replacing existing cache data. MCPR measures the average
memory access cost, considering both hits and misses, providing a detailed
performance evaluation.
MCPR = HitRate × HitTime + MissRate × MissTime
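Plugging in illustrative numbers (assumptions, not measurements): with a 95% hit rate, a 2-cycle hit time, and a 100-cycle miss time,

MCPR = 0.95 × 2 + 0.05 × 100 = 1.9 + 5.0 = 6.9 cycles per reference.

Even a 5% miss rate thus dominates the average cost, which is why the architectural parameters discussed below matter so much.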
The actual Mean Cost Per Reference (MCPR) depends on both application and
architecture factors. Application factors relate to locality properties, while architecture
factors include cache size, block size, associativity, tag/index organization,
replacement policy, write policy, and coherence mechanism. These parameters impact
performance by affecting cache hit/miss costs and ratios. Most parameters are relevant
to both multiprocessor and uniprocessor caches.
1.4. The Cache Coherence Problem
The cache coherence problem is that of ensuring no processor reads a stale copy of data when multiple copies exist. It arises even in uniprocessors with DMA I/O devices, where it is typically resolved by ad-hoc OS-level mechanisms. In multiprocessors with per-processor caches, the problem is more
complex. Small-scale multiprocessors address this by snooping on a shared bus,
where processors broadcast write actions and others monitor these transactions to
maintain cache consistency.
The main issue with this approach is the assumption that cache hits cost a single
cycle, which is often inaccurate. On-chip caches typically take 2 or 3 cycles to
respond. Compilers try to mask these extra cycles by scheduling independent
operations, but this isn't always feasible.
With current tech, buses can transfer up to 1.2 Gbytes/sec, while processors
consume data at around 200 Mbytes/sec. This limits the number of processors sharing
a bus to 20-30, as caches handle most requests. Without a fast broadcast mechanism,
cache coherence is managed with directory structures, which maintain data
consistency but pose scalability issues. Generally, a small number of pointers for each
sharable line works effectively. CC-NUMA systems store each cache line's directory information at the processor that holds its main memory (its home node), while COMA machines use more dynamic structures. Examples include the Stanford Dash (a Θ(P²) directory) and the MIT Alewife (a limited set of pointers per line, with software traps on overflow). The Convex Exemplar uses the IEEE SCI standard to build a distributed directory with linear space overhead.
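To make the directory idea concrete, the following C declaration sketches a limited-pointer directory entry; it is a minimal illustration, not the format of any of the machines named above:

    #include <stdint.h>

    #define PTRS_PER_LINE 4   /* assumed fixed pointer budget */

    enum line_state { UNCACHED, SHARED, EXCLUSIVE };

    /* One entry per memory line, kept at the line's home node. */
    struct dir_entry {
        enum line_state state;
        uint8_t  num_sharers;            /* pointers currently in use */
        uint16_t sharer[PTRS_PER_LINE];  /* IDs of caching processors */
    };

    /* When num_sharers would exceed PTRS_PER_LINE, a system such as
       Alewife traps to software rather than widening the entry. */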
2.0. User-Level Memory Models
2.1. Shared Memory vs. Message Passing
Memory models matter at both hardware and programmer levels. They
influence the effort needed to create effective parallel programs. Hardware memory
models are either message-passing or shared-memory, and programmer-level models
follow the same division. In the shared-memory model, processes access variables directly by name. In the message-passing model, variables are partitioned among processes, so accessing a non-local variable requires an exchange of messages. Both models can be provided to programmers in several ways (a short sketch of the contrast follows the list):
• via library routines linked into programs written in a conventional sequential
language;
• via features added to some existing sequential language by means of preprocessors
or compiler extensions of various levels of complexity; or
• via special-purpose parallel or distributed programming languages.
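A minimal sketch of the contrast in C (the names are ours; shared memory is shown with POSIX threads, message passing with a POSIX pipe):

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    int shared_x;   /* shared memory: both threads name x directly */

    void *reader(void *arg) {
        (void)arg;
        printf("shared read: %d\n", shared_x); /* access by name */
        return NULL;
    }

    int main(void) {
        /* Shared-memory style: write, then let another thread read. */
        shared_x = 42;
        pthread_t t;
        pthread_create(&t, NULL, reader, NULL);
        pthread_join(t, NULL);

        /* Message-passing style: the value must be sent explicitly. */
        int fd[2], msg = 42, recv;
        pipe(fd);
        write(fd[1], &msg, sizeof msg);   /* send */
        read(fd[0], &recv, sizeof recv);  /* receive */
        printf("message received: %d\n", recv);
        return 0;
    }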
Library packages offer portability and simplicity but are limited to subroutine
calls and can't utilize compiler-based optimizations. Languages and extensions can
use more complex syntax and optimizations. Message-passing models, implemented
as subroutine calls, tend to be more successful than shared-memory models, which
must access shared data through pointers from library routines. Andrews and
Schneider offer a solid intro to parallel programming, with Andrews' later book
adding more detail. Bal, Steiner, and Tanenbaum survey message-passing models,
while Cheng's report covers various programming models from the early '90s. The
widely-used library-based shared-memory model for Fortran and C by Argonne
National Laboratory researchers is notable. Shared-memory models are generally seen
as easier than message-passing models, evidenced by their market dominance and
efforts to provide shared memory on message-passing hardware. The rest of this discussion focuses on shared-memory models.
2.2. Implementation of Shared-Memory Models
Any user-level parallel programming model must deal with several issues; a short code sketch after the list below illustrates the first three. These include
• specifying the computational tasks to be performed and the data structures to be
used;
• identifying data dependences among tasks;
• allocating tasks to processes, and scheduling and synchronizing processes on
processors, in a way that respects the dependences;
• determining (which copies of) which data should be located at which processors at which points in time, so that processes have the data they need when they need it; and
• arranging for communication among processors to effect the location decisions.
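To make the first three issues concrete, here is a minimal POSIX-threads sketch (sizes and names are assumptions) that divides an array sum into independent tasks and synchronizes with a join:

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    #define NTHREADS 4

    static double a[N];
    static double partial[NTHREADS];

    /* Task: sum one contiguous slice. Slices are disjoint, so the
       tasks have no data dependences on one another. */
    void *sum_slice(void *arg) {
        long id = (long)arg;
        long lo = id * (N / NTHREADS), hi = lo + (N / NTHREADS);
        double s = 0.0;
        for (long i = lo; i < hi; i++) s += a[i];
        partial[id] = s;
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++) a[i] = 1.0;

        pthread_t t[NTHREADS];
        for (long id = 0; id < NTHREADS; id++)
            pthread_create(&t[id], NULL, sum_slice, (void *)id);

        /* The join respects the dependence of the total on all tasks. */
        double total = 0.0;
        for (long id = 0; id < NTHREADS; id++) {
            pthread_join(t[id], NULL);
            total += partial[id];
        }
        printf("total = %f\n", total);
        return 0;
    }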
Programming models differ in how much responsibility they place on users,
compilers, run-time systems, operating systems, and hardware. The programmer's
view of memory is closely tied to parallelization and data placement. Some models,
like message-passing and non-cache-coherent systems, make users handle all aspects.
For instance, Split-C offers various mechanisms for data management in C programs
on multicomputers. Other models, like Sun’s LWP and OSF’s pthreads, require
explicit process management by users but use hardware for data placement.
In large-scale scientific computing, where multicomputers are prevalent, there
has been significant focus on developing efficient shared-memory programming
models. High Performance Fortran (HPF) is a key outcome, developed through
collaboration among many research groups. HPF merges Fortran-90 syntax with
Fortran-D's data distribution and alignment concepts, letting users specify data
placement and parallel loop execution. The compiler follows the "owner computes"
rule, assigning computations based on data location. Similar efforts include pC++ and
Jade, parallel versions of C++ and C, respectively.
2.3. Performance of Shared-Memory Models
Bus-based shared memory machines lead in general-purpose
multiprocessors, while message-passing machines dominate high-
performance markets. The challenges in building large-scale
shared-memory multiprocessors and the perceived inefficiency of
shared memory models are key reasons for this. However, the
inefficiencies in shared memory are believed to result from the way it is used, not from the model itself.
To achieve good performance in shared-memory parallel programs, reducing communication is crucial. Improving locality is a complex, multi-faceted challenge and an active area of research; key aspects include optimizing data access patterns, minimizing data movement, and enhancing data placement. The most common sources of excess communication are listed below (a short code sketch follows the list).
Mismatch between the program data structures and the coherence units. If data structures are not aligned with the coherence units, accesses to unrelated data structures will cause unnecessary communication.

Relative location of computational tasks and the data they access. Random placement of tasks (as achieved by centralized work queues) results in very high communication rates. Programmers and compilers should attempt to place all tasks that access the same data on the same processor, to take advantage of processor locality.

Poor spatial locality. In some cases programs access array data with non-unit strides. A potentially large amount of the data fetched into the cache then remains untouched. Blocking techniques and algorithmic restructuring can often help alleviate this problem.

Poor temporal locality. Data is invalidated or evicted from a processor's cache before it is touched again. As in the previous case, restructuring the program can improve its access pattern and, as a consequence, its temporal locality.

Conflicts between data structures. Data structures that map into the same cache lines will cause this problem in caches with limited associativity. If these data structures are used in the same computational phase, performance can suffer severe degradation. Skewing or relocating the data structures can help solve this problem.

Communication intrinsic in the algorithm. Such communication is necessary for correctness and cannot be removed. However, it may be tolerated if data is prefetched before it is needed, or if the processor can switch to other computation while a miss is being serviced.
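As a small illustration of the spatial-locality point above (array size assumed), compare a strided traversal of a C array with a unit-stride one:

    #include <stddef.h>

    #define N 2048

    /* Poor spatial locality: C stores arrays row-major, so this
       column-by-column walk uses one element per fetched line. */
    double sum_by_columns(double a[N][N]) {
        double s = 0.0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += a[i][j];          /* stride of N doubles */
        return s;
    }

    /* Restructured: the same sum with unit stride, so every element
       of each fetched cache line is used before it is evicted. */
    double sum_by_rows(double a[N][N]) {
        double s = 0.0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += a[i][j];          /* unit stride */
        return s;
    }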
3.0. Memory Consistency Models
Shared-memory coherence protocols resolve conflicting accesses to the same
memory block by different processors. The method of resolving these conflicts
defines the memory consistency model, determining the outcomes of parallel reads
and writes. Sequential consistency, aiming for behavior similar to sequential
machines, has been the traditional choice. However, it restricts performance
optimizations. By relaxing these constraints, significant performance improvements in
parallel programs are possible, though this leads to a more complex programming
model.
The following example illustrates the difference (assume X and Y are initially 0):

Processor 0        Processor 1
X = 10             Y = 10
A = Y              B = X
print A            print B
Sequential Consistency means all processors must agree on the order of memory
events. If two processors write to the same memory location, every processor must
concur on which write happened first.
Processor Consistency means processors don't need to agree on the order of reads
and writes by different processors. A write can even be delayed beyond some of the
processor's later reads.
Under sequential consistency, these instructions ensure that at least one value of 10 is printed: all memory operations appear in a single global order, and whichever read comes last in that order must follow at least one of the writes. Under processor consistency, processors need not agree on the write order, and a read may bypass the processor's own earlier write, so both A and B can turn out to be 0.
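The example can be written in C11, whose atomics expose exactly this choice: memory_order_seq_cst requests sequentially consistent ordering, while memory_order_relaxed permits the weaker behavior (a sketch with assumed names):

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    atomic_int X, Y;   /* both start at 0 */
    int A, B;

    void *proc0(void *arg) {
        (void)arg;
        atomic_store_explicit(&X, 10, memory_order_seq_cst);
        A = atomic_load_explicit(&Y, memory_order_seq_cst);
        return NULL;
    }

    void *proc1(void *arg) {
        (void)arg;
        atomic_store_explicit(&Y, 10, memory_order_seq_cst);
        B = atomic_load_explicit(&X, memory_order_seq_cst);
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, proc0, NULL);
        pthread_create(&t1, NULL, proc1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        /* With seq_cst, A == 0 && B == 0 is impossible; replacing
           the orders above with memory_order_relaxed would allow it. */
        printf("A = %d, B = %d\n", A, B);
        return 0;
    }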
4.0. Implementation and Performance of Memory Consistency Models
The implementation of memory consistency models presents architects with several
design decisions:
• Should coherence be implemented in hardware, software, or some combination of
the two?
• What should be the size of coherence blocks?
• Should copies of data be updated or invalidated on a write?
• Should the updates/invalidates be performed in an eager or lazy fashion?
While all combinations of answers to the above questions are possible, the choice between hardware and software has a strong influence on the remaining questions.
For this reason we will discuss hardware and software implementation of consistency
models separately.
4.1. Hardware Implementations
In hardware implementations, the coherence unit is typically a cache line (32-
256 bytes). Hardware systems generally use eager protocols, updating or invalidating
data immediately when inconsistencies arise. Lazy protocols, which delay these
actions until needed, are usually too costly for hardware. Invalidation is preferred due
to less communication required, although hybrid protocols with some updates might
enhance performance. Relaxed consistency models let reads bypass writes,
completing before prior writes are done. Non-uniform models allow pipelining and
merging of writes between synchronization accesses. Models differentiating
synchronization access types enable pipelining with adjacent data accesses.
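Hardware invalidation protocols are usually described as state machines over cache-line states. The following C sketch uses the widely taught MESI states as an illustration (the text above does not prescribe a specific protocol):

    /* A sketch of an invalidation-based protocol's per-line state,
       in the style of the MESI scheme (illustrative only). */
    enum mesi { MODIFIED, EXCLUSIVE, SHARED, INVALID };

    /* Eager invalidation: when another processor writes the line,
       every remote copy is invalidated immediately. */
    enum mesi on_remote_write(enum mesi s) {
        (void)s;
        return INVALID;            /* any state -> INVALID */
    }

    /* Under an update protocol, the remote copies would instead be
       overwritten with the new data and remain SHARED. */
    enum mesi on_remote_write_update(enum mesi s) {
        return (s == INVALID) ? INVALID : SHARED;
    }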
4.2. Software Implementations
Eager hardware protocols run in parallel with the application, while software protocols compete with it for processor cycles and therefore incur higher overhead. For this reason, most distributed shared memory (DSM) systems use lazy protocols.
decision between updating and invalidating is complex; large coherence blocks make
reacquiring invalidated blocks costly, but updating small pieces is cheaper. Large
blocks can also cause false sharing. To reduce performance impact, some software
systems allow concurrent writes to a page without synchronization, assuming
synchronization protects truly shared data. At synchronization points, changes from
each processor are merged into a consistent copy, usually by comparing modified
pages to an unmodified shadow copy. Relaxed consistency models in software reduce
coherence transactions by removing the single writer restriction, allowing processors
to focus more on productive work at synchronization points. This decreases coherence
traffic, easing memory and network congestion, and speeding up other operations.
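A minimal sketch of the shadow-copy ("twin") comparison just described, with an assumed page size and hypothetical helper names:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096   /* assumed page size */

    /* On the first write to a page, save an unmodified "twin". */
    void make_twin(const uint8_t *page, uint8_t *twin) {
        memcpy(twin, page, PAGE_SIZE);
    }

    /* At a synchronization point, compare the page to its twin and
       record only the bytes this processor actually changed, so that
       concurrent writers to the same page can later be merged. */
    size_t make_diff(const uint8_t *page, const uint8_t *twin,
                     size_t offs[], uint8_t vals[]) {
        size_t n = 0;
        for (size_t i = 0; i < PAGE_SIZE; i++)
            if (page[i] != twin[i]) {
                offs[n] = i;
                vals[n] = page[i];
                n++;
            }
        return n;   /* number of modified bytes */
    }

    /* The consistent (home) copy applies each processor's diff. */
    void apply_diff(uint8_t *home, const size_t offs[],
                    const uint8_t vals[], size_t n) {
        for (size_t i = 0; i < n; i++) home[offs[i]] = vals[i];
    }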
The debate between software and hardware implementations is ongoing. Recent
findings indicate that software coherence with lazy relaxed consistency rivals
hardware's eager relaxed consistency. Hardware architects also see the flexibility of
software as a key performance boost, with some designs integrating programmable
protocol engines.
5.0. Conclusions and Trends
How we conceptualize memory influences system design across multiple
layers: hardware, architecture, programming models, and memory consistency
models. Three principles stand out for developing memory systems that are cost-effective, efficient, and easy to use.
1. Referential transparency. Shared memory aligns more closely with the sequential programming style most programmers are used to, while message passing can complicate
parallel programming. However, message passing can enhance performance in certain
scenarios. Many newer machines and systems now support both models to take
advantage of their respective benefits.
2. The principle of locality. Locality plays a crucial role at every level of the
memory hierarchy: page mode DRAMs in hardware, hierarchical memory systems in
architecture, and NUMA programming models for users. Utilizing locality helps
create cost-effective memory systems and improves performance when programming
with locality principles in mind.
3. Sequentially consistent ordering of events. Weaker models improve speed but
complicate programming. Adopting programmer-centric consistency models, which
ensure sequential consistency with specific synchronization rules, offers a promising
balance between performance and ease of programming.
Effective memory system design is a major research focus in multiprocessors.
Key trends include deeper memory hierarchies with possible third-level off-
chip caches and increasing importance of locality due to faster processors.
Current research aims to improve program locality via compiler techniques.
Programmable cache controllers might soon allow protocol customization
based on program needs. As memory costs rise, optimizing memory utilization
may become more important than processor utilization for peak performance.