Academia.edu

Cache Coherence

1,337 papers
31 followers
Cache coherence refers to the consistency of data stored in local caches of a shared resource, ensuring that changes in one cache are reflected across all caches in a multiprocessor system. It addresses the challenges of maintaining uniform data visibility and integrity among multiple processors accessing shared memory.
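The invalidation behavior described above can be illustrated with a minimal toy model of a write-invalidate protocol. This is a sketch under simplifying assumptions (write-through caches, a broadcast "bus" modeled as a shared list), not any specific protocol such as MESI; all class and method names are illustrative.

```python
# Toy write-invalidate coherence model: each cache holds a value per
# address; a write by one cache invalidates all other cached copies,
# so a later read by another cache must refetch from memory.

class Memory:
    def __init__(self):
        self.data = {}

class Cache:
    def __init__(self, memory, peers):
        self.memory = memory
        self.peers = peers          # list shared by all caches on the "bus"
        self.lines = {}             # addr -> value (presence = valid)
        peers.append(self)

    def read(self, addr):
        if addr not in self.lines:                 # miss: fetch from memory
            self.lines[addr] = self.memory.data.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        for cache in self.peers:                   # broadcast invalidation
            if cache is not self:
                cache.lines.pop(addr, None)
        self.lines[addr] = value
        self.memory.data[addr] = value             # write-through for simplicity

mem, bus = Memory(), []
c0, c1 = Cache(mem, bus), Cache(mem, bus)
c1.read(0x10)          # c1 caches the initial value 0
c0.write(0x10, 42)     # invalidates c1's stale copy
print(c1.read(0x10))   # c1 misses and refetches: prints 42
```

Without the invalidation broadcast, c1 would keep returning its stale 0 — exactly the inconsistency that coherence protocols exist to prevent.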
As is well known, Lamport's Bakery algorithm for mutual exclusion of n processes is correct if physically shared memory is used as the communication facility between processes. An application of weaker consistency models (e.g. causal,... more
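For reference, the Bakery algorithm mentioned above can be sketched as follows. The correctness of this sketch relies on sequentially consistent shared memory, which is precisely the point of the snippet: under weaker models, the plain loads and stores below would need additional ordering. CPython's interpreter behaves sequentially consistently for these list accesses, so the demonstration holds there; constants and helper names are illustrative.

```python
# Lamport's Bakery algorithm: each thread takes a "ticket" larger than
# all tickets it can see, then waits for every thread with a smaller
# (ticket, id) pair. Correct under sequential consistency.

import threading
import time

N = 3
choosing = [False] * N
number = [0] * N
counter = 0                           # shared resource guarded by the lock

def lock(i):
    choosing[i] = True
    number[i] = 1 + max(number)       # take a ticket above all current ones
    choosing[i] = False
    for j in range(N):
        if j == i:
            continue
        while choosing[j]:            # wait until j finishes choosing
            time.sleep(0)
        # wait while j holds a smaller ticket (ties broken by thread id)
        while number[j] != 0 and (number[j], j) < (number[i], i):
            time.sleep(0)

def unlock(i):
    number[i] = 0

def worker(i):
    global counter
    for _ in range(300):
        lock(i)
        counter += 1                  # critical section: no lost updates
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                        # 900: mutual exclusion held
```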
This paper gives a general discussion of memory consistency models such as Strict Consistency, Sequential Consistency, Processor Consistency, and Weak Consistency, then covers techniques for implementing distributed shared memory systems and... more
Definition: NUMA is the acronym for Non-Uniform Memory Access. A NUMA cache is a cache memory in which the access time is not uniform but depends on the position of the involved block inside the cache. Among NUMA caches, it... more
We present Task Superscalar, an abstraction of the instruction-level out-of-order pipeline that operates at the task level. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task superscalar uncovers task-level... more
In this paper, a cache coherence scheme for multiprocessors is introduced. There is a specific model for each kind of software; cache coherence on the AHB bus can be solved with these models. First, we use a dynamic address mapping policy to realize... more
We present a novel method for computing cache-oblivious layouts of large meshes that improve the performance of interactive visualization and geometric processing algorithms. Given that the mesh is accessed in a reasonably coherent... more
Lastly, and most importantly, I would like to thank my parents, grandparents and my sister for their constant motivation and encouragement. Big thanks goes to my Uncle, Aunt and Anika.
An efficient interprocessor communication mechanism is essential to the performance of hypercube multiprocessors. All existing hypercube multiprocessors basically support one-to-one interprocessor communication only. However,... more
IBM and Intel now offer commercial systems with Transactional Memory (TM), a programming paradigm whose aim is to facilitate concurrent programming while maximizing parallelism. These TM systems are implemented in hardware and provide a... more
The choice of a communication paradigm, or protocol, is central to the design of a largescale multiprocessor system. Unlike traditional multiprocessors, the FLASH machine uses a programmable node controller, called MAGIC, to implement all... more
A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In... more
Distributed shared memory is an architectural approach that allows multiprocessors to support a single shared address space that is implemented with physically distributed memories. Hardware-supported distributed shared memory is becoming... more
The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of... more
In the architecture of contemporary distributed systems, caching serves as a vital optimization strategy. This study explores the theoretical foundations, implementation patterns, and performance implications of various caching... more
exist that facilitate the formal specification and prototyping of distributed systems. In this paper, we describe certain features of the Dynamic Coordinated Concurrent Activities (DCCA) model. Any DCCA specification consists of a set of... more
In order to construct a test-bed for investigating new programming paradigms for future "manycore" systems (i.e. those with at least a thousand cores), we are building a Smalltalk virtual machine that attempts to efficiently use... more
Fence instructions are fundamental primitives that ensure consistency in a weakly consistent shared memory multicore processor. The execution cost of these instructions is significant and adds a non-trivial overhead to parallel programs.... more
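The role a fence plays on a weakly consistent machine can be shown with a toy store-buffer model. This is a simulation of the classic Dekker-style pattern, not real hardware semantics; all names are illustrative.

```python
# Toy model of why fences matter: each core buffers its stores, so its
# own write may not yet be visible to the other core. A fence drains
# the store buffer, making pending stores globally visible.

class Core:
    def __init__(self, shared):
        self.shared = shared
        self.buffer = []              # pending (addr, value) stores

    def store(self, addr, value):
        self.buffer.append((addr, value))

    def fence(self):                  # drain store buffer to shared memory
        for addr, value in self.buffer:
            self.shared[addr] = value
        self.buffer.clear()

    def load(self, addr):
        for a, v in reversed(self.buffer):   # store-to-load forwarding
            if a == addr:
                return v
        return self.shared.get(addr, 0)

def dekker(use_fence):
    shared = {}
    c0, c1 = Core(shared), Core(shared)
    c0.store('x', 1)                  # core 0: x = 1; then read y
    c1.store('y', 1)                  # core 1: y = 1; then read x
    if use_fence:
        c0.fence()
        c1.fence()
    return c0.load('y'), c1.load('x')

print(dekker(False))   # (0, 0): each load misses the other's buffered store
print(dekker(True))    # (1, 1): fences made both stores visible first
```

The (0, 0) outcome is impossible under sequential consistency, which is why the fenced version — and its execution cost — is unavoidable on weakly ordered processors.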
In this paper, we study the verification of dense time properties by discrete time analysis. Interval Duration Logic, (IDL), is a highly expressive dense time logic for specifying properties of real-time systems. Validity checking of IDL... more
The increased number of cores integrated on a chip has brought about a number of challenges. Concerns about the scalability of cache coherence protocols have urged both researchers and practitioners to explore alternative programming... more
Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6, 5]. Tarantula adds to the EV8 core a vector unit... more
An intriguing aspect of optical interconnects from an architectural point of view is their ability to reconfigure the topology in a data-transparent way. We focus in this work on the potentialities of such dynamically reconfigurable... more
This dissertation introduces, investigates, and evaluates a low-cost high-speed twin-prefetching DSP-based bus- interconnected shared-memory system for real-time image processing applications. The proposed architecture can effectively... more
Heterogeneous systems that integrate a multicore CPU and a GPU on the same die are ubiquitous. On these systems, both the CPU and GPU share the same physical memory as opposed to using separate memory dies. Although integration eliminates... more
Adaptive applications have computational workloads and communication patterns which change unpredictably at runtime, requiring dynamic load balancing to achieve scalable performance on parallel machines. Efficient parallel implementations... more
A shared data structure is lock-free if its operations do not require mutual exclusion. If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object. In highly concurrent... more
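The lock-free retry pattern described above can be sketched with a compare-and-swap (CAS) loop: read the current state, compute a new state, and install it atomically, retrying if another process changed the state in between. Hardware CAS is simulated here with an internal lock; the class and function names are illustrative.

```python
# Lock-free increment via compare-and-swap: no operation blocks waiting
# for another; a failed CAS simply retries with the freshly read value.

import threading

class CasCell:
    """A cell supporting an atomic compare-and-swap (simulated with an
    internal lock; on real hardware CAS is a single instruction)."""
    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def lock_free_increment(cell):
    while True:                        # retry loop: never holds a user lock
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return

cell = CasCell(0)
threads = [threading.Thread(
    target=lambda: [lock_free_increment(cell) for _ in range(1000)])
    for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.load())                     # 4000: no increment is lost
```

Note the contrast with mutual exclusion: if one incrementer is interrupted between its load and its CAS, others proceed; the interrupted one just observes a failed CAS and retries.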
In this paper we introduce TagTM, a Software Transactional Memory (STM) system augmented with a new hardware mechanism that we call GTags. GTags are new hardware cache coherent tags that are used for fast meta-data access. TagTM uses... more
The scalability of an OpenMP program in a ccNUMA system with a large number of processors suffers from remote memory accesses, cache misses and false sharing. Good data locality is needed to overcome these problems whereas OpenMP offers... more
In this paper, we study two hierarchical N-Body methods for Network-on-Chip (NoC) architectures. The modern Chip Multiprocessor (CMP) designs are mainly based on the shared-bus communication architecture. As the number of cores increases,... more
The purpose of this study is to improve web server performance. Bottlenecks such as network traffic overload and congested web servers have yet to be solved due to increasing internet usage. Caching is one of the popular... more
Synchronization is a critical operation in many parallel applications. Conservative synchronization mechanisms are failing to keep up with the increasing demand for well-organized management operations as systems grow larger and network... more
This paper compares eight commercial parallel processors along several dimensions. The processors include four shared-bus multiprocessors (the Encore Multimax, the Sequent Balance system, the Alliant FX series, and the ELXSI System 6400)... more
Many of the programming challenges encountered in small to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such... more
Hierarchical N-body methods, which are based on a fundamental insight into the nature of many physical processes, are increasingly being used to solve large-scale problems in a variety of scientific/engineering domains. Applications that... more
This paper provides a case study of specifying an abstract memory consistency model, providing possible implementations for the model, and proving the correctness of implementations. Specifically, we introduce a class of memory... more
Objectives: To design Distributed Shared Memory (DSM) for the multiprocessor distributed framework using a different software parametric approach that provides significant performance improvement over conventional software-based... more
This paper presents an efficient routing and flow control mechanism to implement multidestination message passing in wormhole networks.It is targeted to situations where the size of message data is very small, like in invalidation and... more
The explosion in workload complexity and the recent slow-down in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software... more
In distributed shared memory multiprocessors, remote memory accesses generate processor-tomemory traffic which may result in a bottleneck. It is therefore important to design algorithms that minimize the number of remote memory accesses.... more
Memory access latency is the primary performance bottleneck in modern computer systems. Prefetching data before it is needed by a processing core allows substantial performance gains by overlapping significant portions of memory latency... more
DASH is a scalable shared-memory multiprocessor currently being developed at Stanford's Computer Systems Laboratory. The architecture consists of powerful processing nodes, each with a portion of the shared-memory, connected to a... more