Academia.edu

Cache Coherence

1,337 papers
31 followers
Cache coherence refers to the consistency of data stored in local caches of a shared resource, ensuring that changes in one cache are reflected across all caches in a multiprocessor system. It addresses the challenges of maintaining uniform data visibility and integrity among multiple processors accessing shared memory.
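The invalidation behavior described above can be illustrated with a minimal toy model of a write-invalidate protocol. This is a sketch under simplifying assumptions (write-through caches, a broadcast "bus" modeled as a shared list), not any specific protocol such as MESI; all class and method names are illustrative.

```python
# Toy write-invalidate coherence model: each cache holds a value per
# address; a write by one cache invalidates all other cached copies,
# so a later read by another cache must refetch from memory.

class Memory:
    def __init__(self):
        self.data = {}

class Cache:
    def __init__(self, memory, peers):
        self.memory = memory
        self.peers = peers          # list shared by all caches on the "bus"
        self.lines = {}             # addr -> value (presence = valid)
        peers.append(self)

    def read(self, addr):
        if addr not in self.lines:                 # miss: fetch from memory
            self.lines[addr] = self.memory.data.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        for cache in self.peers:                   # broadcast invalidation
            if cache is not self:
                cache.lines.pop(addr, None)
        self.lines[addr] = value
        self.memory.data[addr] = value             # write-through for simplicity

mem, bus = Memory(), []
c0, c1 = Cache(mem, bus), Cache(mem, bus)
c1.read(0x10)          # c1 caches the initial value 0
c0.write(0x10, 42)     # invalidates c1's stale copy
print(c1.read(0x10))   # c1 misses and refetches: prints 42
```

Without the invalidation broadcast, c1 would keep returning its stale 0 — exactly the inconsistency that coherence protocols exist to prevent.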
As is well known, Lamport's Bakery algorithm for mutual exclusion of n processes is correct if physically shared memory is used as the communication facility between processes. An application of weaker consistency models (e.g. causal,... more
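For reference, the Bakery algorithm mentioned above can be sketched as follows. The correctness of this sketch relies on sequentially consistent shared memory, which is precisely the point of the snippet: under weaker models, the plain loads and stores below would need additional ordering. CPython's interpreter behaves sequentially consistently for these list accesses, so the demonstration holds there; constants and helper names are illustrative.

```python
# Lamport's Bakery algorithm: each thread takes a "ticket" larger than
# all tickets it can see, then waits for every thread with a smaller
# (ticket, id) pair. Correct under sequential consistency.

import threading
import time

N = 3
choosing = [False] * N
number = [0] * N
counter = 0                           # shared resource guarded by the lock

def lock(i):
    choosing[i] = True
    number[i] = 1 + max(number)       # take a ticket above all current ones
    choosing[i] = False
    for j in range(N):
        if j == i:
            continue
        while choosing[j]:            # wait until j finishes choosing
            time.sleep(0)
        # wait while j holds a smaller ticket (ties broken by thread id)
        while number[j] != 0 and (number[j], j) < (number[i], i):
            time.sleep(0)

def unlock(i):
    number[i] = 0

def worker(i):
    global counter
    for _ in range(300):
        lock(i)
        counter += 1                  # critical section: no lost updates
        unlock(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                        # 900: mutual exclusion held
```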
This paper gives a general discussion of memory consistency models such as Strict Consistency, Sequential Consistency, Processor Consistency, and Weak Consistency, then covers techniques for implementing distributed shared memory systems and... more
Definition: NUMA is the acronym for Non-Uniform Memory Access. A NUMA cache is a cache memory in which the access time is not uniform but depends on the position of the involved block inside the cache. Among NUMA caches, it... more
We present Task Superscalar, an abstraction of the instruction-level out-of-order pipeline that operates at the task level. Like ILP pipelines, which uncover parallelism in a sequential instruction stream, task superscalar uncovers task-level... more
In this paper, a cache coherence scheme for multiprocessors is introduced. There is a specific model for each kind of software; cache coherence on the AHB bus can be solved with these models. First, we use a dynamic address mapping policy to realize... more
We present a novel method for computing cache-oblivious layouts of large meshes that improve the performance of interactive visualization and geometric processing algorithms. Given that the mesh is accessed in a reasonably coherent... more
Lastly, and most importantly, I would like to thank my parents, grandparents and my sister for their constant motivation and encouragement. Big thanks goes to my Uncle, Aunt and Anika.
An efficient interprocessor communication mechanism is essential to the performance of hypercube multiprocessors. All existing hypercube multiprocessors basically support one-to-one interprocessor communication only. However,... more
IBM and Intel now offer commercial systems with Transactional Memory (TM), a programming paradigm whose aim is to facilitate concurrent programming while maximizing parallelism. These TM systems are implemented in hardware and provide a... more
The choice of a communication paradigm, or protocol, is central to the design of a largescale multiprocessor system. Unlike traditional multiprocessors, the FLASH machine uses a programmable node controller, called MAGIC, to implement all... more
A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In... more
Distributed shared memory is an architectural approach that allows multiprocessors to support a single shared address space that is implemented with physically distributed memories. Hardware-supported distributed shared memory is becoming... more
The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of... more
In the architecture of contemporary distributed systems, caching serves as a vital optimization strategy. This study explores the theoretical foundations, implementation patterns, and performance implications of various caching... more
exist that facilitate the formal specification and prototyping of distributed systems. In this paper, we describe certain features of the Dynamic Coordinated Concurrent Activities (DCCA) model. Any DCCA specification consists of a set of... more
In order to construct a test-bed for investigating new programming paradigms for future "manycore" systems (i.e. those with at least a thousand cores), we are building a Smalltalk virtual machine that attempts to efficiently use... more
Fence instructions are fundamental primitives that ensure consistency in a weakly consistent shared memory multicore processor. The execution cost of these instructions is significant and adds a non-trivial overhead to parallel programs.... more
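The role a fence plays on a weakly consistent machine can be shown with a toy store-buffer model. This is a simulation of the classic Dekker-style pattern, not real hardware semantics; all names are illustrative.

```python
# Toy model of why fences matter: each core buffers its stores, so its
# own write may not yet be visible to the other core. A fence drains
# the store buffer, making pending stores globally visible.

class Core:
    def __init__(self, shared):
        self.shared = shared
        self.buffer = []              # pending (addr, value) stores

    def store(self, addr, value):
        self.buffer.append((addr, value))

    def fence(self):                  # drain store buffer to shared memory
        for addr, value in self.buffer:
            self.shared[addr] = value
        self.buffer.clear()

    def load(self, addr):
        for a, v in reversed(self.buffer):   # store-to-load forwarding
            if a == addr:
                return v
        return self.shared.get(addr, 0)

def dekker(use_fence):
    shared = {}
    c0, c1 = Core(shared), Core(shared)
    c0.store('x', 1)                  # core 0: x = 1; then read y
    c1.store('y', 1)                  # core 1: y = 1; then read x
    if use_fence:
        c0.fence()
        c1.fence()
    return c0.load('y'), c1.load('x')

print(dekker(False))   # (0, 0): each load misses the other's buffered store
print(dekker(True))    # (1, 1): fences made both stores visible first
```

The (0, 0) outcome is impossible under sequential consistency, which is why the fenced version — and its execution cost — is unavoidable on weakly ordered processors.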
In this paper, we study the verification of dense time properties by discrete time analysis. Interval Duration Logic, (IDL), is a highly expressive dense time logic for specifying properties of real-time systems. Validity checking of IDL... more
The increased number of cores integrated on a chip has brought about a number of challenges. Concerns about the scalability of cache coherence protocols have urged both researchers and practitioners to explore alternative programming... more
Tarantula is an aggressive floating point machine targeted at technical, scientific and bioinformatics workloads, originally planned as a follow-on candidate to the EV8 processor [6, 5]. Tarantula adds to the EV8 core a vector unit... more
An intriguing aspect of optical interconnects from an architectural point of view is their ability to reconfigure the topology in a data-transparent way. We focus in this work on the potentialities of such dynamically reconfigurable... more
This dissertation introduces, investigates, and evaluates a low-cost high-speed twin-prefetching DSP-based bus- interconnected shared-memory system for real-time image processing applications. The proposed architecture can effectively... more
Heterogeneous systems that integrate a multicore CPU and a GPU on the same die are ubiquitous. On these systems, both the CPU and GPU share the same physical memory as opposed to using separate memory dies. Although integration eliminates... more
Adaptive applications have computational workloads and communication patterns which change unpredictably at runtime, requiring dynamic load balancing to achieve scalable performance on parallel machines. Efficient parallel implementations... more
A shared data structure is lock-free if its operations do not require mutual exclusion. If one process is interrupted in the middle of an operation, other processes will not be prevented from operating on that object. In highly concurrent... more
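The lock-free retry pattern described above can be sketched with a compare-and-swap (CAS) loop: read the current state, compute a new state, and install it atomically, retrying if another process changed the state in between. Hardware CAS is simulated here with an internal lock; the class and function names are illustrative.

```python
# Lock-free increment via compare-and-swap: no operation blocks waiting
# for another; a failed CAS simply retries with the freshly read value.

import threading

class CasCell:
    """A cell supporting an atomic compare-and-swap (simulated with an
    internal lock; on real hardware CAS is a single instruction)."""
    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def lock_free_increment(cell):
    while True:                        # retry loop: never holds a user lock
        old = cell.load()
        if cell.compare_and_swap(old, old + 1):
            return

cell = CasCell(0)
threads = [threading.Thread(
    target=lambda: [lock_free_increment(cell) for _ in range(1000)])
    for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(cell.load())                     # 4000: no increment is lost
```

Note the contrast with mutual exclusion: if one incrementer is interrupted between its load and its CAS, others proceed; the interrupted one just observes a failed CAS and retries.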
In this paper we introduce TagTM, a Software Transactional Memory (STM) system augmented with a new hardware mechanism that we call GTags. GTags are new hardware cache coherent tags that are used for fast meta-data access. TagTM uses... more
The scalability of an OpenMP program in a ccNUMA system with a large number of processors suffers from remote memory accesses, cache misses and false sharing. Good data locality is needed to overcome these problems whereas OpenMP offers... more
In this paper, we study two hierarchical N-Body methods for Network-on-Chip (NoC) architectures. The modern Chip Multiprocessor (CMP) designs are mainly based on the shared-bus communication architecture. As the number of cores increases,... more
The purpose of this study is to improve web server performance. Bottlenecks such as network traffic overload and congested web servers have yet to be solved due to increasing internet usage. Caching is one of the popular... more
Synchronization is a critical operation in many parallel applications. Conservative synchronization mechanisms are failing to keep up with the increasing demand for well-organized management operations as systems grow larger and network... more
This paper compares eight commercial parallel processors along several dimensions. The processors include four shared-bus multiprocessors (the Encore Multimax, the Sequent Balance system, the Alliant FX series, and the ELXSI System 6400)... more
Many of the programming challenges encountered in small to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such... more
Hierarchical N-body methods, which are based on a fundamental insight into the nature of many physical processes, are increasingly being used to solve large-scale problems in a variety of scientific/engineering domains. Applications that... more
This paper provides a case study of specifying an abstract memory consistency model, providing possible implementations for the model, and proving the correctness of implementations. Specifically, we introduce a class of memory... more
Objectives: To design Distributed Shared Memory (DSM) for the multiprocessor distributed framework using a different software parametric approach that provides significant performance improvement over conventional software-based... more
This paper presents an efficient routing and flow control mechanism to implement multidestination message passing in wormhole networks.It is targeted to situations where the size of message data is very small, like in invalidation and... more
The explosion in workload complexity and the recent slow-down in Moore's law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software... more
In distributed shared memory multiprocessors, remote memory accesses generate processor-tomemory traffic which may result in a bottleneck. It is therefore important to design algorithms that minimize the number of remote memory accesses.... more
Memory access latency is the primary performance bottleneck in modern computer systems. Prefetching data before it is needed by a processing core allows substantial performance gains by overlapping significant portions of memory latency... more
DASH is a scalable shared-memory multiprocessor currently being developed at Stanford's Computer Systems Laboratory. The architecture consists of powerful processing nodes, each with a portion of the shared-memory, connected to a... more