2006, ArXiv
A theoretical memory with limited processing power and internal connectivity at each element is proposed. This memory carries out parallel processing within itself to solve generic array problems. The applicability of this in-memory, finest-grain, massive SIMD approach is studied in some detail. For an array of N items, it reduces the total instruction cycle count of universal operations such as insertion/deletion and match finding to ~1, of local operations such as filtering and template matching to ~ the local operation size, and of global operations such as summation, finding a global extremum, and sorting to ~√N instruction cycles. It eliminates most streaming activity for data-processing purposes on the system bus. Yet it remains general-purpose, easy to use, pin-compatible with conventional memory, and practical to implement. Keywords: SIMD processors; parallel processors; memory structures; performance evaluation of algorithms and systems
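The ~√N bound for global operations is the classic 2-D mesh argument: place the N elements on a √N × √N grid of elements and reduce along rows, then along one column. A minimal simulation of that counting argument follows; the layout and step discipline are illustrative assumptions, not the paper's design.

```python
import math

def mesh_global_sum(values):
    """Sum N values laid out on a sqrt(N) x sqrt(N) grid of elements.

    Phase 1: each row folds toward column 0 (sqrt(N)-1 parallel steps).
    Phase 2: column 0 folds toward cell (0, 0) (sqrt(N)-1 parallel steps).
    Returns (total, number_of_parallel_steps), with step count ~ 2*sqrt(N).
    """
    n = math.isqrt(len(values))
    assert n * n == len(values), "sketch assumes a perfect-square N"
    grid = [list(values[r * n:(r + 1) * n]) for r in range(n)]
    steps = 0
    for k in range(n - 1):          # row reduction: one hop per parallel step
        steps += 1
        for row in grid:
            row[n - 2 - k] += row[n - 1 - k]
    for k in range(n - 1):          # column reduction along column 0
        steps += 1
        grid[n - 2 - k][0] += grid[n - 1 - k][0]
    return grid[0][0], steps
```

For N = 16 this takes 6 parallel steps, i.e. 2(√N − 1), matching the ~√N scaling claimed for sum, global extrema, and similar reductions.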
2012
Parallel processors are computers that carry out multiple tasks in parallel. Flynn classified computer architectures on the basis of the multiplicity of instruction and data streams. According to this classification, Single Instruction Multiple Data (SIMD) processors are those that execute the same instruction on different data sets. This paper describes the architecture of the MasPar MP-1, which belongs to the category of SIMD array processors. The paper further describes the features of this massively parallel processor along with its major applications. Keywords—Flynn's classification, SIMD, MasPar MP-1, XNet, processing element array
A new concept of a parallel architecture for image processing is presented in this paper. Named LAPMAM (Linear Array Processors with Multi-mode Access Memory), this architecture has, for a 512 x 512 image, 512 processors and four memory planes, each of 512² memory modules. One important characteristic of this architecture is its memory modules, which allow different access modes: RAM, FIFO, normal CAM, and interactive CAM. This particular memory and a linear structure of RISC processors are combined with a tree interconnection network to obtain a very efficient 1-D architecture suitable for real-time image processing. The processor array works in SIMD mode and allows a restricted MIMD mode without requiring great hardware complexity. A hardware simulation of an architecture prototype has been carried out to test its performance on low- and intermediate-level vision tasks. The performance of the LAPMAM is compared with that of different architectures.
IEEE Journal of Solid-State Circuits, 1990
A Four-Processor Building Block for SIMD Processor Arrays. Abstract—A four-processor chip, for use in processor arrays for image computations, is described. The large degree of data parallelism available in image computations allows dense array implementations where all processors operate under the control of a single instruction stream. An instruction decoder shared by the four processors on the chip minimizes the pin count allocated for global control of the processors. The chip incorporates an interface to external SRAM for memory expansion without glue chips. The full-custom 2-µm CMOS chip contains 56,669 transistors and runs instructions at 10 MHz. Five hundred twelve 16-b processors and 4 megabytes of distributed external memory fit on two industry-standard cards to yield 5 billion instructions per second of peak throughput. As image I/O can overlap perfectly with pixel computation, an array containing 128 of these chips can provide more than 600 16-b operations per pixel on 512 x 512 images at 30 Hz.
Pattern Recognition Letters, 1991
ABSTRACT We investigate the use of indirect addressing in processor arrays as a way to improve the processing of recursive neighbourhood (i.e., data-dependent) operations. The efficiency and speed for processing six such operations is measured for both window and crinkle mapping, and evaluated against the efficiency of traditional updating methods.
Journal of VLSI Signal Processing, 1991
In this paper we examine the usefulness of a simple memory array architecture to several image processing tasks. This architecture, called the Access Constrained Memory Array Architecture (ACMAA) has a linear array of processors which concurrently access distinct rows or columns of an array of memory modules. We have developed several parallel image processing algorithms for this architecture. All the algorithms presented in this paper achieve a linear speed-up over the corresponding fast sequential algorithms. This was made possible by exploiting the efficient local as well as global communication capabilities of the ACMAA.
The performance of SIMD processors is often limited by the time it takes to transfer data between the centralized control unit and the parallel processor array. This is especially true of hybrid SIMD models, such as associative computing, that make extensive use of global search operations. Pipelining instruction broadcast can help, but is not enough to solve the problem, especially for massively parallel processors with thousands of processing elements. In this paper, we describe a SIMD processor architecture that combines a fully pipelined broadcast/reduction network with hardware multithreading to reduce performance degradation as the number of processors is scaled up.
Proceedings of the twelfth annual ACM symposium on Parallel algorithms and architectures - SPAA '00, 2000
This paper presents mathematical foundations for the design of a memory controller subcomponent that helps to bridge the processor/memory performance gap for applications with strided access patterns. The Parallel Vector Access (PVA) unit exploits the regularity of vectors or streams to access them efficiently in parallel on a multi-bank SDRAM memory system. The PVA unit performs scatter/gather operations so that only the elements accessed by the application are transmitted across the system bus. Vector operations are broadcast in parallel to all memory banks, each of which implements an efficient algorithm to determine which vector elements it holds. Earlier performance evaluations have demonstrated that our PVA implementation loads elements up to 32.8 times faster than a conventional memory system and 3.3 times faster than a pipelined vector unit, without hurting the performance of normal cache-line fills. Here we present the underlying PVA algorithms for both word interleaved and cache-line interleaved memory systems.
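The per-bank membership computation the abstract alludes to — each bank deciding which vector elements it holds — reduces, for a word-interleaved system, to solving a linear congruence. The sketch below illustrates that reduction under assumed naming; it is not the paper's actual hardware algorithm.

```python
from math import gcd

def _solve_congruence(a, b, m):
    """Smallest i >= 0 with a*i ≡ b (mod m), plus the solution period,
    or (None, None) if no solution exists."""
    g = gcd(a, m)
    if b % g:
        return None, None
    a, b, m = a // g, b // g, m // g
    if m == 1:
        return 0, 1                      # every index is a solution
    return (b * pow(a, -1, m)) % m, m    # pow(a, -1, m): modular inverse

def bank_elements(base, stride, length, bank, num_banks):
    """Indices i in [0, length) of vector elements whose word address
    base + i*stride falls in `bank` (bank = address mod num_banks)."""
    i0, period = _solve_congruence(stride % num_banks,
                                   (bank - base) % num_banks, num_banks)
    return [] if i0 is None else list(range(i0, length, period))
```

For example, with 4 banks and stride 3 from base 0, bank 0 holds elements 0, 4, 8, …: each bank can enumerate its own elements without scanning the whole vector, which is what makes broadcasting the vector operation to all banks in parallel pay off.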
Proceedings 20th IEEE International Parallel & Distributed Processing Symposium, 2006
Parallel programming is facilitated by constructs which, unlike the widely used SPMD paradigm, provide programmers with a global view of the code and data structures. These constructs can be compiler directives containing information about data and task distribution, language extensions specifically designed for parallel computation, or classes that encapsulate parallelism. In this paper, we describe a class developed at Illinois and its MATLAB implementation. This class can be used to conveniently express both parallelism and locality. A C++ implementation is now underway; its characteristics will be reported in a future paper. We have implemented most of the NAS benchmarks using our HTA MATLAB extensions and found that HTAs enable fast prototyping of parallel algorithms and produce programs that are easy to understand and maintain.
2018
Single-Instruction-Multiple-Data (SIMD) architectures are widely used to accelerate applications involving Data-Level Parallelism (DLP); the on-chip memory system facilitates communication between the Processing Elements (PEs) and the on-chip vector memory. The inefficiency of the on-chip memory system is often a computational bottleneck. In this paper, we describe the design and implementation of an efficient vector data memory system. The proposed memory system consists of two novel parts: an access-pattern-aware memory controller and an automatic loading mechanism. The memory controller reduces data reorganization overheads. The automatic loading mechanism loads data automatically according to the access patterns, without load instructions; this eliminates the overhead of instruction fetching and decoding. The proposed design is implemented and synthesized with Cadence tools. Experimental results demonstrate that our design improves the performance of 8 application kernels by 44% a...
International Journal of Parallel Programming
Malleable applications may run with varying numbers of threads, and thus on varying numbers of cores, while the precise number of threads is irrelevant for the program logic. Malleability is a common property in data-parallel array processing. With ever growing core counts we are increasingly faced with the problem of how to choose the best number of threads. We propose a compiler-directed, almost automatic tuning approach for the functional array processing language SaC. Our approach consists of an offline training phase during which compiler-instrumented application code systematically explores the design space and accumulates a persistent database of profiling data. When generating production code our compiler consults this database and augments each data-parallel operation with a recommendation table. Based on these recommendation tables the runtime system chooses the number of threads individually for each data-parallel operation. With energy/power efficiency becoming an ever greater concern, we explicitly distinguish between two application scenarios: aiming at best possible performance or aiming at a beneficial trade-off between performance and resource investment.
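The recommendation-table mechanism can be pictured as a small lookup the runtime consults before launching each data-parallel operation. The keying by problem size and the concrete numbers below are illustrative assumptions, not SaC's actual table format.

```python
import bisect

class RecommendationTable:
    """Per-operation table produced by offline training: for each
    problem-size bucket, the thread count profiling found best."""

    def __init__(self, entries):
        # entries: ascending (min_elements, recommended_threads) pairs
        self.sizes = [s for s, _ in entries]
        self.threads = [t for _, t in entries]

    def threads_for(self, num_elements):
        """Pick the entry for the largest bucket not exceeding the size."""
        idx = bisect.bisect_right(self.sizes, num_elements) - 1
        return self.threads[max(idx, 0)]

# Hypothetical table for one data-parallel operation: small arrays run
# sequentially, large ones fan out across more threads.
table = RecommendationTable([(0, 1), (10_000, 4), (1_000_000, 16)])
```

A performance-oriented table would simply pick the fastest measured thread count per bucket; a resource-aware table could instead store the smallest thread count within some tolerance of the fastest, reflecting the performance/investment trade-off the abstract describes.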
Proceedings., Second Annual IEEE ASIC Seminar and Exhibit
An Application-Specific Array Processor (ASAP) is a high-speed, application-driven, massively parallel, modular, and programmable computing system. The ever-increasing super-high-speed requirements (in giga/tera FLOPS) of modern engineering applications suggest that mainframe scientific computers will not be adequate for many real-time signal/image processing and scientific computing applications. Therefore, the new trend in real-time computing systems points to special-purpose parallel processors, whose architecture is dictated by the very rich underlying algorithmic structures and is therefore optimized for high-speed processing of large arrays of data. It is also recognized that a fast-turnaround design environment will be in great demand for such parallel processing systems. This has become more realistic and more compelling with increasingly mature VLSI and CAD technology. Therefore, a major advance in the state of the art is expected in the next decade or so. The tutorial will discuss how to effectively design an application-specific parallel processing system, leading to a fast-turnaround design methodology.
BIT, 1987
We present a new model of parallel computation called the "array processing machine", or APM for short. The APM was designed to closely model the architecture of existing vector and array processors, and to provide a suitable unifying framework for the complexity theory of parallel combinatorial and numerical algorithms. It is shown that every problem that is solvable in polynomial space on an ordinary, sequential random access machine can be solved in parallel polynomial time on an APM (and vice versa). The relationship to other models of parallel computation is discussed.
2018
Index structures designed for disk-based database systems do not fulfill the requirements of modern database systems. To improve the performance of these index structures, several authors have presented different approaches, including horizontal vectorization with SIMD and efficient cache-line usage.
Parallel Computing, 1989
Array-based computation is a style of programming in which most of the work is done by manipulating arrays as units, with operations that are conceptually parallel. Numerous problems lend themselves to this style of programming. Array-based computation has long held out the promise of highly parallel computation, but only with the advent of massively parallel architectures can this promise be fulfilled.
Proceedings of the International Symposium on Memory Systems, 2019
Recently, Processing In-Memory (PIM) has been shown to be a promising solution to the data movement issue in current processors. However, today's PIM technologies are mostly analog-based, which raises both scalability and efficiency issues. In this paper, we propose a novel digital-based PIM which accelerates fundamental operations and diverse data-analytic procedures using processing-in-memory technology. Instead of sending a large amount of data to the processing cores for computation, our design performs a large part of the computation inside the memory; thus application performance can be accelerated significantly by avoiding the memory-access bottleneck. Digital-based PIM supports bit-wise operations between two selected bit-lines of the memory block and then extends them to support row-parallel arithmetic operations. CCS CONCEPTS • Computer systems organization → Architectures; • Hardware → Emerging technologies.
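The step from bit-line-level bitwise primitives to row-parallel arithmetic can be mimicked in software: model the memory block as rows of bits, make the only primitive a bitwise operation between two selected columns applied to all rows at once, and build addition from it one bit position at a time. The column layout and primitive set below are illustrative assumptions, not the proposed circuit.

```python
def bitline_op(mem, col_a, col_b, col_out, op):
    """One in-memory primitive: apply a bitwise op between bit-lines
    (columns) col_a and col_b for every row in parallel."""
    for row in mem:
        row[col_out] = op(row[col_a], row[col_b])

def row_parallel_add(mem, a_cols, b_cols, out_cols, carry_col):
    """Add the n-bit operands stored in a_cols and b_cols (LSB first) in
    every row, using only column-wise bitwise steps; the sum lands in
    out_cols, with the final carry left in carry_col."""
    for row in mem:
        row[carry_col] = 0
    for a, b, out in zip(a_cols, b_cols, out_cols):
        for row in mem:  # one bit position per pass, all rows in parallel
            s = row[a] ^ row[b] ^ row[carry_col]
            row[carry_col] = (row[a] & row[b]) | \
                             (row[carry_col] & (row[a] ^ row[b]))
            row[out] = s
```

Each pass over a bit position corresponds to a fixed number of in-memory steps applied to every row simultaneously, so an addition over millions of rows costs only as many passes as the operand width — the essence of the row-parallel claim.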
The Journal of Supercomputing, 1990
It has been observed by many researchers that systolic arrays are very suitable for certain high-speed computations. Using a formal methodology, we present a design for a single simple programmable linear systolic array capable of solving large numbers of problems drawn from a variety of applications. The methodology is applicable to problems solvable by sequential algorithms that can be specified as nested for-loops of arbitrary depth. The algorithms of this form that can be computed on the array presented in this paper include 25 algorithms dealing with signal and image processing, algebraic computations, matrix arithmetic, pattern matching, database operations, sorting, and transitive closure. Assuming bounded I/O, for 18 of those algorithms the time and storage complexities are optimal, and therefore no improvement can be expected from dedicated special-purpose linear systolic arrays designed for individual algorithms. We also describe another design which, using a sufficiently large local memory and allowing data to be preloaded and unloaded, has an optimal processor/time product.
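For a flavor of the signal-processing kernels such linear arrays target, here is a software model of a weight-stationary systolic array computing sliding dot products (1-D correlation): each cell latches one weight, the input is broadcast each beat, and partial sums shift one cell per beat. This cell discipline is one standard design, not the paper's formal derivation.

```python
def systolic_sliding_dot(weights, xs):
    """Sliding dot products out[s] = sum_i weights[i] * xs[s + i],
    computed on a k-cell weight-stationary linear systolic array."""
    k = len(weights)
    acc = [0] * k          # partial sum latched in each cell
    out = []
    for t, x in enumerate(xs):
        # one synchronous beat: partial sums shift right by one cell,
        # and each cell adds its weight times the broadcast input
        new = [0] * k
        for i in range(k):
            prev = acc[i - 1] if i > 0 else 0
            new[i] = prev + weights[i] * x
        acc = new
        if t >= k - 1:                 # pipeline is full: last cell
            out.append(acc[-1])        # emits one completed result
    return out
```

After a fill latency of k − 1 beats, one finished result drains from the last cell every beat — the hallmark systolic property that makes such arrays attractive for bounded-I/O, high-throughput computation.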
We provide the first optimal algorithms in terms of the number of input/outputs (I/Os) required between internal memory and multiple secondary storage devices for the problems of sorting, FFT, matrix transposition, standard matrix multiplication, and related problems. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the P secondary storage devices can simultaneously transfer a contiguous block of B records. The model pertains to a large-scale uniprocessor system or parallel multiprocessor system with P disks. In addition, the sorting, FFT, permutation network, and standard matrix multiplication algorithms are typically optimal in terms of the amount of internal processing time. The difficulty in developing optimal algorithms is to cope with the partitioning of memory into P separate physical devices. Our algorithms' performances can be significantly better than those obtained by the well-known but nonoptimal technique of disk striping. Our optimal sorting algorithm is randomized, but practical; the probability of using more than l times the optimal number of I/Os is exponentially small in l (log l) log(M/B), where M is the internal memory size.
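For orientation, the closed-form bounds usually quoted in this line of work can be compared numerically. The functions below state those bounds up to constant factors as a reader's aid, under the assumption that they match the paper's exact statements only asymptotically.

```python
import math

def sort_ios_optimal(N, M, B, P):
    """Commonly quoted parallel-disk sorting bound,
    Theta((N / (P*B)) * log(N/B) / log(M/B)) I/Os, constants omitted."""
    return (N / (P * B)) * math.log(N / B) / math.log(M / B)

def sort_ios_striping(N, M, B, P):
    """Disk striping treats the P disks as one disk with logical block
    size P*B, shrinking the merge fan-out to M/(P*B) and hence
    inflating the I/O count relative to the optimal bound."""
    return (N / (P * B)) * math.log(N / (P * B)) / math.log(M / (P * B))
```

With, say, N = 2³⁰ records, M = 2²⁰ internal memory, B = 2¹⁰ record blocks, and P = 16 disks, the striped count exceeds the optimal one, illustrating the abstract's point that striping is well known but nonoptimal.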
2021 29th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 2021
Array Database Management Systems (array databases) support query processing over multi-dimensional data. Data storage is implemented with non-linear structures to mitigate the shortcomings of the relational model when dealing with raw binary data such as images, time series, and others. Due to the data-hungry nature of multi-dimensional data applications, array databases should ideally provide a linear speedup on a multi-processing system. On Non-Uniform Memory Access (NUMA) machines, array databases may require massive data movement across the nodes, resulting in a severe performance impact depending on the user operation. In this paper, we analyze the performance impact of the NUMA architecture on the SAVIME and SciDB array databases running five different well-known static thread-pinning strategies. Our experiments showed a maximum speedup from these strategies of 2.49x for SAVIME and up to 1.40x for SciDB. We also observed that these static strategies yield only 48% of the potential speedup (and 26% of the energy reduction), opening a new research topic.
Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021