Citeseer
Modern processors have a small on-chip local memory for instructions. Usually it takes the form of a cache, but in some cases it is an addressable memory. In the latter case, the user is required to partition and arrange the code so that appropriate fragments are loaded into the memory at appropriate times. We explore automatic partitioning by defining an optimality criterion and provide a lazy algorithm which tries to combine procedures that should be loaded together. Procedures which do not fit into local memory are further partitioned. The lazy nature of the algorithm facilitates using multiple heuristics to identify good partitions. Our partitioner can provide much-needed relief to the programmer and could be an important tool in the design-space exploration of embedded processor architectures, to study the possibility of replacing expensive cache memory with relatively inexpensive and larger RAM.
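The greedy merging idea this abstract describes could be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual algorithm: the memory size, the call-graph format, and the merge criterion (hottest call edges first) are all assumptions.

```python
# Hypothetical sketch of a lazy procedure-partitioning pass: merge procedures
# that call each other frequently into the same overlay, as long as the
# combined size still fits in the on-chip instruction memory.

LOCAL_MEM_SIZE = 4096  # bytes of on-chip instruction memory (assumed)

def partition(sizes, call_freq):
    """sizes: {proc: bytes}; call_freq: {(a, b): call count between a and b}.
    Returns a list of partitions (sets of procedures), each fitting in memory."""
    parts = {p: {p} for p in sizes}          # start: one partition per procedure
    part_size = dict(sizes)
    # Consider the hottest call edges first - merging them avoids reloads.
    for (a, b), _freq in sorted(call_freq.items(), key=lambda kv: -kv[1]):
        ra = next(r for r, s in parts.items() if a in s)
        rb = next(r for r, s in parts.items() if b in s)
        if ra != rb and part_size[ra] + part_size[rb] <= LOCAL_MEM_SIZE:
            parts[ra] |= parts.pop(rb)       # merge only when it pays and fits
            part_size[ra] += part_size.pop(rb)
    return list(parts.values())

sizes = {"main": 1500, "fft": 2000, "log": 800, "init": 3000}
freq = {("main", "fft"): 900, ("fft", "log"): 50, ("main", "init"): 1}
print(partition(sizes, freq))
```

Here the hot `main`/`fft` pair is merged into one overlay, while `log` stays separate because adding it would overflow the assumed 4 KB memory.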
Proceedings of the 6th ACM conference on Computing frontiers - CF '09, 2009
There is a trend towards using accelerators to increase performance and energy efficiency of general-purpose processors. Adoption of accelerators, however, depends on the availability of tools to facilitate programming these devices.
… of the 3rd workshop on Memory …, 2004
In order to meet the requirements concerning both performance and energy consumption in embedded systems, new memory architectures are being introduced. Beside the well-known use of caches in the memory hierarchy, processor cores today also ...
This paper presents our recent research efforts addressing the dynamic mapping of sections of execution to a coarse-grained reconfigurable array (CGRA) coupled to a general-purpose processor (GPP). We consider the common scenario of a GPP (a RISC processor) using the CGRA as a co-processor to speed up applications. We present a partitioning scheme based on large traces of instructions (named the Megablock) and show estimates of the speedups achieved by considering the Megablock.
Arabian Journal for …, 2007
System-level design decisions such as HW/SW partitioning, target architecture selection and scheduler selection are some of the main concerns of current complex system-on-chip (SOC) designs. In this paper, a novel window-based heuristic is proposed that addresses the issue of design space exploration in applications that have a data flow characteristic. The objective in this paper is to partition the application into HW and SW components such that the execution time of the application is minimized while simultaneously satisfying the hard area constraints of the HW units. In this algorithm, the search space is divided into smaller intervals, referred to as windows. For each window the full search is performed to find the optimum partitioning and scheduling solution for that specific window. Moreover, in this paper a novel indexing mechanism is presented for identifying the nodes in the task graph. The proposed index specifies not only the relation of each node with respect to the other nodes in the graph, but also its position in the task graph. With the help of the proposed windowing and indexing techniques, the time required for partitioning is reduced significantly. Simulation results indicate that the proposed algorithm improves the search time by 74% compared to conventional optimization heuristics namely Genetic Algorithm (GA), Simulated Annealing (SA) and Tabu Search (TS), while providing comparable results in terms of the overall execution time of the partitioned system.
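The core window idea in this abstract — exhaustive search inside small windows instead of over the whole task set — could be sketched as below. This is a toy stand-in under stated assumptions: the task cost triples, the additive time model, and the greedy per-window area accounting are all illustrative, and the paper's DAG indexing mechanism is not modeled.

```python
from itertools import product

# Toy sketch of window-based HW/SW partitioning: split the task sequence into
# small windows and exhaustively try every HW/SW assignment inside each window,
# keeping the fastest one that respects the remaining hardware area budget.

def window_partition(tasks, area_budget, window=3):
    """tasks: list of (sw_time, hw_time, hw_area) tuples.
    Returns a list of 'HW'/'SW' labels chosen window by window."""
    assignment, area_used = [], 0
    for i in range(0, len(tasks), window):
        chunk = tasks[i:i + window]
        best = None
        for choice in product(("SW", "HW"), repeat=len(chunk)):
            time = sum(t[0] if c == "SW" else t[1] for t, c in zip(chunk, choice))
            area = sum(t[2] for t, c in zip(chunk, choice) if c == "HW")
            if area_used + area <= area_budget and (best is None or time < best[0]):
                best = (time, area, choice)
        assignment += list(best[2])
        area_used += best[1]
    return assignment

tasks = [(10, 2, 5), (8, 3, 4), (6, 6, 1), (9, 1, 7)]
print(window_partition(tasks, area_budget=10))
```

The full search per window is exponential only in the window length, which is the point of the technique: search cost grows linearly in the number of windows.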
2015
This paper presents a comparative study of two hardware/software partitioning algorithms which aim to minimize the logic area of a System on a Programmable Chip (SOPC) while respecting a time constraint. The first algorithm is based on the genetic algorithm (GA); the second is our proposed algorithm, which combines the principle of binary search trees (BST) with genetic algorithms. Both algorithms define which tasks will run on the hardware (HW) part and which will run on the software (SW) part. They seek the efficient hardware/software partition that minimizes the number of tasks assigned to HW and increases the number of tasks assigned to SW, in order to balance the design parameters and achieve a better trade-off between the logic area of the application and its execution time.
1995
Design of embedded systems has brought the discipline of hardware/software codesign into focus. A major task of such codesign activity is partitioning the functions into hardware and software implementation sets. In this paper, we propose an algorithm which performs such partitioning and also allocates the functions to modules. The task has been formulated as a consistent labeling problem. To deal with the combinatorial nature of the problem, a number of heuristics have been proposed and their relative performances have been evaluated experimentally. The algorithm has been applied to solve several design problems.
1994
We present a fully automatic approach to hardware/software partitioning and memory allocation by applying compiler techniques to the hardware/software partitioning problem and linking a compiler to a behavioural VHDL generator and high level hardware synthesis tools. Our approach is based on a hierarchical candidate preselection technique and allows (a) efficient collection of profiling data, (b) fast partitioning, and (c) high complexity of the hardware partition.
Proceedings of the international conference on Compilers, architecture, and synthesis for embedded systems - CASES '01, 2001
In an embedded system, it is common to have several memory areas with different properties, such as access time and size. An access to a specific memory area is usually restricted to certain native pointer types. Different pointer types vary in size and cost. For example, it is typically cheaper to use an 8-bit pointer than a 16-bit pointer. The problem is to allocate data and select pointer types in the most effective way. Frequently accessed variables should be allocated in fast memory, and frequently used pointers and pointer expressions should be assigned cheap pointer types. Common practice is to perform this task manually. We present a model for storage allocation that is capable of describing architectures with irregular memory organization and with several native pointer types. This model is used in an integer linear programming (ILP) formulation of the problem. An ILP solver is applied to get an optimal solution under the model. We describe allocation of global variables and local variables with static storage duration. A whole program optimizing C compiler prototype was used to implement the allocator. Experiments were performed on the Atmel AVR 8-bit microcontroller [2] using small to medium sized C programs. The results varied with the benchmarks, with up to 8% improvement in execution speed and 10% reduction in code size.
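The allocation decision this abstract formulates as an ILP can be illustrated with a brute-force stand-in over a handful of variables — place each one in fast or slow memory so total access cost is minimal and the fast memory's size limit is respected. The sizes, per-access costs, and the single fast/slow split are assumptions for illustration; the paper's actual model also covers pointer-type selection, which is omitted here.

```python
from itertools import product

# Brute-force stand-in for the ILP storage-allocation formulation: choose a
# memory area for each variable, minimizing access cost under a size limit.

FAST_SIZE = 16                 # bytes of fast memory (assumed)
FAST_COST, SLOW_COST = 1, 4    # cycles per access (assumed)

def allocate(variables):
    """variables: {name: (size_bytes, access_count)} -> {name: 'fast'|'slow'}"""
    names = list(variables)
    best = None
    for choice in product(("fast", "slow"), repeat=len(names)):
        size = sum(variables[n][0] for n, c in zip(names, choice) if c == "fast")
        if size > FAST_SIZE:
            continue                      # violates the fast-memory capacity
        cost = sum(variables[n][1] * (FAST_COST if c == "fast" else SLOW_COST)
                   for n, c in zip(names, choice))
        if best is None or cost < best[0]:
            best = (cost, dict(zip(names, choice)))
    return best[1]

vars_ = {"counter": (2, 1000), "buf": (32, 50), "flags": (1, 400)}
print(allocate(vars_))
```

An ILP solver reaches the same optimum without enumerating all 2^n assignments, which is what makes the approach practical for whole programs.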
2006 International Conference on Microelectronics, 2006
In this paper, a novel level-based hardware/software partitioning heuristic is proposed. The algorithm operates on functional blocks of designs represented as directed acyclic graphs (DAGs), with the objective of minimizing processing time under various hardware area constraints. Most existing methods overlook the communication overhead during the partitioning decision, in particular the fact that vertices mapped onto the same computing unit communicate less; the proposed algorithm takes this into account during partitioning.
2000
Hardware and software co-design is a design technique which delivers computer systems comprising hardware and software components. A critical phase of the co-design process is to decompose a program into hardware and software. This paper proposes an algebraic partitioning method whose correctness is verified in the algebra of programs. We introduce a program analysis phase before program partitioning and develop a collection of syntax-based splitting rules.
Memory management in computer operating systems is the process of subdividing main memory to accommodate as many processes as possible. Placement algorithms determine which of the available slots in the partitioned memory block a process is allocated to. The common placement algorithms are first fit, best fit, next fit and worst fit. Memory slots allocated to processes might be too big under the existing placement algorithms, losing a lot of space to internal fragmentation. Considering the slot next to the allocated one would reduce internal fragmentation whenever that next slot is smaller than the initial slot but still large enough to accommodate the process. In this paper, we give a deeper understanding of how memory is dynamically allocated in operating systems using the existing algorithms. Usually, when interacting with the computer, one does not realize how busy the operating system is doing resource allocation. Through simulation in MATLAB, this paper introduces a placement algorithm which reduces memory loss due to internal fragmentation. The simulation results show that the proposed allocation strategy minimizes the amount of memory lost. Reduced fragmentation leads to better memory utilization, a lower task rejection ratio and faster execution of tasks. Keywords: operating system, memory management, placement algorithm, internal fragmentation, first fit, best fit, next fit, worst fit.
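The four classic placement algorithms named in this abstract can be sketched over a simple list of free-slot sizes. The flat-list representation is an illustrative simplification of a real free-slot data structure; the proposed next-slot refinement from the paper is not reproduced here.

```python
# First fit, best fit, worst fit, and next fit over a list of free-slot sizes.
# Each function returns the index of the chosen slot, or None if nothing fits.

def first_fit(slots, size):
    """First slot from the start that is large enough."""
    return next((i for i, s in enumerate(slots) if s >= size), None)

def best_fit(slots, size):
    """Smallest slot that is still large enough (least leftover space)."""
    fits = [(s, i) for i, s in enumerate(slots) if s >= size]
    return min(fits)[1] if fits else None

def worst_fit(slots, size):
    """Largest slot, leaving the biggest usable remainder."""
    fits = [(s, i) for i, s in enumerate(slots) if s >= size]
    return max(fits)[1] if fits else None

def next_fit(slots, size, start):
    """Like first fit, but resumes scanning after the last allocation."""
    n = len(slots)
    for k in range(n):
        i = (start + k) % n
        if slots[i] >= size:
            return i
    return None

slots = [100, 500, 200, 300, 600]
print(first_fit(slots, 212), best_fit(slots, 212), worst_fit(slots, 212))
```

For a 212-byte request, first fit picks the 500-byte slot, best fit the 300-byte slot (least internal fragmentation), and worst fit the 600-byte slot.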
International Journal of Parallel …, 2001
Studies show that delays introduced in the issue and bypass logic will become critical for wide-issue superscalar processors. One of the proposed solutions is clustering the processor core. Clustered architectures benefit from a less complex, partitioned processor core and thus incur fewer critical delays. In this paper, we propose a dynamic instruction steering logic for these clustered architectures that decides at decode time the cluster where each instruction is executed. The performance of clustered architectures depends on the intercluster communication overhead and the workload balance. We present a scheme that uses runtime information to optimize the trade-off between these figures. The evaluation shows that this scheme can achieve an average speed-up of 35% over a conventional 8-way issue (4 int + 4 fp) machine and that it outperforms other previous proposals, either static or dynamic.
IEEE Computer Architecture Letters, 2019
Instruction cache misses are a significant source of performance degradation in server workloads because of their large instruction footprints and complex control flow. Due to the importance of reducing the number of instruction cache misses, there has been a myriad of proposals for hardware instruction prefetchers in the past two decades. While effectual, state-of-the-art hardware instruction prefetchers either impose considerable storage overhead or require significant changes in the frontend of a processor. Unlike hardware instruction prefetchers, code-layout optimization techniques profile a program and then reorder the code layout of the program to increase spatial locality, and hence, reduce the number of instruction cache misses. While an active area of research in the 1990s, code-layout optimization techniques have largely been neglected in the past decade. We evaluate the suitability of code-layout optimization techniques for modern server workloads and show that if we combine these techniques with a simple next-line prefetcher, they can significantly reduce the number of instruction cache misses. Moreover, we propose a new code-layout optimization algorithm and show that along with a next-line prefetcher, it offers the same performance improvement as the state-of-the-art hardware instruction prefetcher, but with almost no hardware overhead.
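Profile-guided code layout of the kind this abstract revisits can be sketched in the spirit of the classic Pettis-Hansen approach — not the paper's new algorithm, which the abstract does not detail. The profile format and the simple append-only chain merging are illustrative assumptions.

```python
# Greedy chain merging for code layout: walk the caller-callee edges from
# hottest to coldest and concatenate the chains containing the two endpoints,
# so frequently interacting functions end up adjacent in the final layout.

def layout(funcs, edge_profile):
    """funcs: list of function names; edge_profile: {(caller, callee): count}.
    Returns one ordered list placing hot neighbours next to each other."""
    chains = {f: [f] for f in funcs}    # each function starts as its own chain
    head = {f: f for f in funcs}        # maps a function to its chain's key
    for (a, b), _count in sorted(edge_profile.items(), key=lambda kv: -kv[1]):
        ca, cb = head[a], head[b]
        if ca == cb:
            continue                    # already laid out together
        chains[ca] += chains.pop(cb)    # append callee's chain to caller's
        for f in chains[ca]:
            head[f] = ca
    # Concatenate the surviving chains; order among cold chains is arbitrary.
    return [f for chain in chains.values() for f in chain]

profile = {("a", "b"): 1000, ("b", "c"): 900, ("d", "e"): 10}
print(layout(["a", "b", "c", "d", "e"], profile))
```

With the hot `a→b→c` path contiguous in memory, a simple next-line prefetcher naturally covers the transitions between those functions, which is the combination the abstract evaluates.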
2003 IEEE Workshop on Signal Processing Systems (IEEE Cat. No.03TH8682), 2003
Real-time signal, image, and control applications have very tight time constraints, requiring the use of several powerful numerical calculation units. The aim of our work is to develop a fast and automatic prototyping process dedicated to parallel architectures made of a PC and several last-generation Texas Instruments digital signal processors: the TMS320C6X DSP. The process is based on SynDEx, a CAD tool that improves algorithm implementation on multiprocessor architectures by finding the best match between an algorithm and an architecture. SynDEx kernels for automatic PC- and DSP-dedicated code generation have been developed with the new SynDEx repetition feature. A full coding application (LAR) illustrates the results.
IEEE Transactions on Parallel and Distributed Systems, 1992
An important problem facing numerous research projects on parallelizing compilers for distributed memory machines is that of automatically determining a suitable data partitioning scheme for a program. Most of the current projects leave this tedious problem almost entirely to the user. In this paper, we present a novel approach to the problem of automatic data partitioning. We introduce the notion of constraints on data distribution, and show how, based on performance considerations, a compiler identifies constraints to be imposed on the distribution of various data structures. These constraints are then combined by the compiler to obtain a complete and consistent picture of the data distribution scheme, one that offers good performance in terms of the overall execution time. We present results of a study we performed on Fortran programs taken from the Linpack and Eispack libraries and the Perfect Benchmarks to determine the applicability of our approach to real programs. The results are very encouraging, and demonstrate the feasibility of automatic data partitioning for programs with regular computations that may be statically analyzed, which covers an extremely significant class of scientific application programs.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2004
Optimizations aimed at improving the efficiency of on-chip memories in embedded systems are extremely important. Using a suitable combination of program transformations and memory design space exploration aimed at enhancing data locality enables significant reductions in effective memory access latencies. While numerous compiler optimizations have been proposed to improve cache performance, there are relatively few techniques that focus on software-managed on-chip memories. It is well-known that software-managed memories are important in real-time embedded environments with hard deadlines as they allow one to accurately predict the amount of time a given code segment will take. In this paper, we propose and evaluate a compiler-controlled dynamic on-chip scratch-pad memory (SPM) management framework. Our framework includes an optimization suite that uses loop and data transformations, an on-chip memory partitioning step, and a code-rewriting phase that collectively transform an input code automatically to take advantage of the on-chip SPM. Compared with previous work, the proposed scheme is dynamic, and allows the contents of the SPM to change during the course of execution, depending on the changes in the data access pattern. Experimental results from our implementation using a source-to-source translator and a generic cost model indicate significant reductions in data transfer activity between the SPM and off-chip memory.
The processor architecture available on the K computer (SPARC64 VIIIfx) features a hardware cache partitioning mechanism called the sector cache. This facility enables software to split the memory cache into two independent sectors: data loads in one sector cannot trigger the eviction of data in the second one. Moreover, software is responsible for data placement in each sector, issuing special instructions that tag the various memory loads performed during execution. The implementation details of this cache partitioning mechanism also enable fast redistribution of the cache during an application's runtime, without any cost, allowing any optimization using the sector cache to be applied multiple times, with different setups, in the event of phase changes.
2005
In this paper, we propose a hardware/software partitioning method for improving application performance in embedded systems. Critical software parts are accelerated in hardware on a single-chip generic system composed of an embedded processor and coarse-grain reconfigurable hardware. The reconfigurable hardware is realized by a 2-dimensional array of processing elements. The partitioning flow uses an analysis procedure at the basic-block level to detect kernels in software. A list-based mapping algorithm has been developed for estimating the execution cycles of kernels on coarse-grain reconfigurable arrays. The proposed partitioning flow has been largely automated for program descriptions in the C language. Extensive hardware/software experiments on five real-life applications are presented. They show that the benchmarks spend, on average, 69% of their instruction count in just 11% of their code, corresponding to the kernels. The results illustrate that by mapping critical code onto coarse-grain reconfigurable hardware, speedups ranging from 1.2 to 3.7, with an average of 2.2, are achieved.
… , 1993, with the European Event in …, 1994
Proceedings of the …, 2005
Modern embedded microprocessors use low power on-chip memories called scratch-pad memories to store frequently executed instructions and data. Unlike traditional caches, scratch-pad memories lack the complex tag checking and comparison logic, thereby proving to be efficient in area and power. In this work, we focus on exploiting scratch-pad memories for storing hot code segments within an application. Static placement techniques focus on placing the most frequently executed portions of programs into the scratch-pad. ...