Computer Architecture News, 1988
We describe the MaRS machine: a parallel, distributed-control multiprocessor for graph reduction using a functional machine language. The object code language is based on an optimized set of combinators, and its functional character allows automatic parallelisation of the execution. A programming language, "MARS LISP", has also been developed. A prototype of MaRS is currently being designed in 1.5-micron CMOS VLSI technology with two levels of metal, by means of a CAD system. The machine uses three basic types of processors for Reduction, Memory and Communication, plus auxiliary I/O and Arithmetic Processors; communications do not constitute an operational bottleneck, as interprocessor messages are routed via an Omega switching network. Initially, a Host Computer will be used for startup, testing and direct memory access. The machine architecture and its functional organization are described, as well as the theoretical execution model. We conclude with a number of specialized hardware and software mechanisms that differentiate the MaRS machine from other similar ongoing projects.
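As a rough illustration of the execution model such combinator-based machines share, here is a minimal sketch of graph reduction over an SK-style combinator term, assuming the classic S, K, I set rather than MaRS's optimized combinators; the parallel, distributed-control evaluation is not reproduced here.

```haskell
-- A minimal sketch of combinator reduction with the classic S, K, I set.
-- Reduction rewrites the head of the application spine until no rule applies.
data Term = S | K | I | App Term Term
  deriving Show

-- One reduction step at the head of the spine; Nothing means normal form.
step :: Term -> Maybe Term
step (App I x)                 = Just x                          -- I x     -> x
step (App (App K x) _)         = Just x                          -- K x y   -> x
step (App (App (App S f) g) x) = Just (App (App f x) (App g x))  -- S f g x -> f x (g x)
step (App f x)                 = (`App` x) <$> step f            -- reduce the function part
step _                         = Nothing

-- Repeatedly apply step: e.g. reduce (App (App (App S K) K) I) yields I.
reduce :: Term -> Term
reduce t = maybe t reduce (step t)
```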
Concurrency: Practice and Experience, 1991
This paper is concerned with the implementation of functional languages on a parallel architecture, using graph reduction as a model of computation. Parallelism in such systems is automatically derived by the compiler, but a major problem is fine granularity, illustrated in Divide-and-Conquer problems at the leaves of the computational tree. The paper addresses this issue and proposes a method based on static analysis combined with run-time tests to remove the excess parallelism. We report experiments on a prototype machine simulated on several connected INMOS transputers. Performance figures show the benefits of adopting the method, and the difficulty of automatically deriving the optimum partitioning due to differences among the problems.
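The method described above can be pictured with a standard granularity-control idiom. The sketch below uses GHC's Control.Parallel primitives rather than the paper's transputer prototype; the constant `threshold` is a hypothetical stand-in for the cutoff the paper derives from static analysis and run-time tests.

```haskell
import Control.Parallel (par, pseq)

-- Hypothetical cutoff standing in for the statically/dynamically derived bound.
threshold :: Int
threshold = 20

-- Divide-and-conquer with granularity control: below the threshold the
-- subproblems are too fine-grained to be worth sparking, so we stay sequential.
pfib :: Int -> Integer
pfib n
  | n < 2         = fromIntegral n
  | n < threshold = pfib (n - 1) + pfib (n - 2)   -- leaves: sequential
  | otherwise     = x `par` (y `pseq` (x + y))    -- coarse work: spark x, evaluate y
  where
    x = pfib (n - 1)
    y = pfib (n - 2)
```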
ACM SIGARCH Computer Architecture News, 1986
The G-machine provides architecture support for the evaluation of functional programming languages by graph reduction. This paper describes an instruction fetch unit for such an architecture that provides a high throughput of instructions, low latency and adequate elasticity in the instruction pipeline. This performance is achieved by a hybrid instruction set and a decoupled RISC architecture. The hybrid instruction set consists of complex instructions that reflect the abstract architecture and simple instructions that reflect the hardware implementation. The instruction fetch unit performs translation from a complex instruction to a sequence of simple instructions which can be executed rapidly. A suitable mix of techniques, including cache, buffers and the translation scheme, provides the memory bandwidth required to feed a RISC execution unit. The simulation results identify the performance gains, maximum throughput and minimum latency achieved by various techniques. Results achieved...
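The translation step can be pictured as a simple macro expansion from complex, G-machine-level opcodes to short sequences of simple ones. The opcodes below are illustrative placeholders, not the paper's actual instruction set.

```haskell
-- Illustrative opcode sets: Complex mirrors the abstract G-machine,
-- Simple mirrors a RISC-style execution unit. All names are hypothetical.
data Complex = MkAp | Push Int | Eval
data Simple  = AllocNode | StoreField Int | Load Int | Jump String

-- The fetch unit's translation: each complex instruction expands to a
-- short sequence of simple instructions that the execution unit runs fast.
expand :: Complex -> [Simple]
expand MkAp     = [AllocNode, StoreField 0, StoreField 1]  -- build an application node
expand (Push n) = [Load n]                                 -- copy stack slot n
expand Eval     = [Jump "unwind"]                          -- enter the evaluator loop

-- A program would be streamed to the execution unit as: concatMap expand program
```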
Lecture Notes in Computer Science, 1989
Programmed graph reduction has been shown to be an efficient implementation technique for lazy functional languages on sequential machines. Considering programmed graph reduction as a generalization of conventional environment-based implementations, where the activation records are allocated in a graph instead of on a stack, it becomes very easy to use this technique for the execution of functional programs on a parallel machine with distributed memory. We describe in this paper the realization of programmed graph reduction in PAM, a parallel abstract machine with distributed memory. Results of our implementation of PAM on an Occam/transputer system are given.
8th International Conference on Parallel and Distributed Computing Applications and Technologies (PDCAT '07), 2007
We present the architecture of nreduce, a distributed virtual machine which uses parallel graph reduction to run programs across a set of computers. It executes code written in a simple functional language which supports lazy evaluation and automatic parallelisation. The execution engine abstracts away details of parallelism and distribution, and uses JIT compilation to produce efficient code. This work is part of a broader project to provide a programming environment for developing distributed applications which hides low-level details from the application developer. The language we have designed plays the role of an intermediate form into which existing functional languages can be transformed. The runtime system demonstrates how distributed execution can be implemented directly within a virtual machine, instead of as a separate piece of middleware that coordinates the execution of external programs.
Fourth IEEE Region 10 International Conference TENCON, 1989
This paper describes a new concept for the parallel implementation of functional languages on a network of processors. The implementation uses a special variant of annotated graph reduction [3]. Its main features are the following: We employ active waiting [6] to avoid complicated runtime data structures. We use a global address space and a random distribution of the graph nodes over the local memories of the processors, in order to overcome the problems of load balancing and scheduling. The reduction is organized in cycles, during which all annotated redexes are reduced. This notion of "cycles" enables us to restrict communication between the processors to the execution of a global permutation, defined by an array of messages M = [L local messages × P processors]. This two-dimensional (2D) permutation is realized by a simple and fast algorithm that permutes all messages of M in 2L + 6L log(P) steps, for any sufficiently large L. This algorithm actually maps any 2D permutation to a double 2D-transpose operation [19]. Hence the implementation can be used for any network topology that supports the transpose operation (namely Shuffle Exchange [1]). The simple syntactic system and the mapping to one basic transpose operation make the implementation notably simple and easy to understand. The potential speedup of graph reduction programs is compared with the overhead of the implementation, giving deeper insight into parallel graph reduction.
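The communication primitive at the heart of this scheme can be sketched abstractly: an all-to-all exchange of the L × P message array is exactly a matrix transpose, and the paper's algorithm reduces any 2D permutation to two such transposes with local permutations in between. A minimal sketch of the transpose step only (the local permutations are elided):

```haskell
import Data.List (transpose)

-- Row p holds the messages processor p sends, ordered by destination.
-- After the transpose, row q holds exactly the messages addressed to q.
allToAll :: [[msg]] -> [[msg]]
allToAll = transpose

-- e.g. allToAll [["p0->q0","p0->q1"], ["p1->q0","p1->q1"]]
--   == [["p0->q0","p1->q0"], ["p0->q1","p1->q1"]]
```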
Proceedings of the fourth international conference on Functional programming languages and computer architecture - FPCA '89, 1989
We have implemented a parallel graph reducer on a commercially available shared memory multiprocessor (a Sequent Symmetry™), which achieves real speedup compared to a fast compiled implementation of the conventional G-machine. Using 15 processors, this speedup ranges between 5 and 11, depending on the program. Underlying the implementation is an abstract machine called the (v, G)-machine. We describe the sequential and the parallel (v, G)-machine, and our implementation of them. We provide performance and speedup figures and graphs.
Parallel Computing, 1991
This paper is concerned with the design of a multiprocessor system supporting the parallel execution of functional programs. Parallelism in such systems is automatically derived by the compiler, but this parallelism is unlikely to meet the physical constraints of the target machine. In this paper, these problems are identified for the class of Divide-and-Conquer algorithms and a solution which consists of reducing the depth of the computational tree is proposed. A parallel graph reduction machine simulated on a network of transputers, developed for testing the proposed solutions, is described. Experiments have been conducted on some simple Divide-and-Conquer programs and the results are presented. Lastly, some proposals are made for an automatic system that would efficiently execute any problem belonging to the same class, taking into account the nature of the problem as well as the physical characteristics of the implementation.
Lecture Notes in Computer Science, 1991
Graph rewriting models are well suited to serve as the basic computational model for functional languages and their implementation. Graphs are used to share computations, which is needed to make efficient implementations of functional languages on sequential hardware possible. When graphs are rewritten (reduced) on parallel loosely coupled machine architectures, subgraphs have to be copied from one processor to another, such that sharing is lost. In this paper we introduce the notion of lazy copying. With lazy copying it is possible to duplicate a graph without duplicating work. Lazy copying can be combined with simple annotations which control the order of reduction. In principle, only interleaved execution of the individual reduction steps is possible. However, a condition is deduced under which parallel execution is allowed. When only certain combinations of lazy copying and annotations are used, it is guaranteed that this so-called non-interference condition is fulfilled. Abbreviations for these combinations are introduced. Now complex process behaviours, such as process communication on a loosely coupled parallel machine architecture, can be modelled. This also includes a special case: modelling multiprocessing on a single processor. Arbitrary process topologies can be created. Synchronous and asynchronous process communication can be modelled. The implementation of the language Concurrent Clean, which is based on the proposed graph rewriting model, has shown that complicated parallel algorithms, going far beyond divide-and-conquer-like applications, can be expressed.
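A minimal sketch of the lazy-copying idea, assuming a toy heap representation (the node types and indirections below are illustrative, not Concurrent Clean's actual graph store): evaluated data is duplicated, but a node that still represents work is replaced in the copy by an indirection to the original, so the work is never done twice.

```haskell
import qualified Data.Map as Map

type NodeId = Int
data Node = App NodeId NodeId   -- unevaluated application: represents work
          | Con String          -- evaluated constructor: plain data
          | Ind NodeId          -- indirection to a node held elsewhere
type Graph = Map.Map NodeId Node

-- Copy the node at root, allocating the fresh id; returns the extended
-- graph, the id of the copy, and the next free id.
lazyCopy :: Graph -> NodeId -> NodeId -> (Graph, NodeId, NodeId)
lazyCopy g root fresh =
  case g Map.! root of
    Con s   -> (Map.insert fresh (Con s) g, fresh, fresh + 1)     -- duplicate data
    Ind t   -> lazyCopy g t fresh                                 -- follow indirections
    App _ _ -> (Map.insert fresh (Ind root) g, fresh, fresh + 1)  -- don't copy work:
                                                                  -- point back at it
```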
Future Generation Computer Systems, 1997
This paper presents the results of a simulation study of cache coherency issues in parallel implementations of functional programming languages. Parallel graph reduction uses a heap shared between processors for all synchronisation and communication. We show that a high degree of spatial locality is often present and that the rate of synchronisation is much greater than for imperative programs. We propose a modified coherency protocol with static cache line ownership and show that this allows locality to be exploited to at least the level of a conventional protocol, but without the unnecessary serialisation and network transactions this usually causes. The new protocol avoids false sharing, and makes it possible to reduce the number of messages exchanged, but relies on increasing the size of the cache lines exchanged to do so. It is therefore of most benefit with a high-bandwidth interconnection network with relatively high communication latencies or message handling overheads.
Future Generation Computer Systems, 1993
A clustered architecture has been designed to exploit divide and conquer parallelism in functional programs. The programming methodology developed for the machine is based on explicit annotations and program transformations. It has been successfully applied to a number of algorithms resulting in a benchmark of small and medium size parallel functional programs. Sophisticated compilation techniques are used such as strictness analysis on non-flat domains and RISC and VLIW code generation. Parallel jobs are distributed by an efficient hierarchical scheduler. A special processor for graph reduction has been designed as a basic building block for the machine. A prototype of a single cluster machine has been constructed with stock hardware. This paper describes the experience with the project and its current state.
Knowledge-Based Systems, 1995
The paper explores the implementation of rule-based pattern-directed inference systems on parallel computers, and discusses one such approach in detail: the use of a graph-reduction machine such as ALICE. The technique is illustrated through two example domains: automobile fault diagnosis and organic psychiatric mental disorders. The paper discusses extensions to the graph reduction technique as applied to knowledge-based systems, including partitioning, time considerations and input data types. The paper shows that the graph-reduction technique has significant advantages for knowledge-based system implementation over conventional approaches, and it demonstrates that this programming style is amenable to knowledge engineering domains.
Intel Xeon Phi (MIC architecture) is a relatively new accelerator chip, which combines large-scale shared memory parallelism with wide SIMD lanes. Mapping applications to a node with such an architecture to achieve high parallel efficiency is a major challenge. In this paper, we focus on developing a system for heterogeneous graph processing, which is able to utilize both a many-core Xeon Phi and a multi-core CPU on one node. We propose a simple programming API with an intuitive interface for expressing SIMD parallelism. We develop efficient techniques for supporting our high-level API, focusing on exploiting wide SIMD lanes, the massive number of cores, and partitioning of the work across CPU and accelerator, while handling the irregularity of graph applications. The components of our runtime system include a condensed static memory buffer, which supports efficient message insertion and SIMD message reduction while keeping memory requirements low, and, specifically for MIC, a pipelining scheme for efficient message generation that avoids frequent locking operations. In addition, a hybrid graph partitioning module is able to effectively partition the workload between the CPU and the MIC, ensuring a balanced workload and low communication overhead. The main observations from our experimental evaluation using five popular applications are: for MIC executions, the pipelining scheme is up to 3.36x faster than a naive approach using locking-based message generation, and the speedup over OpenMP ranges from 1.17 to 4.15. Heterogeneous CPU-MIC execution achieves a speedup of up to 1.41 over the better of the CPU-only and MIC-only executions.
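The message-reduction idea behind the condensed buffer can be sketched independently of the SIMD and pipelining machinery: messages addressed to the same vertex are combined on insertion with the application's reduction operator, so the buffer never holds more than one value per destination. A language-neutral sketch (the paper's implementation targets MIC in a low-level language; names here are illustrative):

```haskell
import qualified Data.Map.Strict as Map

type VertexId = Int

-- Combine an incoming (destination, value) message into the buffer using the
-- application's reduction operator, e.g. min for shortest paths or (+) for
-- PageRank-style accumulation.
insertReduce :: (v -> v -> v) -> Map.Map VertexId v -> (VertexId, v) -> Map.Map VertexId v
insertReduce combine buf (dst, val) = Map.insertWith combine dst val buf

-- e.g. foldl (insertReduce min) Map.empty messages
```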
We describe a system that uses automated planning to synthesize correct and efficient parallel graph programs from high-level algorithmic specifications. Automated planning allows us to use constraints to declaratively encode program transformations such as scheduling, implementation selection, and insertion of synchronization. Each plan emitted by the planner satisfies all constraints simultaneously, and corresponds to a composition of these transformations. In this way, we obtain an integrated compilation approach for a very challenging problem domain. We have used this system to synthesize parallel programs for four graph problems: triangle counting, maximal independent set computation, preflow-push maxflow, and connected components. Experiments on a variety of inputs show that the synthesized implementations perform competitively with handwritten, highly-tuned code.
In this paper, we discuss the issues involved in designing ASIPs, and more specifically when they are combined with GPPs. Such a design process involves several steps, and here we focus on one of them: the program transformation phase, the second step of the design process, which is required when certain parts of the application are extracted to be executed on a reconfigurable component. Over the last years, many algorithms have been developed and proposed. In this paper we provide an overview of existing program transformation approaches and, for each of these, we present a critical evaluation, underlining the differentiating features as well as the strengths and weaknesses. We discuss to what extent they can be used for reconfigurable hardware and propose potential improvements.
Proceedings of the 15th ACM SIGPLAN International Conference on Software Language Engineering
Domain-specific language compilers need to close the gap between the domain abstractions of the language and the low-level concepts of the target platform. This can be challenging to achieve for compilers targeting multiple platforms with potentially very different computing paradigms. In this paper, we present a multi-target, multi-paradigm DSL compiler for algorithmic graph processing. Our approach centers around an intermediate representation and reusable, composable transformations to be shared between the different compiler targets. These transformations embrace abstractions that align closely with the concepts of a particular target platform, and disallow abstractions that are semantically more distant. We report on our experience implementing the compiler and highlight some of the challenges and requirements for applying language workbenches in industrial use cases.
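The shared-IR-plus-composable-transformations design can be sketched as ordinary functions over an algebraic data type. Everything below is a hypothetical miniature, not the paper's actual intermediate representation.

```haskell
-- A toy IR for vertex-centric graph programs (constructors are illustrative).
data IR = ForEachVertex IR     -- loop over all vertices
        | UpdateProp String IR -- write a vertex property
        | ReadProp String      -- read a vertex property
        | Seq [IR]             -- sequencing
  deriving Show

-- Transformations are plain IR -> IR functions, so they compose.
type Transform = IR -> IR

-- Example transformation: fuse adjacent vertex loops into a single pass.
fuseLoops :: Transform
fuseLoops (Seq (ForEachVertex a : ForEachVertex b : rest)) =
  fuseLoops (Seq (ForEachVertex (Seq [a, b]) : rest))
fuseLoops (Seq xs) = Seq (map fuseLoops xs)
fuseLoops ir       = ir

-- A target backend is then a composition, e.g. lowerToTarget . fuseLoops,
-- where lowerToTarget is whatever final emission that target needs
-- (the name is hypothetical).
```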
Journal of …, 1997
A generalized computational model based on graph rewriting is presented along with Dactl, an associated compiler target (intermediate) language. An illustration of the capability of graph rewriting to model a variety of computational formalisms is presented by showing how some examples written originally in a number of languages can be described as graph rewriting transformations using Dactl notation. This is followed by a formal presentation of the Dactl model before giving a formal definition of the syntax and semantics of the language. Some implementation issues are also discussed.
Lecture Notes in Computer Science, 2006
A variety of computation models have been developed using graphs and graph transformations. These include models for sequential, distributed, parallel or mobile computation. A graph may represent, in an abstract way, the underlying structure of a computer system, or it may stand for the computation steps running on such a system. In the former case, the computation can be carried out on the corresponding graph, implying a simplification of the complexity of the system. The aim of the workshop is to bring together researchers interested in all aspects of computation models based on graphs, and their applications. Particular emphasis will be placed on models and tools describing general solutions.
2005
We are interested in performing reduction operations on distributed memory machines whose interconnection networks are reconfigurable. More precisely, we focus on machines whose interconnection graph can be configured as any graph of maximum degree d. We discuss the best way of interconnecting the p processors as a function of p, d, and some problem- and machine-dependent parameters that characterize the communication/arithmetic ratio of the reduction operation. Experiments on transputer-based networks are in good agreement with the theoretical results.
2009
A redex in a graph G is a triple r = (u, c, v) of distinct vertices that determine a 2-star. Shrinking r means deleting the center c and merging u with v into one vertex. Reduction of G entails shrinking all of its redexes in a recursive way, and, at the same time, deleting all loops that are created during this process. It is shown that reduction can be implemented in O(m) time, where m is the number of edges in G.
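The shrink operation reads directly off this definition. A sketch on an adjacency-set representation follows; the O(m) bookkeeping of the paper is elided, so this naive version is slower but shows the semantics:

```haskell
import qualified Data.Map as Map
import qualified Data.Set as Set

type Vertex = Int
type Graph  = Map.Map Vertex (Set.Set Vertex)  -- undirected adjacency sets

-- Shrink redex (u, c, v): delete the centre c, merge v into u,
-- and drop any loop the merge creates.
shrink :: Graph -> (Vertex, Vertex, Vertex) -> Graph
shrink g (u, c, v) =
  let g1 = Map.map (Set.delete c) (Map.delete c g)       -- delete centre c
      vN = Set.delete v (Map.findWithDefault Set.empty v g1)
      g2 = Map.adjust (Set.union vN) u (Map.delete v g1) -- merge v's edges into u
      redirect ns | Set.member v ns = Set.insert u (Set.delete v ns)
                  | otherwise       = ns
      g3 = Map.map redirect g2                           -- edges to v now reach u
  in Map.adjust (Set.delete u) u g3                      -- remove the loop at u
```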
Proceedings of Annual IEEE/ACM International Symposium on Code Generation and Optimization - CGO '14, 2014
Large-scale graph processing, with its massive data sets, requires distributed processing. However, conventional frameworks for distributed graph processing, such as Pregel, use non-traditional programming models that are well-suited for parallelism and scalability but inconvenient for implementing non-trivial graph algorithms. In this paper, we use Green-Marl, a Domain-Specific Language for graph analysis, to intuitively describe graph algorithms and extend its compiler to generate equivalent Pregel implementations. Using the semantic information captured by Green-Marl, the compiler applies a set of transformation rules that convert imperative graph algorithms into Pregel's programming model. Our experiments show that the Pregel programs generated by the Green-Marl compiler perform similarly to manually coded Pregel implementations of the same algorithms. The compiler is even able to generate a Pregel implementation of a complicated graph algorithm for which a manual Pregel implementation is very challenging.
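The target model the compiler emits can be pictured with a minimal vertex-centric skeleton. The types and the PageRank-flavoured example below are illustrative, not Green-Marl's or Pregel's actual APIs; the damping constant is assumed.

```haskell
-- Per-vertex view of one superstep: current value, messages received in the
-- previous superstep, and outgoing edge destinations.
data VertexCtx v m = VertexCtx
  { value    :: v
  , incoming :: [m]
  , outEdges :: [Int]
  }

-- A compute kernel returns the new vertex value and the messages to send.
type Compute v m = VertexCtx v m -> (v, [(Int, m)])

-- PageRank-style kernel (0.85 damping assumed, normalization elided).
pagerank :: Compute Double Double
pagerank ctx =
  let outs  = outEdges ctx
      rank  = 0.15 + 0.85 * sum (incoming ctx)
      share = rank / fromIntegral (max 1 (length outs))
  in (rank, [ (d, share) | d <- outs ])
```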