Despite significant investment in software infrastructure, machine learning systems, runtimes and compilers do not compose properly. We propose a new design aiming at providing unprecedented degrees of modularity, composability and genericity. This paper discusses a structured approach to the construction of domain-specific code generators for tensor compilers, with the stated goal of improving the productivity of both compiler engineers and end-users. The approach leverages the natural structure of tensor algebra. It has been the main driver for the design of progressive lowering paths in MLIR. The proposed abstractions and transformations span data structures and control flow with both functional (SSA form) and imperative (side-effecting) semantics. We discuss the implications of this infrastructure on compiler construction and present preliminary experimental results.
In this paper, we present a method to construct a loop transformation that simultaneously reshapes the access patterns of several occurrences of multidimensional arrays along certain desired access directions. First, the method determines a direction through the original iteration space along which these desired access directions are induced. Subsequently, a unimodular transformation is constructed that changes the iteration space traversal accordingly. Finally, data dependences are accounted for. In particular, this reshaping method can be used to improve data locality in a program.
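A minimal sketch of the underlying idea (not the paper's construction method): a unimodular matrix U, with integer entries and determinant ±1, bijectively remaps the integer points of an iteration space. The interchange matrix below swaps the loop indices, so a column-wise traversal of an array becomes row-wise.

```python
def matvec(U, v):
    """Integer matrix-vector product: map iteration vector v to U @ v."""
    return tuple(sum(U[r][c] * v[c] for c in range(len(v))) for r in range(len(U)))

def det2(U):
    """Determinant of a 2x2 integer matrix."""
    return U[0][0] * U[1][1] - U[0][1] * U[1][0]

def apply_unimodular(U, iteration_space):
    """Remap every iteration point; unimodularity (|det| == 1) guarantees
    the map is a bijection on the integer lattice, so no iterations are
    lost or duplicated."""
    assert abs(det2(U)) == 1, "U must be unimodular"
    return [matvec(U, i) for i in iteration_space]

# Loop interchange as a unimodular transformation: (i, j) -> (j, i).
U = [[0, 1], [1, 0]]
original = [(i, j) for i in range(2) for j in range(3)]
print(apply_unimodular(U, original))
```

The same machinery covers skewing, reversal, and their compositions, since products of unimodular matrices remain unimodular.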
Packet communication is used in the architecture of a memory system having hierarchical structure. The behavior of this memory system is prescribed by a formal memory model appropriate to a computer system for data flow programs.
In this paper, we explore a strategy that can be used by a source-to-source restructuring compiler to exploit implicit loop parallelism in Java programs. First, the compiler must identify the parallel loops in a program. Thereafter, the compiler explicitly expresses this parallelism in the transformed program using the multithreading mechanism of Java. Finally, after a single compilation of the transformed program into Java byte-code, speedup can be obtained on any platform on which the Java byte-code interpreter supports actual concurrent execution of threads, whereas threads only induce a slight overhead for serial execution. In addition, this approach can enable a compiler to explicitly express the scheduling policy of each parallel loop in the program.
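The transformation sketched below illustrates the general idea in Python rather than Java (the paper's target): a loop whose iterations are independent is rewritten so its iteration range is block-scheduled across explicitly created threads.

```python
import threading

def parallel_loop(body, n, num_threads=4):
    """Run body(i) for i in range(n), block-scheduling iterations across
    worker threads -- a sketch of how a restructuring compiler can make
    implicit loop parallelism explicit. Correct only when iterations
    carry no dependences."""
    def worker(lo, hi):
        for i in range(lo, hi):
            body(i)
    chunk = (n + num_threads - 1) // num_threads
    threads = [threading.Thread(target=worker,
                                args=(t * chunk, min((t + 1) * chunk, n)))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Example: each iteration writes a distinct element, so no dependences.
result = [0] * 8
parallel_loop(lambda i: result.__setitem__(i, i * i), 8, num_threads=2)
print(result)
```

The block-scheduling policy here (contiguous chunks per thread) is one of the choices the abstract notes a compiler could express explicitly per loop; cyclic or guided scheduling would be sketched analogously.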
Since its introduction in the late eighties, the global Internet has grown from a wide area information repository to a large metacomputer consisting of a great number of high performance computers. To exploit this computing power, the heterogeneous nature of the Internet has to be overcome. With the introduction of Java [6] and its accompanying Bytecode [7], truly portable programs can be written in a high level language that can be downloaded to and run on any computer that hosts a Bytecode interpreter. This feature, together with the communication that the API provides, makes Java a suitable language to implement distributed software systems. The interpretative nature of Java programs makes the language less suitable for high performance computing. Currently several Just-In-Time (JIT) compilers are available, which translate the Java Bytecode to native machine code just prior to execution. Other attempts to make Java suitable for high performance computing consist of optimizing Java com...
This note explores state space search to find efficient instruction sequences that perform particular data manipulations. Once found, the instruction sequences are hard-wired in the code generator that needs these data manipulations. Since state space is only searched while developing the compiler, search time is not at a premium, which allows exhaustively searching for the best possible instruction sequences.
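A toy illustration of the idea (the note's actual instruction set and target manipulations are not reproduced here): breadth-first search over a tiny machine state space finds the shortest sequence of "instructions" that reverses a 4-element register, and BFS guarantees the first sequence found is minimal.

```python
from collections import deque

# Hypothetical 3-instruction machine operating on a 4-element register.
OPS = {
    "swap_halves": lambda s: s[2:] + s[:2],
    "swap_pairs":  lambda s: (s[1], s[0], s[3], s[2]),
    "rotate_left": lambda s: s[1:] + s[:1],
}

def find_sequence(start, goal):
    """Exhaustive breadth-first search: the first sequence reaching the
    goal state is the shortest, which is what gets hard-wired into the
    code generator."""
    seen = {start}
    queue = deque([(start, [])])
    while queue:
        state, seq = queue.popleft()
        if state == goal:
            return seq
        for name, op in OPS.items():
            nxt = op(state)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, seq + [name]))
    return None  # goal unreachable with these instructions

print(find_sequence(('a', 'b', 'c', 'd'), ('d', 'c', 'b', 'a')))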
The Software Vectorization Handbook: Applying Multimedia Extensions for Maximum Performance
A boat can be automatically locked on a boat trailer by impacting the bow eye of the boat with the Automatic Boat Trailer Latch. The striker of the latch opens and permits the eye to enter and then closes, locking the boat on the trailer. Opening the striker permits the boat to be floated free of the trailer and launched.
Computing Deep Perft and Divide Numbers for Checkers
ICGA Journal, 2012
The perft method, originating from the chess programming community, has become a widespread way to test the correctness and performance of move generators. Although its usefulness diminishes with depth, computing deep perft numbers poses an interesting computational challenge by itself. This paper presents perft and corresponding divide numbers for American checkers up to depth 28 together with background on the distributed implementation used to compute deep numbers.
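The perft and divide computations themselves are short recursions over a move generator; the sketch below uses a hypothetical toy generator (`moves`, three successors per position) in place of a real checkers engine, so the counts are simply 3^depth rather than the paper's checkers numbers.

```python
def moves(state):
    # Toy stand-in for a checkers move generator: every position has
    # exactly 3 successor positions (hypothetical, for illustration).
    return [state + (m,) for m in range(3)]

def perft(state, depth):
    """Count leaf nodes of the game tree at exactly this depth."""
    if depth == 0:
        return 1
    return sum(perft(s, depth - 1) for s in moves(state))

def divide(state, depth):
    """Break the root perft total down per first move -- the standard
    way to localize a move-generator bug to one subtree."""
    return {s[-1]: perft(s, depth - 1) for s in moves(state)}

print(perft((), 4))   # 3**4 = 81
print(divide((), 4))  # {0: 27, 1: 27, 2: 27}
```

Divide also underlies distributed computation of deep numbers: each first-move subtree is an independent work unit whose per-move totals sum to the root perft.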
Reordering transformations in the framework of linear loop transformations tend to generate complex loop nests that scan the transformed iteration spaces. These loop nests have an incurred overhead due to complex loop bounds and guards that guarantee the correct execution of the operations in the reshaped iteration space. In this paper we discuss several techniques for reducing overhead in the resulting loop structures. In particular, we describe two...
Systems based on the Pentium® III and Pentium® 4 processors enable the exploitation of parallelism at a fine- and medium-grained level. Dual- and quad-processor systems, for example, enable the exploitation of medium-grained parallelism by using multithreaded code that takes advantage of multiple control and arithmetic logic units. Streaming Single-Instruction-Multiple-Data (SIMD) extensions, on the other hand, enable the exploitation of...
Traditionally restructuring compilers were only able to apply program transformations in order to exploit certain characteristics of the target architecture. Adaptation of data structures was limited to e.g. linearization or transposing of arrays. However, as more complex data structures are required to exploit characteristics of the data operated on, current compiler support appears to be inappropriate. In this paper we present the implementation issues of a restructuring compiler that automatically converts programs operating on dense matrices into sparse code, i.e. after a suitable data structure has been selected for every dense matrix that in fact is sparse, the original code is adapted to operate on these data structures. This simplifies the task of the programmer and, in general, enables the compiler to apply more optimizations.
Proceedings of the 7th international conference on Supercomputing - ICS '93, 1993
The problem of compiler optimization of sparse codes is well known and no satisfactory solutions have been found yet. One of the major obstacles is formed by the fact that sparse programs deal explicitly with the particular data structures selected for storing sparse matrices. This explicit data structure handling obscures the functionality of a code to such a degree that the optimization of the code is prohibited, e.g. by the introduction of indirect addressing. The method presented in this paper postpones data structure selection until the compile phase, thereby allowing the compiler to combine code optimization with explicit data structure selection. This method not only enables the compiler to generate efficient code for sparse computations, it also greatly reduces the complexity of the programmer's task.
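A small sketch of what such a conversion produces (using compressed sparse row storage as one possible compact structure; the paper's selection method itself is not reproduced): the dense two-level matrix-vector loop is rewritten to iterate over stored nonzeros only. Note the indirect addressing through `col_idx`, exactly the kind of access that obscures the code when the programmer writes it by hand rather than letting the compiler introduce it.

```python
def to_csr(dense):
    """Compress a dense matrix (list of rows) into CSR arrays,
    storing only the nonzeros."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def spmv(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product: the dense loop
    y[i] += A[i][j] * x[j] rewritten against CSR storage."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]   # indirect column access
        y.append(acc)
    return y

A = [[0, 2, 0], [1, 0, 0], [0, 0, 3]]
print(spmv(*to_csr(A), [1, 1, 1]))  # [2, 1, 3]
```

Because the source program stays dense, the compiler can still analyze and optimize it before committing to any particular compact layout.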
Recent extensions to the Intel® Architecture feature the SIMD technique to enhance the performance of computationally intensive applications that perform the same operation on different elements in a data set. To date, much of the code that exploits these extensions has been hand-coded. The task of the programmer is substantially simplified, however, if a compiler does this exploitation automatically. The high-performance Intel® C++/Fortran compiler supports automatic translation of serial loops into code that uses the SIMD extensions to the Intel® Architecture. This paper provides a detailed overview of the automatic vectorization methods used by this compiler together with an experimental validation of their effectiveness.
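One standard step in such automatic vectorization is strip-mining: the serial loop is split into a main loop stepping by the vector length and a scalar remainder loop. The sketch below simulates this in plain Python (an assumed 4-lane vector length; a real compiler would emit one SIMD instruction per chunk instead of the list operation):

```python
VL = 4  # assumed vector length, e.g. four 32-bit lanes

def vector_add(a, b):
    """Strip-mined elementwise addition: the main loop steps in chunks
    of VL that a vectorizing compiler would map to one SIMD instruction
    each; the scalar remainder loop handles trip counts not divisible
    by VL."""
    n = len(a)
    c = [0] * n
    i = 0
    while i + VL <= n:                 # vectorized main loop
        c[i:i + VL] = [x + y for x, y in zip(a[i:i + VL], b[i:i + VL])]
        i += VL
    while i < n:                       # scalar remainder loop
        c[i] = a[i] + b[i]
        i += 1
    return c

print(vector_add([1] * 10, [2] * 10))  # two vector steps, two scalar steps
```

Before applying this rewrite, the compiler must also prove the loop free of lane-crossing dependences, which is where the paper's analysis methods come in.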
In a new approach to the development of sparse codes, the programmer defines a particular algorithm on dense matrices which are actually sparse. The sparsity of the matrices as indicated by the programmer is only dealt with at compile-time. The compiler selects an appropriate compact data structure and automatically converts the algorithm into code that takes advantage of the sparsity of the matrices. In order to achieve efficient sparse codes, the compiler must be able to reshape some access patterns before a data structure is selected. In this paper, we discuss a reshaping method that is based on unimodular transformations.
Papers by Aart Bik