Papers by Stephen A. Edwards

In this paper we solve the problem of identifying a "matching" between two logic circuits or "networks". A matching is a function that maps each gate or "node" in the new circuit to one in the old circuit (if no match exists, it maps the node to null). We present both an exact and a heuristic way to solve the maximal matching problem. The matching problem does not require any input correspondences; the purpose is to identify structurally identical regions in the networks. We apply this solution to the problem of incremental design. Logic design is usually an iterative process in which errors are corrected and optimizations performed repeatedly. A designer rectifies, re-optimizes, and rechecks a design many times. In practice, it is common for small, incremental changes to be made to the design, rather than changing the entirety of the design. Currently, each time the system is modified, the entire set of computations (synthesis, verification) ...
International Workshop on Synchronous Languages, Applications, and Programming (SLAP'06), Vienna, Austria, Mar 1, 2006
Embedded systems often suffer from severe resource constraints such as limited memory for programs and data. In this work, we address the problem of compiling the Esterel synchronous language for processors with such constraints. ... We introduce a virtual machine that executes a compact bytecode designed specifically for executing Esterel and present a compiler for it. Our technique generates code that is roughly half the size of optimized C code compiled using existing techniques. ... We demonstrate the utility of our approach on the Lego RCX ...

Partial evaluation has been applied to compiler optimization and generation for decades. Most of the successful partial evaluators have been designed for general-purpose languages. Our observation is that domain-specific languages are also suitable targets for partial evaluation. The unusual computational models in many DSLs bring challenges as well as optimization opportunities to the compiler. To enable aggressive optimization, partial evaluation has to be specialized to fit the specific paradigm of a DSL. In this dissertation, we present three such specialized partial evaluation techniques designed for specific languages that address a variety of compilation concerns. The first algorithm provides a low-cost solution for simulating concurrency on a single-threaded processor. The second enables a compiler to compile modest-sized synchronous programs in pieces that involve communication cycles. The third statically elaborates recursive function calls that enable programmers to dynamically create a system's concurrent components in a convenient and algorithmic way. Our goal is to demonstrate the potential of partial evaluation to solve challenging issues in code generation for domain-specific languages. Naturally, we do not cover all DSL compilation issues. We hope our work will enlighten and encourage future research on the application of partial evaluation to this area.
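The core idea of partial evaluation can be illustrated on a tiny example: specializing a general function with respect to a statically known input, leaving a residual program with the known part computed away. This is a minimal Python sketch of the technique, not taken from the dissertation; all names are illustrative.

```python
def power(x, n):
    # General function: computes x**n by repeated multiplication.
    result = 1
    for _ in range(n):
        result *= x
    return result

def specialize_power(n):
    # A toy partial evaluator for `power`: with n known statically, the
    # loop is unrolled at specialization time, leaving a residual
    # program that only multiplies.
    source = "lambda x: " + ("1" if n == 0 else " * ".join(["x"] * n))
    return eval(source), source

cube, residual = specialize_power(3)
# residual is "lambda x: x * x * x"; cube(5) == power(5, 3) == 125
```

Each of the dissertation's three techniques applies this same static/dynamic split, but to a DSL-specific construct (concurrency, communication cycles, recursion) rather than a loop bound.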

Modern society is irreversibly dependent on computers and, consequently, on software. However, as the complexity of programs increases, so does the number of defects within them. To alleviate the problem, automated techniques are constantly used to improve software quality. Static analysis is one such approach, in which violations of correctness properties are searched for and reported. Static analysis has many advantages, but it is necessarily conservative because it symbolically executes the program instead of using real inputs, and it considers all possible executions simultaneously. Being conservative often means issuing false alarms or missing real program errors. Pointer variables are a challenging aspect of many languages that can force static analysis tools to be overly conservative. It is often unclear which variables are affected by pointer-manipulating expressions, and aliasing between variables is one of the banes of program analysis. To alleviate that, a common solution is to allow the programmer to provide annotations, such as declaring a variable as unaliased in a given scope, or to provide special constructs such as the "never-null" pointer of Cyclone. However, programmers rarely keep these annotations up to date. The solution is to provide some form of pointer analysis, which derives useful information about pointer variables in the program. An appropriate pointer analysis equips the static tool to report more errors without risking too many false alarms. This dissertation proposes a methodology for pointer analysis that is specially tailored for "modular bug finding." It presents a new analysis space for pointer analysis, defined by finer-grain "dimensions of precision," which allows us to explore and evaluate a variety of different algorithms to achieve better trade-offs between analysis precision and efficiency.
This framework is developed around a new abstraction for computing points-to sets, the Assign-Fetch Graph, which has many interesting features. Empirical evaluation shows promising results: some previously unknown errors in well-known applications were discovered.
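The aliasing problem the dissertation targets shows up in just a few lines of code; the sketch below (an illustrative example, not from the dissertation) shows why an analysis that ignores points-to information must be conservative about any write through a pointer.

```python
def update(p, q):
    # If p and q may alias (refer to the same object), the write
    # through q invalidates whatever the analysis inferred about p[0].
    p[0] = 1
    q[0] = 2
    return p[0]  # 1 only if p and q never alias; 2 if they do

a = [0]
print(update(a, a))       # aliased call: prints 2
print(update([0], [0]))   # unaliased call: prints 1
```

A points-to analysis computes, for each pointer, the set of memory locations it may reference; here it would record that `p` and `q` may both point to the object bound to `a`, and the precision with which such sets are computed is exactly the trade-off the dissertation's "dimensions of precision" explore.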
To simplify the implementation of dataflow systems in hardware, we present a technique for designing latency-insensitive dataflow blocks. We provide buffering with backpressure, resulting in blocks that compose into deep, high-speed pipelines without introducing long combinational paths. Our input and output buffers are easy to assemble into simple unit-rate dataflow blocks, arbiters, and blocks for Kahn networks. We prove the correctness of our buffers, illustrate how they can be used to assemble arbitrary dataflow blocks, discuss pitfalls, and present experimental results that suggest our pipelines can operate at a high clock rate independent of length.

Parallel architectures are the way of the future, but are notoriously difficult to program. In addition to the low-level constructs they often present (e.g., locks, DMA, and non-sequential memory models), most parallel programming environments admit data races: the environment may make nondeterministic scheduling choices that can change the function of the program. We believe the solution is model-based design, where the programmer is presented with a constrained higher-level language that prevents certain unwanted behavior. In this paper, we describe a compiler for the SHIM scheduling-independent concurrent language that generates code for the Cell Broadband heterogeneous multicore processor. The complexity of the code our compiler generates relative to the source illustrates how difficult it is to manually write code for the Cell. We demonstrate the efficacy of our compiler on two examples. While the SHIM language is (by design) not ideal for every algorithm, it works well for certain applications and simplifies the parallel programming process, especially on the Cell architecture.

We explore on-chip network topologies for the Q100, an analytic query accelerator for relational databases. In such data-centric accelerators, interconnects play a critical role by moving large volumes of data. In this paper we show that various interconnect topologies can trade a factor of 2.5× in performance for 3.3× in area. Moreover, standard topologies (e.g., ring or mesh) are not optimal. Significant prior work on network topology specialization augments generic topologies with additional dedicated links. In this paper, we present a network specialization algorithm that builds a specialized network first, then introduces a generic network as a fallback. We find our algorithm produces networks that are 1.24× slower than the highest-performance generic topology considered (a fat tree), and 18% smaller than the least expensive (a double ring). Moreover, our method produces topologies that outperform those produced by others by 1.21× while being 25% smaller.
Proceedings of the IEEE, Mar 1, 1997
This paper addresses the design of reactive real-time embedded systems. Such systems are often heterogeneous in implementation technologies and design styles, for example by combining hardware ASICs with embedded software. The concurrent design process for such embedded systems involves solving the specification, validation, and synthesis problems. We review the variety of approaches to these problems that have been taken.

Hardware accelerators are one promising solution to contend with the end of Dennard scaling and the slowdown of Moore's law. For mature workloads that are regular and have high compute per byte, hardening an application into one or more hardware modules is a standard approach. However, for some applications, we find that a programmable homogeneous architecture is preferable. This paper compares a previously proposed heterogeneous hardware accelerator for analytical query processing to a homogeneous systolic array alternative. We find that the heterogeneous and homogeneous accelerators are equivalent for large designs, while for small designs the homogeneous one is better. Our analysis explains this counter-intuitive result, finding that the homogeneous architecture has higher average resource utilization and lower relative costs for the communication infrastructure.
To provide a superior way of coding, compiling, and optimizing parallel algorithms, we are developing techniques for synthesizing hardware from functional specifications. Recursion, fundamental to functional languages, does not translate naturally to hardware, but tail recursion is iteration and is easily implemented as a finite-state machine. In this paper, we show how to translate general recursion into tail recursion with an explicit stack that can be implemented in hardware. We give examples, describe the algorithm, and argue for its correctness. We also present experimental results that demonstrate the method is effective.
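The transformation at the heart of this paper, general recursion rewritten as iteration over an explicit stack, can be sketched in software. The following is an illustrative Python analogue of the idea (not the paper's algorithm, which targets hardware): a non-tail-recursive function becomes a loop whose stack frames record what work remains after each "call" returns.

```python
def fib_rec(n):
    # Non-tail-recursive original: two recursive calls, then an addition.
    return n if n < 2 else fib_rec(n - 1) + fib_rec(n - 2)

def fib_stack(n):
    # Equivalent iterative version: an explicit stack of tagged frames
    # plays the role of a hardware stack.  CALL frames expand into
    # subproblems; SUM frames combine the two returned results, i.e.
    # they encode the continuation of the original recursion.
    stack, results = [("CALL", n)], []
    while stack:
        tag, v = stack.pop()
        if tag == "CALL":
            if v < 2:
                results.append(v)
            else:
                # Defer the addition until both subcalls have returned.
                stack.append(("SUM", None))
                stack.append(("CALL", v - 1))
                stack.append(("CALL", v - 2))
        else:  # SUM: combine the two most recent results
            results.append(results.pop() + results.pop())
    return results.pop()

assert all(fib_stack(i) == fib_rec(i) for i in range(15))
```

Because the loop body is a single dispatch on the frame tag, it maps directly onto a finite-state machine plus a stack memory, which is what makes the construction implementable in hardware.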

Designers and high-level synthesis tools can introduce unwanted cycles in digital circuits, and for certain combinational functions, cyclic circuits that are stable and do not hold state are the smallest or most natural representations. Cyclic combinational circuits have well-defined functional behavior yet wreak havoc with most logic synthesis and timing tools, which require combinational logic to be acyclic. As such, some sort of cycle-removal step is necessary to handle these circuits with existing tools. We present a two-stage algorithm for transforming a combinational cyclic circuit into an equivalent acyclic circuit. The first part quickly and exactly characterizes all combinational behavior of a cyclic circuit. It starts by applying input patterns to each input and examining the boundary between gates whose outputs are and are not defined to find additional input patterns that make the circuit behave combinationally. It produces sets of assignments to inputs that together cover all combinational behavior. This can be used to report errors, as an optimization aid, or to restructure the circuit into an acyclic equivalent. The second stage of our algorithm does this restructuring by creating an acyclic circuit fragment from each of these assignments and assembles these fragments into an acyclic circuit that reproduces all the combinational behavior of the original cyclic circuit. Experiments show that our algorithm runs in seconds on real-life cyclic circuits, making it useful in practice.
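The style of analysis described in the first stage can be approximated by three-valued (ternary) fixed-point simulation: start every internal net at an "undefined" value X and repeatedly evaluate gates until nothing changes; an input pattern is combinational if no net remains X. This is an illustrative Python sketch on a toy two-gate cycle, not the paper's algorithm.

```python
X = None  # the "undefined" value in three-valued simulation

def t_and(a, b):
    # Three-valued AND: a controlling 0 forces the output even while
    # the other input is still undefined.
    if a == 0 or b == 0: return 0
    if a == 1 and b == 1: return 1
    return X

def t_or(a, b):
    # Three-valued OR: a controlling 1 forces the output.
    if a == 1 or b == 1: return 1
    if a == 0 and b == 0: return 0
    return X

def simulate(x):
    # Cyclic circuit: a = x OR b, b = x AND a.  Iterate to a fixed
    # point; because the update is monotone in the information order
    # (X below 0 and 1), the loop always terminates.
    a = b = X
    changed = True
    while changed:
        na, nb = t_or(x, b), t_and(x, a)
        changed = (na, nb) != (a, b)
        a, b = na, nb
    return a, b

print(simulate(0))  # (0, 0): the feedback resolves for this input
print(simulate(1))  # (1, 1): and for this one, so the cycle is benign
```

Here both input patterns drive every net to a defined value, so despite the feedback the circuit is combinational for all inputs, which is exactly the property the paper's first stage characterizes exactly.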

Dataflow analysis is a well-understood and very powerful technique for analyzing programs as part of the compilation process. Virtually all compilers use some sort of dataflow analysis as part of their optimization phase. However, despite being well-understood theoretically, such analyses are often difficult to code, making it difficult to quickly experiment with variants. To address this, we developed a domain-specific language, Analyzer Generator (AG), that synthesizes dataflow analysis phases for Microsoft's Phoenix compiler framework. AG hides the fussy details needed to make analyses modular, yet generates code that is as efficient as the hand-coded equivalent. One key construct we introduce allows IR object classes to be extended without recompiling. Experimental results on three analyses show that AG code can be one-tenth the size of the equivalent handwritten C++ code with no loss of performance. It is our hope that AG will make developing new dataflow analyses much easier.
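The kind of fixed-point computation such a generator emits can be sketched with classic liveness analysis. This is an illustrative Python version of the underlying dataflow equations, not AG syntax or Phoenix API code.

```python
def liveness(cfg, use, defs):
    # Backward dataflow analysis, iterated to a fixed point:
    #   live_in[b]  = use[b] | (live_out[b] - defs[b])
    #   live_out[b] = union of live_in over the successors of b
    # A generator like AG would synthesize exactly this loop from a
    # declarative description of the transfer functions.
    live_in = {b: set() for b in cfg}
    live_out = {b: set() for b in cfg}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            succs = cfg[b]
            out = set().union(*(live_in[s] for s in succs)) if succs else set()
            inn = use[b] | (out - defs[b])
            if inn != live_in[b] or out != live_out[b]:
                live_in[b], live_out[b], changed = inn, out, True
    return live_in, live_out

# Tiny two-block CFG: b0 -> b1; b0 defines x, b1 uses x.
cfg = {"b0": ["b1"], "b1": []}
use = {"b0": set(), "b1": {"x"}}
defs = {"b0": {"x"}, "b1": set()}
live_in, live_out = liveness(cfg, use, defs)
# x is live out of b0 and live into b1, but not live into b0
```

The boilerplate a hand-coded C++ analysis needs (worklists, IR traversal, change detection) is precisely what the equations above omit, which is where the claimed ten-fold reduction in code size comes from.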

Proceedings, 2007
We argue that at least for embedded software applications, computer architecture, software, and networking have gone too far down the path of emphasizing average case performance over timing predictability. In architecture, techniques such as multi-level caches and deep pipelines with dynamic dispatch and speculative execution make worst-case execution times (WCET) highly dependent on both implementation details of the processor and on the context in which the software is executed. Yet virtually all real-time programming methodologies depend on WCET. When timing properties are important in the software and when concurrent execution is affected by timing, the result is brittle designs. In this paper, we argue for precision timed (PRET) machines, which deliver high performance, but not at the expense of timing predictability. We summarize a number of research approaches that can be used to create PRET machines, and discuss how the software, operating system, and networking abstractions built above the machine architecture will have to change.
2022 20th ACM-IEEE International Conference on Formal Methods and Models for System Design (MEMOCODE)
Proceedings of Cyber-Physical Systems and Internet of Things Week 2023
Timing is an essential feature of reactive software. It is not just a performance metric, but rather forms a core part of the semantics of programs. This paper argues for a notion of logical time that serves as an engineering model to complement a notion of physical time, which models the physical passage of time. Programming models that embrace logical time can provide deterministic concurrency, better analyzability, and practical realizations of timing-sensitive applications. We give definitions for physical and logical time and review some languages and formalisms that embrace logical time.
Springer eBooks, 1996
We address the problem of selecting the minimum-sized finite state machine consistent with given input/output samples. The problem can be solved by computing the minimum finite state machine equivalent to a finite state machine without loops obtained from the training set. We compare the performance of four algorithms for this task: two algorithms for incompletely specified finite state machine reduction, an algorithm based on a well-known explicit search procedure, and an algorithm based on a new implicit search procedure that is introduced in this paper.

Dagstuhl Reports, 2013
Synchronous programming languages are programming languages with an abstract (logical) notion of time: the execution of such programs is divided into discrete reaction steps, and in each of these reaction steps, the program reads new inputs and reacts by computing corresponding outputs of the considered reaction step. The programs are called synchronous because all outputs are computed together in zero time within a step and because parallel components synchronize their reaction steps by the semantics of the languages. For this reason, the synchronous composition is deterministic, which is a great advantage concerning predictability, verification of system design, and embedded code generation. Starting with the definition of the classic synchronous languages Esterel, Lustre, and Signal in the late 1980s, the research during the past 20 years has been very fruitful and has led to new languages, compilation techniques, software and hardware architectures, as well as extensions, transform...
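The reaction-step model described above can be sketched as a simple driver loop. This is an illustrative Python sketch of the execution model, not the runtime of any particular synchronous language.

```python
def run_synchronous(program, input_stream):
    # Synchronous execution: time is divided into discrete reaction
    # steps ("ticks"); at each tick the program reads all its inputs
    # and produces all its outputs "instantaneously".  Determinism
    # follows because each tick is a pure function of the inputs and
    # of the state carried between ticks.
    state = program.initial_state()
    outputs = []
    for inputs in input_stream:   # one dictionary of signals per tick
        out, state = program.react(inputs, state)
        outputs.append(out)
    return outputs

class Counter:
    # Toy synchronous component: counts the ticks on which the signal
    # `inc` is present.
    def initial_state(self):
        return 0
    def react(self, inputs, count):
        count += 1 if inputs.get("inc") else 0
        return {"count": count}, count

outs = run_synchronous(Counter(), [{"inc": True}, {}, {"inc": True}])
# -> [{'count': 1}, {'count': 1}, {'count': 2}]
```

Composing several such components within one tick, with all of them observing the same inputs and emitting outputs in the same logical instant, is what the synchronous composition of Esterel, Lustre, and Signal makes deterministic.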