Papers by Jos Van Eijndhoven
New generation System-on-Chips will be extremely complex devices, composed of complex subsystems, relying on abstraction from implementation details. These chips will support the execution of a mix of concurrent applications that are not known in detail at chip design time. These SoCs require a significant degree of programmability to configure both the set of functions that must execute and the structure of the dataflow between these functions. To ease the programming effort, multiprocessor computers have employed cache-coherent shared memory for decades, abstracting the average programmer from system complexity issues such as multiple processors and memory hierarchies.

Dynamic reconfiguration of streaming graphs on a heterogeneous multiprocessor architecture
Embedded Processors for Multimedia and Communications II, 2005
Consumer electronics products are multi-functional devices that combine a set of media applications. Media data in such products is largely processed in heterogeneous multiprocessor subsystems that are integrated into a system on chip (SoC). A product engineer configures each subsystem for a collection of predefined applications when deploying the SoC in a product. Oftentimes, the system supports a large number of desired application configurations, or "use cases". The system moves from one configuration to the next by adapting the configuration of a running application, referred to as "dynamic reconfiguration". This paper presents a practical approach to dynamic application reconfiguration in a heterogeneous multiprocessor subsystem. The targeted media applications are constructed as a graph of concurrently executing interconnected tasks that exchange information through streams of data. Configuring such a streaming graph entails the instantiation and interconnection of tasks, setting of task parameters, assignment of tasks to coprocessors, and the allocation of communication buffers in memory. The paper derives a reconfiguration interface that can be supported in hardware, yet isolates application configuration knowledge from the coprocessor hardware. Though simple and easy to use, the interface addresses the key challenge of reconfiguring individual tasks while maintaining real-time behavior and data integrity of the overall set of concurrently executing applications.
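The four configuration steps named in the abstract above (instantiating and interconnecting tasks, setting task parameters, assigning tasks to coprocessors, allocating communication buffers) can be pictured with a small data-structure sketch. This is only an illustration under assumptions: the class names, the coprocessor identifiers "cp0" and "cp1", the task names, and the buffer size are all hypothetical and are not taken from the paper or its reconfiguration interface.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    coprocessor: str                      # hypothetical coprocessor identifier
    params: dict = field(default_factory=dict)


@dataclass
class Stream:
    producer: str
    consumer: str
    buffer_bytes: int                     # size of the communication buffer


@dataclass
class StreamingGraph:
    tasks: dict = field(default_factory=dict)
    streams: list = field(default_factory=list)

    def add_task(self, name, coprocessor, **params):
        # Instantiate a task, assign it to a coprocessor, set its parameters.
        self.tasks[name] = Task(name, coprocessor, dict(params))

    def connect(self, producer, consumer, buffer_bytes):
        # Interconnect two existing tasks through a buffered stream.
        if producer not in self.tasks or consumer not in self.tasks:
            raise KeyError("both endpoints must be instantiated first")
        self.streams.append(Stream(producer, consumer, buffer_bytes))

    def buffer_footprint(self):
        # Total memory needed for all communication buffers.
        return sum(s.buffer_bytes for s in self.streams)


# Hypothetical two-task graph: a decoder stage feeding a transform stage.
graph = StreamingGraph()
graph.add_task("vld", coprocessor="cp0", standard="mpeg2")
graph.add_task("idct", coprocessor="cp1")
graph.connect("vld", "idct", buffer_bytes=4096)
```

In this sketch a reconfiguration would amount to editing `tasks` and `streams` while the graph is running; the real-time and data-integrity guarantees the paper is concerned with are not modelled here.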
In the world of complex SoCs for consumer applications, multiprocessor architectures usually deploy caching techniques to alleviate the cost of data communication between processing elements. In this application domain, the characteristics of streaming applications play a dominant role in the design of the multiprocessor architectures. These characteristics not only influence the design at SoC level,
The ASCIS data flow graph : semantics and textual format
Journal of The American Chemical Society, 1991
International Conference on Hardware Software Codesign, 2002
Eclipse defines a heterogeneous multiprocessor architecture template for data-dependent stream processing. Intended as a scalable and flexible subsystem of forthcoming media-processing systems-on-a-chip, Eclipse combines application configuration flexibility with the efficiency of function-specific hardware, or coprocessors. To facilitate reuse, Eclipse separates coprocessor functionality from generic support that addresses multi-tasking, inter-task synchronization, and data transport. Five interface primitives accomplish this separation. The

Proceedings Design, Automation and Test in Europe. Conference and Exhibition 2001, 2001
A compiler-simulator framework must be retargetable to enable platform-based processor design as well as proper processor architecture design space exploration. This paper describes the design decisions taken for the retargetability mechanism of the Philips Research compiler-simulator framework driven by a central machine description file. The format of the machine description file plays an important role in defining the scope of retargetability of a compiler-simulator framework. The machine description format PRMDL used in Philips Research supports a wide variety of VLIW architectures. In particular, PRMDL is capable of expressing clustered architecture features such as incomplete bypass networks, multiple register files, along with functional units shared or distributed among multiple issue slots, diverse conditional operation mappings, and more. The structure of PRMDL features separate software and hardware views on a processor. This ensures robustness of retargetability built into tools across several processor generations.
On design rule correct maze routing
Proceedings of European Design and Test Conference EDAC-ETC-EUROASIC, 1994
Ed. P. Huijbregts, Jos T.J. van Eijndhoven and Jochen A.G. Jess, Eindhoven University of Technology, Department of Electrical Engineering, Design Automation Section, The Netherlands ...
The yorktown silicon compiler
Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040), 1999
The architecture of the TriMedia CPU64 is based on the TM1000 DSPCPU. The original VLIW architecture has been extended with the concepts of vector processing and superoperations. The new vector operations and superoperations need to be supported by the compiler and simulator to make them accessible to application programmers. It was our intention to support these new features while remaining compliant with the ANSI C standard. This paper describes the mechanisms which were implemented to achieve this goal. Furthermore, the optimization of applications needs to address the vectorization of the functions to be implemented. Some general guidelines for producing efficient vectorized code are given.

Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040), 1999
We present a new VLIW core as a successor to the TriMedia TM1000. The processor is targeted for embedded use in media-processing devices like DTVs and set-top boxes. Intended as a core, its design must be supplemented with on-chip co-processors to obtain a cost-effective system. Good performance is obtained through a uniform 64-bit 5 issue-slot VLIW design, supporting subword parallelism with an extensive instruction set optimized with respect to media-processing. Multi-slot 'super-ops' allow powerful multi-argument and multi-result operations. As an example, an IDCT algorithm shows a very low instruction count in comparison with other processors. To achieve good performance, critical sections in the application program source code need to be rewritten with vector data types and function calls for media operations. Benchmarking with several media applications was used to tune the instruction set and study cache behavior. This resulted in a VLIW architecture with wide data paths and relatively simple CPU control.
Proc. IEEE Symp. Field- …, 2002
The paper presents a Design Space Exploration (DSE) experiment which has been carried out in order to determine the optimum FPGA-based Variable-Length Decoder (VLD) computing resource and its associated instructions, with respect to an entropy decoding ...

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003, 2003
A case study on Color Space Conversion (CSC) for MPEG decoding, carried out on an FPGA-augmented TriMedia processor, is presented. That is, a transform from the Y'CbCr color space to the R'G'B' color space is addressed. First, we outline the extension of the TriMedia architecture consisting of FPGA-based Reconfigurable Functional Units (RFU) and associated instructions. Then we analyse a CSC (RFU-specific) instruction which can process four pixels per call, and propose a scheme to implement the CSC operation on RFU(s). When mapped on an ACEX EP1K100 FPGA, the proposed CSC exhibits a latency of 10 and a recovery of 2 TriMedia@200 MHz cycles, and occupies 57% of the device. By configuring the CSC facility on the RFU(s) at application load-time, color space conversion can be computed on FPGA-augmented TriMedia with 40% speed-up over the standard TriMedia.
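As a software reference for the kind of operation this abstract maps onto the FPGA, the sketch below converts Y'CbCr samples to R'G'B' and wraps the conversion in a helper that handles four pixels per call. It assumes 8-bit studio-range input and the commonly quoted ITU-R BT.601 coefficients; the abstract does not state which coefficients or number formats the paper uses, and the function names `ycbcr_to_rgb` and `csc4` are hypothetical.

```python
def clamp8(x):
    # Round to the nearest integer and saturate to the 8-bit range.
    return max(0, min(255, int(round(x))))


def ycbcr_to_rgb(y, cb, cr):
    # Assumed BT.601 studio-range conversion (Y' in 16..235, Cb/Cr in 16..240).
    c, d, e = y - 16, cb - 128, cr - 128
    r = 1.164 * c + 1.596 * e
    g = 1.164 * c - 0.392 * d - 0.813 * e
    b = 1.164 * c + 2.017 * d
    return clamp8(r), clamp8(g), clamp8(b)


def csc4(pixels):
    # Convert exactly four (Y', Cb, Cr) pixels per call, mirroring the
    # four-pixels-per-call granularity mentioned in the abstract.
    if len(pixels) != 4:
        raise ValueError("csc4 expects exactly four pixels")
    return [ycbcr_to_rgb(*p) for p in pixels]


black = ycbcr_to_rgb(16, 128, 128)
white = ycbcr_to_rgb(235, 128, 128)
```

With these assumed coefficients, studio black (16, 128, 128) maps to (0, 0, 0) and studio white (235, 128, 128) maps to (255, 255, 255).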

Proceedings 2001 IEEE International Conference on Computer Design: VLSI in Computers and Processors. ICCD 2001, 2001
This paper describes an experiment which aims to reveal the potential impact on performance yielded by augmenting a TriMedia-CPU64 processor with a multiple-context FPGA core. We first propose an extension of the TriMedia-CPU64 architecture, which consists of a Reconfigurable Functional Unit and its associated instructions. Then, we address the decoding of variable-length codes on such extended TriMedia and describe the architecture and FPGA implementation of a Variable-Length Decoder (VLD) computing facility. When mapped on an ACEX EP1K100 FPGA, the proposed VLD exhibits a latency of cycles. Preliminary results indicate that by configuring each of the VLD and 1-D IDCT (which is described elsewhere) facilities on a different FPGA context, and by activating the contexts as needed, the augmented TriMedia can perform macroblock parsing followed up by pel reconstruction with an improvement of 20-25% over the standard TriMedia.
Proceedings of the European Design Automation Conference, 1990., EDAC., 1990
Multirate integration is a technique in which a set of differential equations is solved with different timesteps assigned to subsets of equations [4][10]. In circuit simulation this is commonly used in the waveform relaxation method, where different subcircuits are analyzed independently from the others. An important and obvious advantage is the simulation efficiency: subcircuits which are temporarily changing relatively slowly,
PLATO: a new piecewise linear simulation tool
Proceedings of the European Design Automation Conference, 1990., EDAC., 1990
... The corresponding LU decomposition can be updated by a very efficient algorithm devised by Bennett [7]. A sparse matrix implementation of this algorithm was already presented in [8]. Note ... The transient analysis proceeds in an event-driven manner ...
1988., IEEE International Symposium on Circuits and Systems, 1988
The most important operations for a circuit simulator are component model linearization, updating the network matrix, performing large unsymmetric decomposition on this matrix, and solving the network variables by forward and backward substitution. Methods are presented to keep all these operations localized to the part of the network that is active at the current time point, thus obtaining a considerable

The paper presents a case study on augmenting a TriMedia/CPU64 processor with a Reconfigurable (FPGA-based) Functional Unit (RFU). We first propose an extension of the TriMedia/CPU64 architecture, which consists of an RFU and its associated instructions. Then, we address the computation of the 8×8 IDCT on such extended TriMedia, and propose a scheme to implement an 8-point IDCT operation on the RFU. Further, we address the decoding of Variable Length Codes and describe the FPGA implementation of a Variable Length Decoder (VLD) computing facility. When mapped on an ACEX EP1K100 FPGA from Altera, our 8-point IDCT exhibits a latency of 16 and a recovery of 2 TriMedia cycles, and occupies 42% of the FPGA's logic array blocks. The proposed VLD exhibits a latency of 7 TriMedia cycles when mapped on the same FPGA, and utilizes 6 of its embedded array blocks. By using the 8-point IDCT computing facility, an 8×8 IDCT including all overheads can be computed with the throughput of 1/32 IDCT/cycle. Also, with the proposed VLD computing facility, a single DCT coefficient can be decoded in 11 cycles including all overheads. Simulation results indicate that by configuring each of the 8-point IDCT and VLD computing facilities on a different FPGA context, and by activating the contexts as needed, the augmented TriMedia can perform MPEG macroblock parsing followed up by a pel reconstruction with an improvement of 20-25% over the standard TriMedia.
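One way to read the quoted throughput of 1/32 IDCT/cycle is the back-of-the-envelope count below. It assumes the 2-D transform is an 8×8 block IDCT, as is usual for MPEG, computed by row-column decomposition into sixteen 8-point passes, and that a new 8-point operation can be issued every 2 cycles (the stated recovery) with latency fully overlapped. These are assumptions made for illustration; the abstract does not say how its overheads are accounted for, so the paper's own derivation may differ.

```python
from fractions import Fraction

N = 8                     # block dimension of the assumed 8x8 transform
passes = 2 * N            # 8 row passes + 8 column passes of the 8-point IDCT
recovery_cycles = 2       # recovery of the 8-point IDCT, from the abstract

# Assumed: back-to-back issue, one 8-point IDCT started every recovery period.
cycles_per_block = passes * recovery_cycles
throughput = Fraction(1, cycles_per_block)   # blocks per cycle
```

Under these assumptions the count comes to 32 cycles per block, which is consistent with the figure quoted in the abstract.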