Jos Van Eijndhoven

Papers by Jos Van Eijndhoven

Cache-Coherent Heterogeneous Multiprocessing as Basis for Streaming Applications

by P. Stravers and Jos Van Eijndhoven

New generation System-on-Chips will be extremely complex devices, composed of complex subsystems, relying on abstraction from implementation details. These chips will support the execution of a mix of concurrent applications that are not known in detail at chip design time. These SoCs require a significant degree of programmability to configure both the set of functions that must execute and the structure of the dataflow between these functions. To ease the programming effort, multiprocessor computers have employed cache-coherent shared memory for decades, abstracting the average programmer from system complexity issues such as multiple processors and memory hierarchies.

Unknown

by Sorin Cotofana and Jos Van Eijndhoven

Dynamic reconfiguration of streaming graphs on a heterogeneous multiprocessor architecture

Embedded Processors for Multimedia and Communications II, 2005

Consumer electronics products are multi-functional devices that combine a set of media applications. Media data in such products is largely processed in heterogeneous multiprocessor subsystems that are integrated into a system on chip (SoC). A product engineer configures each subsystem for a collection of predefined applications when deploying the SoC in a product. Oftentimes, the system supports a large number of desired application configurations, or 'use cases'. The system moves from one configuration to the next by adapting the configuration of a running application, referred to as 'dynamic reconfiguration'. This paper presents a practical approach to dynamic application reconfiguration in a heterogeneous multiprocessor subsystem. The targeted media applications are constructed as a graph of concurrently executing interconnected tasks that exchange information through streams of data. Configuring such a streaming graph entails the instantiation and interconnection of tasks, setting of task parameters, assignment of tasks to coprocessors, and the allocation of communication buffers in memory. The paper derives a reconfiguration interface that can be supported in hardware, yet isolates application configuration knowledge from the coprocessor hardware. Though simple and easy to use, the interface addresses the key challenge of reconfiguring individual tasks while maintaining real-time behavior and data integrity of the overall set of concurrently executing applications.
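The configuration steps named in this abstract (instantiating and interconnecting tasks, setting task parameters, assigning tasks to coprocessors, allocating communication buffers) can be pictured with a small data-structure sketch. Everything below is an assumption made for illustration: the class names, the coprocessor identifiers such as "cp0", and the back-to-back buffer allocator are hypothetical and are not the reconfiguration interface derived in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    coprocessor: str                      # hypothetical coprocessor identifier
    params: dict = field(default_factory=dict)


@dataclass
class Stream:
    producer: str
    consumer: str
    buffer_bytes: int
    buffer_base: int = -1                 # filled in by allocate_buffers()


@dataclass
class StreamingGraph:
    tasks: dict = field(default_factory=dict)
    streams: list = field(default_factory=list)

    def add_task(self, name, coprocessor, **params):
        # Instantiate a task, set its parameters, assign it to a coprocessor.
        self.tasks[name] = Task(name, coprocessor, dict(params))

    def connect(self, producer, consumer, buffer_bytes):
        # Interconnect two tasks; both endpoints must already exist.
        if producer not in self.tasks or consumer not in self.tasks:
            raise KeyError("connect() needs two existing tasks")
        self.streams.append(Stream(producer, consumer, buffer_bytes))

    def allocate_buffers(self, base=0):
        # Naive allocation (an assumption): place buffers back to back.
        addr = base
        for stream in self.streams:
            stream.buffer_base = addr
            addr += stream.buffer_bytes
        return addr - base                # total bytes reserved


graph = StreamingGraph()
graph.add_task("vld", coprocessor="cp0")
graph.add_task("idct", coprocessor="cp1", block_size=8)
graph.add_task("csc", coprocessor="cp2")
graph.connect("vld", "idct", buffer_bytes=1024)
graph.connect("idct", "csc", buffer_bytes=2048)
total = graph.allocate_buffers(base=0x1000)
```

The sketch only covers static configuration. The hard part the abstract points to, changing individual tasks while the other applications keep their real-time behavior and data integrity, is not modeled here.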

TriMedia CPU64

by Jos Van Eijndhoven and G. Hekstra

Caching Techniques for Multi-Processor Streaming Architectures

In the world of complex SoCs for consumer applications, multiprocessor architectures usually deploy caching techniques to alleviate the cost of data communication between processing elements. In this application domain, the characteristics of streaming applications play a dominant role in the design of the multiprocessor architectures. These characteristics not only influence the design at SoC level,

The ASCIS data flow graph : semantics and textual format

Journal of The American Chemical Society, 1991

Design of multi-tasking coprocessor control for Eclipse

International Conference on Hardware Software Codesign, 2002

Eclipse defines a heterogeneous multiprocessor architecture template for data-dependent stream processing. Intended as a scalable and flexible subsystem of forthcoming media-processing systems-on-a-chip, Eclipse combines application configuration flexibility with the efficiency of function-specific hardware, or coprocessors. To facilitate reuse, Eclipse separates coprocessor functionality from generic support that addresses multi-tasking, inter-task synchronization, and data transport. Five interface primitives accomplish this separation. The

PRMDL: a machine description language for clustered VLIW architectures

by Jos Van Eijndhoven and Andrei Terechko

Proceedings Design, Automation and Test in Europe. Conference and Exhibition 2001, 2001

A compiler-simulator framework must be retargetable to enable platform-based processor design as well as proper processor architecture design space exploration. This paper describes the design decisions taken for the retargetability mechanism of the Philips Research compiler-simulator framework driven by a central machine description file. The format of the machine description file plays an important role in defining the scope of retargetability of a compiler-simulator framework. The machine description format PRMDL used in Philips Research supports a wide variety of VLIW architectures. In particular, PRMDL is capable of expressing clustered architecture features such as incomplete bypass networks, multiple register files, along with functional units shared or distributed among multiple issue slots, diverse conditional operation mappings, and more. The structure of PRMDL features separate software and hardware views on a processor. This ensures robustness of retargetability built into tools across several processor generations.

On design rule correct maze routing

Proceedings of European Design and Test Conference EDAC-ETC-EUROASIC, 1994

On Design Rule Correct Maze Routing. Ed P. Huijbregts, Jos T.J. van Eijndhoven and Jochen ...

The yorktown silicon compiler

TriMedia CPU64 application development environment

Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040), 1999

The architecture of the TriMedia CPU64 is based on the TM1000 DSPCPU. The original VLIW architecture has been extended with the concepts of vector processing and superoperations. The new vector operations and superoperations need to be supported by the compiler and simulator to make them accessible to application programmers. It was our intention to support these new features while remaining compliant with the ANSI C standard. This paper describes the mechanisms which were implemented to achieve this goal. Furthermore, the optimization of applications needs to address the vectorization of the functions to be implemented. Some general guidelines for producing efficient vectorized code are given.

TriMedia CPU64 architecture

Proceedings 1999 IEEE International Conference on Computer Design: VLSI in Computers and Processors (Cat. No.99CB37040), 1999

We present a new VLIW core as a successor to the TriMedia TM1000. The processor is targeted for embedded use in media-processing devices like DTVs and set-top boxes. Intended as a core, its design must be supplemented with on-chip co-processors to obtain a cost-effective system. Good performance is obtained through a uniform 64-bit 5 issue-slot VLIW design, supporting subword parallelism with an extensive instruction set optimized with respect to media-processing. Multi-slot 'super-ops' allow powerful multi-argument and multi-result operations. As an example, an IDCT algorithm shows a very low instruction count in comparison with other processors. To achieve good performance, critical sections in the application program source code need to be rewritten with vector data types and function calls for media operations. Benchmarking with several media applications was used to tune the instruction set and study cache behavior. This resulted in a VLIW architecture with wide data paths and relatively simple CPU control.

MPEG-compliant Entropy Decoding on FPGA-augmented TriMedia/CPU64

by Sorin Cotofana and Jos Van Eijndhoven

Proc. IEEE Symp. Field- …, 2002

The paper presents a Design Space Exploration (DSE) experiment which has been carried out in orde...

Color space conversion for MPEG decoding on FPGA-augmented TriMedia processor

Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003, 2003

A case study on Color Space Conversion (CSC) for MPEG decoding, carried out on an FPGA-augmented TriMedia processor, is presented. That is, a transform from Y'CbCr color space to R'G'B' color space is addressed. First, we outline the extension of TriMedia architecture consisting of FPGA-based Reconfigurable Functional Units (RFU) and associated instructions. Then we analyse a CSC (RFU-specific) instruction which can process four pixels per call, and propose a scheme to implement the CSC operation on RFU(s). When mapped on an ACEX EP1K100 FPGA, the proposed CSC exhibits a latency of 10 and a recovery of 2 TriMedia@200 MHz cycles, and occupies 57% of the device. By configuring the CSC facility on the RFU(s) at application load-time, color space conversion can be computed on FPGA-augmented TriMedia with 40% speed-up over the standard TriMedia.
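As a software-only illustration of the operation this abstract describes, the sketch below converts four pixels per call from Y'CbCr to R'G'B'. The coefficients are the commonly used ITU-R BT.601 studio-range values, which is an assumption: the abstract does not state which matrix or rounding the paper uses. The function names `ycbcr_to_rgb` and `csc4` are hypothetical and do not correspond to the RFU instruction itself.

```python
def clamp8(value):
    # Round to the nearest integer and saturate to the 8-bit range.
    return max(0, min(255, int(round(value))))


def ycbcr_to_rgb(y, cb, cr):
    # BT.601 studio-range conversion (assumed): Y' in 16..235, Cb/Cr in 16..240.
    c, d, e = y - 16, cb - 128, cr - 128
    r = 1.164 * c + 1.596 * e
    g = 1.164 * c - 0.392 * d - 0.813 * e
    b = 1.164 * c + 2.017 * d
    return clamp8(r), clamp8(g), clamp8(b)


def csc4(pixels):
    # Convert exactly four (Y', Cb, Cr) pixels in one call,
    # mirroring the four-pixels-per-call granularity from the abstract.
    if len(pixels) != 4:
        raise ValueError("csc4 expects exactly four pixels")
    return [ycbcr_to_rgb(*p) for p in pixels]


black = (16, 128, 128)
white = (235, 128, 128)
converted = csc4([black, white, black, white])
```

With these assumed coefficients, studio black maps to (0, 0, 0) and studio white to (255, 255, 255).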

MPEG macroblock parsing and pel reconstruction on an FPGA-augmented TriMedia processor

by Jos Van Eijndhoven and Kees Vissers

Proceedings 2001 IEEE International Conference on Computer Design: VLSI in Computers and Processors. ICCD 2001, 2001

This paper describes an experiment which aims to reveal the potential impact on performance yielded by augmenting a TriMedia-CPU64 processor with a multiple-context FPGA core. We first propose an extension of the TriMedia-CPU64 architecture, which consists of a Reconfigurable Functional Unit and its associated instructions. Then, we address the decoding of variable-length codes on such extended TriMedia and describe the architecture and FPGA implementation of a Variable-Length Decoder (VLD) computing facility. When mapped on an ACEX EP1K100 FPGA, the proposed VLD exhibits a latency of cycles. Preliminary results indicate that by configuring each of the VLD and 1-D IDCT (which is described elsewhere) facilities on a different FPGA context, and by activating the contexts as needed, the augmented TriMedia can perform macroblock parsing followed up by pel reconstruction with an improvement of 20-25% over the standard TriMedia.

Multirate integration in a direct simulation method

Proceedings of the European Design Automation Conference, 1990., EDAC., 1990

Multirate integration is a technique in which a set of differential equations is solved with different timesteps assigned to subsets of equations [4][10]. In circuit simulation this is commonly used in the waveform relaxation method, where different subcircuits are analyzed independently from the others. An important and obvious advantage is the simulation efficiency: subcircuits which are temporarily changing relatively slowly,
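The idea of assigning different timesteps to subsets of equations can be shown on a toy system. The sketch below is an illustrative assumption, not the direct simulation method of the paper: it integrates a slow equation x' = -x with a large step and a fast equation y' = -50 (y - x) with a smaller step, using forward Euler and holding x constant during the fast sub-steps. The test system, the step counts, and the function name `multirate_euler` are all hypothetical.

```python
import math


def multirate_euler(t_end=1.0, macro_steps=100, micro_per_macro=10):
    big_step = t_end / macro_steps           # timestep for the slow equation
    small_step = big_step / micro_per_macro  # timestep for the fast equation
    x, y = 1.0, 0.0
    for _ in range(macro_steps):
        # Fast subsystem: several small steps, with the slow variable x
        # frozen at its value from the start of the macro step.
        for _ in range(micro_per_macro):
            y += small_step * (-50.0 * (y - x))
        # Slow subsystem: a single large step.
        x += big_step * (-x)
    return x, y


x_end, y_end = multirate_euler()
exact_x = math.exp(-1.0)  # analytic solution of x' = -x, x(0) = 1, at t = 1
```

The slow equation is evaluated 100 times while the fast one is evaluated 1000 times, which is where the efficiency gain mentioned in the abstract would come from when most of a circuit is changing slowly.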

PLATO: a new piecewise linear simulation tool

Proceedings of the European Design Automation Conference, 1990., EDAC., 1990

... The corresponding LU decomposition can be updated by a very eff...

Latency exploitation in circuit simulation by sparse matrix techniques

1988., IEEE International Symposium on Circuits and Systems, 1988

The most important operations for a circuit simulator are component model linearization, updating the network matrix, performing large unsymmetric decomposition on this matrix, and solving the network variables by forward and backward substitution. Methods are presented to keep all these operations localized to the part of the network that is active at the current time point, thus obtaining a considerable

A Reconfigurable Functional Unit for TriMedia/CPU64. A Case Study

by Sorin Cotofana and Jos Van Eijndhoven

The paper presents a case study on augmenting a TriMedia/CPU64 processor with a Reconfigurable (FPGA-based) Functional Unit (RFU). We first propose an extension of the TriMedia/CPU64 architecture, which consists of an RFU and its associated instructions. Then, we address the computation of the 8×8 IDCT on such extended TriMedia, and propose a scheme to implement an 8-point IDCT operation on the RFU. Further, we address the decoding of Variable Length Codes and describe the FPGA implementation of a Variable Length Decoder (VLD) computing facility. When mapped on an ACEX EP1K100 FPGA from Altera, our 8-point IDCT exhibits a latency of 16 and a recovery of 2 TriMedia cycles, and occupies 42% of the FPGA's logic array blocks. The proposed VLD exhibits a latency of 7 TriMedia cycles when mapped on the same FPGA, and utilizes 6 of its embedded array blocks. By using the 8-point IDCT computing facility, an 8×8 IDCT including all overheads can be computed with the throughput of 1/32 IDCT/cycle. Also, with the proposed VLD computing facility, a single DCT coefficient can be decoded in 11 cycles including all overheads. Simulation results indicate that by configuring each of the 8-point IDCT and VLD computing facilities on a different FPGA context, and by activating the contexts as needed, the augmented TriMedia can perform MPEG macroblock parsing followed up by a pel reconstruction with an improvement of 20-25% over the standard TriMedia.
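The quoted throughput of 1/32 IDCT/cycle is consistent with a simple back-of-the-envelope count, sketched below. The count rests on assumptions that the abstract does not spell out: that the block transform in question is the 8×8 IDCT used in MPEG, that it is computed by row-column decomposition (eight row passes plus eight column passes of the 8-point IDCT), and that a new 8-point IDCT can be issued every recovery interval of 2 cycles. The paper's own accounting of overheads may differ.

```python
from fractions import Fraction

RECOVERY_CYCLES = 2     # from the abstract: recovery of the 8-point IDCT
ROWS, COLUMNS = 8, 8    # assumed 8x8 block, row-column decomposition

one_d_transforms = ROWS + COLUMNS                      # 16 passes of the 8-point IDCT
cycles_per_block = one_d_transforms * RECOVERY_CYCLES  # issue-limited cycle count
throughput = Fraction(1, cycles_per_block)             # block IDCTs per cycle
```

Under these assumptions the count comes to 32 cycles per block, matching the figure in the abstract.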
