... Issue-slots in a Chip Multiprocessor. Fakhar Anjam, Muhammad Nadeem, and Stephan Wong. Computer Engineering Laboratory, Delft University of Technology, Delft, The Netherlands. E-mail: {F.Anjam, M.Nadeem, JSSMWong}@tudelft.nl ...
2011 9th IEEE International Conference on Industrial Informatics, 2011
The growing complexity and diversity of embedded systems, combined with continuing demands for higher performance and lower power consumption, place increasing pressure on embedded platform designers. To address these problems, the Embedded Reconfigurable Architectures (ERA) project investigates innovations in both hardware and tools to create next-generation embedded systems. Leveraging adaptive hardware enables maximum performance for given power budgets. We design our platform via a structured approach that allows integration of reconfigurable computing elements, network fabrics, and memory hierarchy components. Commercially available, off-the-shelf processors are combined with other proprietary and application-specific, dedicated cores. These computing and network elements can adapt their composition, organization, and even instruction-set architectures in an effort to provide the best possible trade-offs in performance and power for the given application(s). Likewise, network elements and topologies and the memory hierarchy organization can be selected both statically at design time and dynamically at run-time. Hardware details are exposed to the operating system, run-time system, compiler, and applications. This combination supports fast platform prototyping of highly efficient embedded system designs. Our design philosophy supports the freedom to flexibly tune all these hardware elements, enabling a better choice of power/performance trade-offs than that afforded by the current state of the art.
2010 International Conference on Field-Programmable Technology, 2010
... Wong, and Faisal Nadeem. Computer Engineering Laboratory, Delft University of Technology, Delft, The Netherlands. E-mail: {F.Anjam, JSSMWong, MFNadeem}@tudelft.nl. Abstract: In this paper, we present the design and implementation of a BRAM-based multiported ...
2010 International Conference on Field-Programmable Technology, 2010
In this paper, we present a very long instruction word (VLIW) softcore processor implemented in an FPGA. The processor's instruction set architecture (ISA) is based on the VEX ISA, and its issue-width can be dynamically adjusted. The processor has two 2-issue cores, which can run independently. When not in use, each core can be put into a lower-power mode by gating off its source clock. The two 2-issue cores can also be combined at run-time to form one larger 4-issue core. Applications/kernels with more instruction-level parallelism (ILP), such as matrix multiplication, FFT, and DFT, can be run on the larger 4-issue core to exploit the available ILP. Applications with more data-level parallelism (DLP), such as AES encryption/decryption and ADPCM encode/decode, can be run on the two 2-issue cores with the data divided between them. We utilize the Xilinx partial reconfiguration flow to implement our design. The partial bitstreams that combine the two 2-issue cores into one 4-issue core, or split it back, are 59 kbytes in size. The minimum times required to reconfigure the processor and adjust the issue-slots are 0.893 ms and 0.148 ms for the Xilinx Virtex-II Pro and Virtex-4 FPGAs, respectively.
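The reported reconfiguration times are consistent with a simple bandwidth model: partial-bitstream size divided by the configuration-port throughput of the target device. The C sketch below illustrates that estimate; the ICAP parameters used (an 8-bit port at 66 MHz for the Virtex-II Pro and a 32-bit port at 100 MHz for the Virtex-4) are assumed typical values and are not taken from the paper.

```c
/* Hypothetical back-of-the-envelope estimate of partial-reconfiguration
 * time: bitstream size divided by ICAP throughput. The ICAP figures below
 * (8-bit @ 66 MHz for Virtex-II Pro, 32-bit @ 100 MHz for Virtex-4) are
 * assumed typical values, not figures from the paper. */
#include <stdio.h>

static double reconfig_time_ms(double bitstream_bytes,
                               double icap_bytes_per_cycle,
                               double icap_clock_hz)
{
    double bytes_per_second = icap_bytes_per_cycle * icap_clock_hz;
    return bitstream_bytes / bytes_per_second * 1e3;  /* seconds -> ms */
}

int main(void)
{
    double bitstream = 59e3;  /* 59 kbytes of partial bitstream */

    printf("Virtex-II Pro: %.3f ms\n", reconfig_time_ms(bitstream, 1.0, 66e6));
    printf("Virtex-4:      %.3f ms\n", reconfig_time_ms(bitstream, 4.0, 100e6));
    return 0;
}
```

With these assumptions the model yields roughly 0.89 ms and 0.15 ms, in line with the reported 0.893 ms and 0.148 ms.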
2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010
In this paper, we present the design and implementation of an open-source reconfigurable very long instruction word (VLIW) multiprocessor system. The processor is implemented as a softcore on a field-programmable gate array (FPGA), and its instruction set architecture (ISA) is based on the Lx/ST200 ISA. The multiprocessor design is based on our earlier ρ-VEX processor design; since the ρ-VEX processor is parameterized, our multiprocessor design is parameterized as well. By utilizing a freely available compiler and simulator in our development framework, we are able to optimize our design and map any application written in C to our multiprocessor system. This VLIW multiprocessor can exploit the data-level as well as instruction-level parallelism inherent in an application to speed up its execution. More importantly, we achieve these results while saving expensive FPGA area through the sharing of resources. The results show that our dual-processor system (with shared resources) achieves twice the performance of a uni-processor system or a 2-cluster processor system for applications exhibiting data-level and instruction-level parallelism.
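To illustrate how a data-parallel workload could be divided between the two processors, the sketch below statically splits a loop's iteration space in half based on a core identifier. The get_core_id() helper and the whole splitting scheme are hypothetical; the paper's actual programming interface is not described here.

```c
/* Minimal sketch of dividing a data-parallel loop across two cores.
 * get_core_id() is a hypothetical placeholder; the real mechanism for
 * identifying and starting the cores is not described in the abstract. */
#include <stdio.h>

#define N 1024

static int get_core_id(void) { return 0; /* assume core 0 in this sketch */ }

void vector_add(const int *a, const int *b, int *c)
{
    int half  = N / 2;
    int start = get_core_id() * half;   /* core 0: [0, N/2), core 1: [N/2, N) */
    int end   = start + half;

    for (int i = start; i < end; i++)
        c[i] = a[i] + b[i];
}

int main(void)
{
    static int a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    vector_add(a, b, c);                 /* covers only core 0's half here */
    printf("c[0] = %d, c[%d] = %d\n", c[0], N / 2 - 1, c[N / 2 - 1]);
    return 0;
}
```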
This paper presents the design and implementation of configurable fault-tolerance techniques for a configurable VLIW processor. The processor can be configured for 2, 4, or 8 issue-slots with different types of execution functional units (FUs), and its instruction set architecture (ISA) is based on the VEX ISA. Separate techniques are employed to protect different modules of the processor from single-event upset (SEU) errors. Parity checking is utilized to detect errors in the instruction and data memories and the general-purpose register file (GR), while a triple modular redundancy (TMR) approach is employed for all synchronous flip-flops (FFs). At design-time, a user can choose between the standard non-fault-tolerant design, a fault-tolerant design in which the fault tolerance is permanently enabled, and a fault-tolerant design in which the fault tolerance can be enabled and disabled at run-time. These options enable a user to trade off hardware resources, performance, and power consumption. A simulation-based technique is utilized for testing purposes. The processor is implemented in a Xilinx Virtex-6 FPGA as well as synthesized for a 90 nm ASIC technology. Compared to permanently enabled fault tolerance, considerable power savings (up to 25.93% for the FPGA and 70.22% for the ASIC) can be achieved by disabling the fault tolerance at run-time in scenarios where it is not required.
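The two protection mechanisms mentioned, parity for the memories and register file and TMR for the flip-flops, boil down to simple bit-level operations. The C sketch below shows only that logic (an even-parity bit over a 32-bit word and a bitwise 2-of-3 majority vote); in the actual processor these are hardware structures, so the code is purely illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Even-parity bit over a 32-bit word: 1 if the number of set bits is odd.
 * Stored alongside the word; a mismatch on readback flags an SEU. */
static uint32_t parity32(uint32_t w)
{
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;
}

/* Bitwise 2-of-3 majority vote, the core of triple modular redundancy:
 * each output bit takes the value held by at least two of the three copies. */
static uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

int main(void)
{
    uint32_t word = 0x0000000Fu;                  /* 4 set bits -> parity 0 */
    uint32_t a = 0xA5A5A5A5u, b = 0xA5A5A5A5u;
    uint32_t corrupted = a ^ 0x00000010u;         /* single bit flip */

    printf("parity(0x%08X) = %u\n", word, parity32(word));
    printf("vote = 0x%08X\n", tmr_vote(a, b, corrupted)); /* recovers 0xA5A5A5A5 */
    return 0;
}
```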
2010 International Symposium on System on Chip, 2010
In this paper, we present a low-power, high-throughput hardware implementation of a deblocking filter core for H.264/AVC, targeting battery-powered multimedia electronic devices. The hardware implementation is based on an optimized deblocking filter algorithm that requires 50% fewer addition operations. Full or partial filtering-skip scenarios are evaluated at an early stage in the filter processing chain to avoid unnecessary operations. Moreover, independent processing blocks are identified and implemented with gated clocks. An efficient control block that dynamically activates/deactivates these independent processing blocks, together with a pipelined implementation, yields a deblocking filter design that is both low-power and high-throughput. Experimental results suggest that dynamic power consumption is reduced by up to 50% compared with state-of-the-art designs in the literature. The deblocking filter core consumes 43 mW of dynamic power on a Xilinx Virtex-II FPGA and 16.36 μW when synthesized using a 0.18 μm CMOS standard cell library. The FPGA implementation on the Virtex-II can operate at 76 MHz, whereas the maximum operating frequency for the 0.18 μm process technology is 200 MHz. Our deblocking filter hardware implementation can easily provide real-time filtering for the full-HD video format (1920×1080) at 30 fps with an operating frequency as low as 59 MHz.
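The early skip evaluation mentioned above corresponds to the standard H.264 sample-level filtering condition: an edge is filtered only when the pixel differences across it fall below the alpha/beta thresholds. The C sketch below shows that generic check as an illustration; it is not the paper's optimized algorithm.

```c
#include <stdio.h>
#include <stdlib.h>   /* abs() */

/* Generic H.264 sample-level skip check for one edge position: filtering is
 * applied only if all three conditions hold, so edges that fail the test can
 * be skipped entirely. p1,p0 and q0,q1 are the pixels on either side of the
 * block edge; alpha and beta are thresholds derived from the quantization
 * parameter. */
static int filter_edge(int p1, int p0, int q0, int q1, int alpha, int beta)
{
    return abs(p0 - q0) < alpha &&
           abs(p1 - p0) < beta  &&
           abs(q1 - q0) < beta;
}

int main(void)
{
    /* Smooth region: small differences, so the edge gets filtered. */
    printf("%d\n", filter_edge(100, 101, 103, 104, 20, 10));  /* prints 1 */
    /* Genuine image edge: large step across the boundary, filtering skipped. */
    printf("%d\n", filter_edge(100, 101, 160, 161, 20, 10));  /* prints 0 */
    return 0;
}
```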
2013 11th IEEE International Conference on Industrial Informatics (INDIN), 2013
The growing complexity and diversity of embedded systems, combined with continuing demands for higher performance and lower power consumption, place increasing pressure on embedded platform designers. The target of the ERA project is to offer a holistic, multi-dimensional methodology that addresses these problems in a unified framework by exploiting the inter- and intra-synergies between the reconfigurable hardware (core, memory, and network resources), the reconfigurable software (compiler and tools), and the run-time system. Starting from the hardware level, we design our platform via a structured approach that allows integration of reconfigurable computing elements, network fabrics, and memory hierarchy components. These hardware elements can adapt their composition, organization, and even instruction-set architectures to exploit trade-offs in performance and power. Appropriate hardware resources can be selected both statically at design time and dynamically at run time. Hardware details are exposed to our custom operating system, our custom runtime system, and our adaptive compiler, and are even visible all the way up to the application level. The design philosophy followed in the ERA project proved effective not only in enabling a better choice of power/performance trade-offs but also in supporting fast platform prototyping of high-efficiency embedded system designs. In this paper, we present a brief overview of the design approach, the major outcomes, and the lessons learned in the ERA project.