As systems become more complex, system designers are increasingly concerned about system-modeling tools and their impact on productivity and hardware design quality. In addition, they want to quickly produce a working hardware model, simulate it with the rest of the system, and synthesize and/or formally verify it for specific properties. Toward this end, designers are using textual languages based on high-level programming languages to express executable behaviors. Indeed, languages such as C, VHDL, and Verilog are common in large-scale system design and debugging. Undoubtedly, this growth in the use of textual programming languages stems from system designers' familiarity with general-purpose, high-level programming languages. Using programming languages for hardware specification can significantly shorten the system designer's learning curve and enables simulation of complete systems for correct functionality. There are pitfalls, however, in following a pure software-programming-language description to model hardware: mainly, inefficient results from synthesis tools. Consequently, language developers often modify and extend software programming languages to produce hardware description languages (HDLs) geared specifically to hardware modeling. Most semantic extensions concern structural components, exact event timing, and operational concurrency, concepts absent from most software programming languages.
Motivated by the necessity for parameter efficiency in distributed machine learning and AI-enabled edge devices, we provide a general and easy-to-implement method for significantly reducing the number of parameters of Convolutional Neural Networks (CNNs) during both the training and inference phases. We introduce a simple auxiliary neural network which can generate the convolutional filters of any CNN architecture from a low-dimensional latent space. This auxiliary neural network, which we call the "Convolutional Slice Generator" (CSG), is unique to the network and provides the association between its convolutional layers. During the training of the CNN, instead of training the filters of the convolutional layers, only the parameters of the CSG and their corresponding "code vectors" are trained. This results in a significant reduction of the number of parameters, since the CNN can be fully represented using only the parameters of the CSG, the code vect...
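As a rough, hypothetical illustration of the slice-generator idea summarized above (not the paper's actual CSG), the sketch below expands per-layer code vectors into filter slices through one shared linear map. All dimensions and names are assumptions, and the parameter savings appear only once many slices share the generator.

```python
import numpy as np

# Hypothetical sizes; these are illustrative assumptions, not the paper's settings.
CODE_DIM = 32                      # dimension of each trainable "code vector"
SLICE_SHAPE = (16, 16, 3, 3)       # one slice of convolutional filters (out, in, kH, kW)
SLICE_SIZE = int(np.prod(SLICE_SHAPE))

rng = np.random.default_rng(0)

# The shared generator: a single linear map from code space to filter space.
# In CSG-style training, only W, b, and the per-slice code vectors are learned.
W = 0.01 * rng.standard_normal((SLICE_SIZE, CODE_DIM))
b = np.zeros(SLICE_SIZE)

def generate_slice(code):
    """Expand one low-dimensional code vector into a full filter slice."""
    return (W @ code + b).reshape(SLICE_SHAPE)

# Different layers share the generator but keep their own code vectors.
codes = [rng.standard_normal(CODE_DIM) for _ in range(4)]
filters = [generate_slice(c) for c in codes]

# Parameter accounting: savings appear once many slices share one generator.
n_slices = 100
direct = n_slices * SLICE_SIZE
shared = W.size + b.size + n_slices * CODE_DIM
print(f"direct filters: {direct} params, generator + codes: {shared} params")
```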
Proceedings of the 18th Conference on Embedded Networked Sensor Systems, 2020
Batteryless sensors avoid battery replacement at the cost of slowing down or stopping their operations when there is not sufficient energy to harvest in the environment. While this strategy can work for some applications, event-based applications remain a challenge because events arrive sporadically and energy availability is uncertain. One solution is to turn on a sensor only right before an event happens, both to detect the event and to save as much energy as possible. The system therefore has to correctly predict events while managing limited resource availability. In this demo, we present Ember, an energy management system based on deep reinforcement learning that duty-cycles event-driven sensors in low-energy conditions. We show how our system learns environmental patterns over time and makes decisions to maximize the event detection rate for batteryless energy-harvesting sensor nodes subject to low energy availability. Furthermore, we show a novel self-supervised data collection algorithm that helps Ember discover new environmental patterns over time. For more details, we refer readers to the full paper on Ember [2]. CCS CONCEPTS: • Computer systems organization → Sensor networks; • Computing methodologies → Reinforcement learning.
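For intuition only, here is a minimal, bandit-style sketch of learning a wake/sleep duty-cycle policy over time-of-day slots. The state, actions, reward, and event model are illustrative assumptions and not Ember's actual deep-RL formulation.

```python
import random

SLOTS = 24                 # hypothetical discretization: one decision per hour
ACTIONS = (0, 1)           # 0 = stay asleep, 1 = wake the sensor
ENERGY_COST = 0.2          # assumed cost of waking (arbitrary units)
EVENT_PROB = [0.8 if 8 <= h < 18 else 0.05 for h in range(SLOTS)]  # toy event pattern

Q = [[0.0, 0.0] for _ in range(SLOTS)]
alpha, epsilon = 0.1, 0.1

def step(slot, action):
    """Toy environment: reward detections, penalize wasted wake-ups and misses."""
    event = random.random() < EVENT_PROB[slot]
    if action == 1:
        return (1.0 - ENERGY_COST) if event else -ENERGY_COST
    return -1.0 if event else 0.0   # missed event while asleep

for episode in range(5000):
    for slot in range(SLOTS):
        a = random.choice(ACTIONS) if random.random() < epsilon else int(Q[slot][1] > Q[slot][0])
        r = step(slot, a)
        Q[slot][a] += alpha * (r - Q[slot][a])   # bandit-style value update per slot

policy = [int(Q[h][1] > Q[h][0]) for h in range(SLOTS)]
print("wake schedule by hour:", policy)
```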
In this paper, we present a Branch Target Buffer (BTB) design for energy savings in set-associative instruction caches. We extend the functionality of a BTB by caching way predictions in addition to branch target addresses. Way prediction and branch target prediction are done in parallel. Instruction cache energy savings are achieved by accessing one cache way if the way prediction for a fetch is available. To increase the number of way predictions for higher energy savings, we modify the BTB management policy to allocate entries for non-branch instructions. Furthermore, we propose to partition a BTB into ways for branch instructions and ways for non-branch instructions to reduce the BTB energy as well. We evaluate the effectiveness of our BTB design and management policies with SPEC95 benchmarks. The best BTB configuration shows a 74% energy savings on average in a 4-way set-associative instruction cache, and the performance degradation is only 0.1%. When the instruction cache energy and the BTB energy are considered together, the average energy-delay product reduction is 65%.
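The following sketch, with made-up sizes and a simple LRU policy, shows the general shape of a BTB whose entries also carry an instruction-cache way prediction so that a hit can steer the fetch to a single way. It is an illustration of the concept, not the evaluated design.

```python
# Minimal sketch of a BTB whose entries also cache an instruction-cache way
# prediction, so that a BTB hit lets the fetch probe a single way. Sizes,
# fields, and the replacement policy are illustrative assumptions.

BTB_SETS, BTB_WAYS = 64, 4

class BTBEntry:
    def __init__(self, tag, target, way_pred, is_branch):
        self.tag = tag
        self.target = target        # branch target (None for non-branch entries)
        self.way_pred = way_pred    # predicted I-cache way for this fetch address
        self.is_branch = is_branch

btb = [[] for _ in range(BTB_SETS)]   # each set is a small LRU-ordered list

def btb_lookup(pc):
    s, tag = (pc // 4) % BTB_SETS, pc // (4 * BTB_SETS)
    for e in btb[s]:
        if e.tag == tag:
            btb[s].remove(e); btb[s].insert(0, e)      # LRU update
            return e.target, e.way_pred                # enables one-way I-cache access
    return None, None                                  # fall back to probing all ways

def btb_update(pc, target, way, is_branch):
    s, tag = (pc // 4) % BTB_SETS, pc // (4 * BTB_SETS)
    btb[s].insert(0, BTBEntry(tag, target, way, is_branch))
    if len(btb[s]) > BTB_WAYS:
        btb[s].pop()                                   # evict LRU entry
```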
Wireless sensor nodes are increasingly being tasked with computation- and communication-intensive functions while still subject to constraints related to energy availability. On these embedded platforms, once all low-power design techniques have been explored, duty-cycling the various subsystems remains the primary option for meeting the energy and power constraints. This requires the ability to provide spurts of high MIPS and high-bandwidth connections. However, due to the large overheads associated with duty-cycling the computation and communication subsystems, existing high-performance sensor platforms are not efficient in supporting such an option. In this paper, we present the design and optimization of a wireless gateway node (WGN) that bridges data from wireless sensor networks to Wi-Fi networks on an on-demand basis. We discuss our strategies to reduce duty-cycling-related costs by partitioning the system and by reducing the amount of time required to activate or deactivate the high-powered components. We compare the design choices and performance parameters with those made in the Intel Stargate platform to show the effectiveness of duty-cycling on our platform. We have built a working prototype, and the experimental results with two different power management schemes show significant reductions in latency and average power consumption compared to the Stargate.
19th IEEE International Parallel and Distributed Processing Symposium
The next generation of mobile systems with multimedia processing capabilities and wireless connectivity will be increasingly deployed in highly dynamic and distributed environments for multimedia playback and delivery (e.g., video streaming, multimedia conferencing). The challenge is to meet the heavy resource demands of multimedia applications under the stringent energy, computational, and bandwidth constraints of mobile systems, while constantly adapting to the global state changes of the distributed environment. In this paper, we present our initiatives under the FORGE framework to address the issue of delivering high-quality multimedia content in mobile environments. In order to cope with the resource-intensive nature of multimedia applications and a dynamically changing global state (e.g., node mobility, network congestion), an end-to-end approach to QoS-aware power optimization is required. We present a framework for coordinating energy-optimizing strategies across various layers of system implementation and functionality, and discuss techniques that can be employed to achieve energy gains for mobile multimedia systems.
Proceedings of the 2004 international symposium on Low power electronics and design - ISLPED '04, 2004
Traditionally, dynamic voltage scaling (DVS) techniques have focused on minimizing the processor energy consumption as opposed to the entire system energy consumption. The slowdown resulting from DVS can increase the energy consumption of components like memory and network interfaces. Furthermore, leakage power consumption is increasing with scaling device technology and must also be taken into account. In this work, we consider energy-efficient slowdown in a real-time task system. We present an algorithm to compute task slowdown factors based on the contribution of the processor leakage and the standby energy consumption of the resources in the system. Our simulation experiments using randomly generated task sets show, on average, 10% energy gains over traditional dynamic voltage scaling. We further combine slowdown with procrastination scheduling, which increases the average energy savings to 15%. We show that our scheduling approach minimizes the total static and dynamic energy consumption of the system-wide resources.
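A toy calculation along these lines, with invented constants and a simple cubic-dynamic-plus-leakage power model, shows why leakage shifts the energy-optimal slowdown away from the lowest speed:

```python
# Toy model of why leakage changes the optimal slowdown: total energy per job
# combines frequency-dependent dynamic energy with leakage/standby energy that
# accrues for as long as the job runs. All constants are illustrative.

C_DYN = 1.0      # dynamic power coefficient (P_dyn ~ C * s**3 at normalized speed s)
P_LEAK = 0.3     # leakage + standby power of the processor and peripherals
WORK = 1.0       # cycles of work, normalized so runtime = WORK / s

def energy(s):
    runtime = WORK / s
    return C_DYN * s ** 3 * runtime + P_LEAK * runtime   # dynamic + leakage energy

best = min((energy(s / 100.0), s / 100.0) for s in range(20, 101))
print(f"minimum-energy slowdown factor ~ {best[1]:.2f}, energy = {best[0]:.3f}")
```

With these made-up numbers the minimum falls near a slowdown factor of 0.53 rather than at the slowest allowed speed, which is the qualitative effect the abstract describes.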
Proceedings of the 33rd ACM SIGPLAN Conference on Programming Language Design and Implementation, 2012
We present a novel technique for verifying properties of data-parallel GPU programs via test amplification. The key insight behind our work is that we can use static information flow to amplify the result of a single test execution over the set of all inputs and interleavings that affect the property being verified. We empirically demonstrate the effectiveness of test amplification for verifying race-freedom and determinism over a large number of standard GPU kernels, by showing that the result of verifying a single dynamic execution can be amplified over the massive space of possible data inputs and thread interleavings.
2007 IEEE/ACM International Conference on Computer-Aided Design, 2007
Stepwise refinement is at the core of many approaches to the synthesis and optimization of hardware and software systems. For instance, it can be used to build a synthesis approach for digital circuits from high-level specifications. It can also be used for post-synthesis modification, such as in Engineering Change Orders (ECOs). Therefore, checking whether a system, modeled as a set of concurrent processes, is a refinement of another is of tremendous value. In this paper, we focus on concurrent systems modeled as Communicating Sequential Processes (CSP) and show that their refinements can be validated using insights from translation validation, automated theorem proving, and relational approaches to reasoning about programs. The novelty of our approach is that it handles infinite state spaces in a fully automated manner. We have implemented our refinement-checking technique and have applied it to a variety of refinements. We present the details of our algorithm and experimental results. As an example, we were able to automatically check an infinite-state-space buffer refinement that cannot be checked by current state-of-the-art tools such as FDR. We were also able to check the data part of an industrial case study on the EP2 system.
2013 International Green Computing Conference Proceedings, 2013
The recent increase in energy prices has led researchers to find better ways of capacity provisioning in data centers to reduce the energy wasted due to variation in workload. This paper explores the opportunity for cost saving by utilizing the flexibility in Service Level Agreements (SLAs) and proposes a novel approach for capacity provisioning under bounded latency requirements of the workload. We investigate how many servers should be kept active and how much workload should be delayed for energy saving while meeting every deadline. We present an offline LP formulation for capacity provisioning by dynamic deferral and give two online algorithms to determine the capacity of the data center and the assignment of workload to servers dynamically. We prove the feasibility of the online algorithms and show that their worst-case performance is bounded by a constant factor with respect to the offline formulation. We validate our algorithms on a MapReduce workload by provisioning capacity on a Hadoop cluster and show that the algorithms perform much better in practice than naive 'follow the workload' provisioning, resulting in 20-40% cost savings.
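To make the flavor of such an offline formulation concrete, here is a small, hypothetical LP sketch with one-slot deferral and a toy switching cost, solved with scipy. The model and numbers are assumptions and do not reproduce the paper's formulation or its online algorithms.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative offline LP in the spirit of capacity provisioning with dynamic
# deferral: pick servers x_t per slot and work d_t deferred by one slot, paying
# an energy cost per server-slot plus a toy switching cost for ramping servers
# up. Everything here is an assumption for illustration.

demand = np.array([30, 80, 120, 60, 20, 90], dtype=float)  # arriving work per slot
cap, beta = 10.0, 3.0                                       # server capacity, switching cost
T = len(demand)

# Decision vector z = [x_0..x_{T-1}, d_0..d_{T-1}, u_0..u_{T-1}]
c = np.concatenate([np.ones(T), np.zeros(T), beta * np.ones(T)])

A_ub, b_ub = [], []
for t in range(T):                       # demand_t + d_{t-1} - d_t <= cap * x_t
    row = np.zeros(3 * T)
    row[t], row[T + t] = -cap, -1.0
    if t > 0:
        row[T + t - 1] = 1.0
    A_ub.append(row); b_ub.append(-demand[t])
for t in range(T):                       # u_t >= x_t - x_{t-1}  (with x_{-1} = 0)
    row = np.zeros(3 * T)
    row[t], row[2 * T + t] = 1.0, -1.0
    if t > 0:
        row[t - 1] = -1.0
    A_ub.append(row); b_ub.append(0.0)

bounds = [(0, None)] * T + [(0, None)] * (T - 1) + [(0, 0)] + [(0, None)] * T

res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds, method="highs")
print("servers per slot:", np.round(res.x[:T], 1))
print("deferred work  :", np.round(res.x[T:2 * T], 1))
```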
Proceedings of the 14th international symposium on Systems synthesis - ISSS '01, 2001
The increasing heterogeneity and complexity of VLSI systems has made C++ popular for building simulation and synthesis models at higher levels of abstraction. Currently, there are several different embodiments of C++-based environments, mostly in the form of hardware modeling libraries built on top of C++. However, the semantic gap between hardware modeling concepts and software programming language constructs poses several issues that require critical examination. In this paper, we address the issue of interoperability between models built using different C++-based modeling libraries, or even modeling "styles", including home-grown C++ models. Model interoperability is the ability to use C++-based descriptions across different C++-based modeling environments. Two important aspects of interoperability are model composability and model reusability. In this paper we focus on model reusability, analyzing various dimensions of the reusability of C++-based models in an integration environment for building SoC models. We show how an inheritance-based composition may be used to make two distinct C++-based class libraries interoperate. We also outline the implementation of a dynamic composition environment, which allows automatic run-time delegation-based composition to achieve interoperability. These strategies allow system integrators to focus on design composition rather than the software programming details inherent in the current inheritance-based solutions.
Proceedings of the conference on Design, automation and test in Europe - DATE '00, 2000
Memory-intensive applications require considerable arithmetic for the computation and selection of the different memory access pointers. These memory address calculations often involve complex (non)linear arithmetic expressions which have to be calculated during program execution under tight timing constraints, thus becoming a crucial bottleneck in overall system performance. This paper explores the applicability and effectiveness of source-level optimisations (as opposed to instruction-level ones) for address computations in the context of multimedia. We propose and evaluate two processor-target-independent source-level optimisation techniques, namely global-scope operation cost minimisation complemented with loop-invariant code hoisting, and non-linear operator strength reduction. The transformations attempt to achieve minimal code execution within loops and reduced operator strengths. The effectiveness of the transformations is demonstrated with two real-life multimedia application kernels by comparing the improvements in the number of execution cycles, before and after applying the systematic source-level optimisations, using state-of-the-art C compilers on several popular RISC platforms.
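A before/after toy example of the two transformations named above, loop-invariant code hoisting of an address subexpression and strength reduction of the per-iteration multiply, written in Python as a stand-in for the C source; the kernel and constants are invented for illustration.

```python
# Toy before/after illustration of hoisting and strength reduction applied to
# an image-addressing loop. Python stands in for the C source here.

W, H, BPP = 640, 480, 3
img = bytearray((i % 251) for i in range(W * H * BPP))

def before(x0, y0, n):
    s = 0
    for i in range(n):
        # Address recomputed from scratch each iteration: two multiplies plus
        # a loop-invariant subexpression (y0 * W) evaluated every time.
        addr = ((y0 * W) + (x0 + i)) * BPP
        s += img[addr]
    return s

def after(x0, y0, n):
    s = 0
    base = (y0 * W + x0) * BPP   # loop-invariant part hoisted out of the loop
    addr = base
    for i in range(n):
        s += img[addr]
        addr += BPP              # strength reduction: per-iteration multiply becomes an add
    return s

assert before(10, 20, 100) == after(10, 20, 100)
```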
Proceedings of the 4th international conference on Mobile systems, applications and services - MobiSys 2006, 2006
CoolSpots enable a wireless mobile device to automatically switch between multiple radio interfaces, such as WiFi and Bluetooth, in order to increase battery lifetime. The main contribution of this work is an exploration of the policies that enable a system to switch among these interfaces, each with diverse radio characteristics and different ranges, in order to save power, supported by detailed quantitative measurements. The system and policies do not require any changes to the mobile applications themselves, and the changes required to existing infrastructure are minimal. Results are reported for a suite of commonly used applications, such as file transfer, web browsing, and streaming media, across a range of operating conditions. Experimental validation of the CoolSpot system on a mobile research platform shows substantial energy savings: more than a 50% reduction in energy consumption of the wireless subsystem is possible, with an associated increase in the effective battery lifetime.
Ubiquitous embedded systems are revolutionizing our daily lives. Whole systems on a chip deliver unprecedented computation power at ever-decreasing costs. However, their complexity makes their design with traditional RTL-based flows extremely challenging. Complexity in such systems arises not only from the diversity of the technologies, from RF front-ends to baseband DSP software, that must be integrated on-chip, but also from the fact that such systems must increasingly be built from parts that have been designed separately, using different tools and flows. High abstraction levels and component reuse are essential. Often such systems are highly networked and rely on sophisticated communication mechanisms. Architectural design and performance analysis of such networked system-chips is a crucial part of the embedded system design process. Two basic methods for tackling the growing complexity of system-on-chip design are emerging: formal specification and design methods, and platfor...
The performance of multi-processor simulation is determined by how often simulators exchange events with one another and how accurately simulators model their behaviors. Previous techniques have limited their applicability or sacrificed accuracy for performance. In this paper, we observe that inaccuracy comes from events which arrive between event-exchange boundaries. We propose cycle-accurate transaction-driven simulation, which maintains event-exchange boundaries at bus transactions but compensates for the resulting inaccuracy. The proposed technique is implemented in the CATS framework, and our experiment with 64 processors achieves 1.2 M processor cycles/s.
IEEE Transactions on Circuits and Systems I: Regular Papers, 2009
We consider the problem of adjusting the speeds of multiple computer processors sharing the same thermal environment, such as a chip or multi-chip package. We assume that the speed of each processor (and associated variables, such as power supply voltage) can be controlled, and we model the dissipated power of a processor as a positive and strictly increasing convex function of its speed. We show that the problem of processor speed control subject to thermal constraints for the environment is a convex optimization problem. We present an efficient infeasible-start primal-dual interior-point method for solving the problem. We also present a decentralized method, using dual decomposition. Both of these approaches can be interpreted as nonlinear static control laws which adjust the processor speeds based on the measured temperatures in the system. We give a numerical example to illustrate the performance of the algorithms.
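In the same spirit, though not the paper's model or interior-point solver, a minimal convex program of this kind can be written down directly, for example with CVXPY, using an assumed linear thermal-coupling matrix and a quadratic power model:

```python
import cvxpy as cp
import numpy as np

# Minimal convex program in the spirit of the abstract: pick speeds for
# processors that share a thermal environment so total throughput is maximized
# while every temperature stays below a limit. The linear thermal model, the
# quadratic power model, and all constants are illustrative assumptions.

n = 4                                          # processors sharing one package
G = 0.5 * np.eye(n) + 0.1 * np.ones((n, n))    # thermal coupling (deg C per watt), nonnegative
T_amb, T_max = 45.0, 85.0                      # ambient and limit temperatures (deg C)
alpha, beta = 2.0, 1.0                         # power model: P_i = alpha*s_i^2 + beta*s_i

s = cp.Variable(n, nonneg=True)
power = alpha * cp.square(s) + beta * s        # convex and increasing for s >= 0
temps = T_amb + G @ power                      # nonnegative map of convex power -> convex

prob = cp.Problem(cp.Maximize(cp.sum(s)),
                  [temps <= T_max, s <= 3.0])  # speed cap in normalized units
prob.solve()
print("speeds:", np.round(s.value, 3))
print("temps :", np.round(T_amb + G @ (alpha * s.value ** 2 + beta * s.value), 1))
```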
IN RECENT YEARS, cores have captured the imagination of designers who understand the potential of using these cells like integrated circuits on a PC board in building on-chip systems. (See the "What is a core cell?" box.) With a rich cell library of predesigned, preverified circuit blocks, cores provide an attractive means to import technology to a system integrator and differentiate products by leveraging intellectual property advantages. Most importantly, core use shortens the time to market for new system designs through design reuse. Practical implementation of this design scenario, however, is fraught with unresolved issues: design methods for building single-chip systems, challenges in test and sign-off for these systems, and intellectual property licensing, protection, and liability. Here, we examine the evolving design flow for microelectronic systems, the market for core cells, and the challenges in using core cells for design, integration, assembly, and test of on-chip systems.
As the complexity of system design increases, the use of pre-designed components, such as general-purpose microprocessors, provides an effective way to reduce the complexity of synthesized hardware. While the design problem of systems that contain processors and ASIC chips is not new, computer-aided synthesis of such heterogeneous or mixed systems poses challenging problems because of the differences in the model and rate of computation between application-specific hardware and processor software. In this article, we demonstrate the feasibility of achieving synthesis of heterogeneous systems which uses timing constraints to delegate tasks between hardware and software such that the final implementation meets the required performance constraints.
Dynamic voltage scaling (DVS) is a known effective mechanism for reducing CPU energy consumption without significant performance degradation. While a lot of work has been done on inter-task scheduling algorithms to implement DVS under operating system control, new research challenges exist in intra-task DVS techniques under software and compiler control. In this paper we introduce a novel intra-task DVS technique under compiler control using program checkpoints. Checkpoints are...
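The general idea behind checkpoint-driven intra-task DVS can be sketched as follows, with an invented worst-case-remaining-cycles table and frequency set. This illustrates the concept only, not the paper's compiler algorithm or checkpoint placement.

```python
# Sketch of the general intra-task DVS idea: at each checkpoint the remaining
# worst-case cycles are compared against the time left to the deadline and the
# frequency is set just high enough to finish on time. Checkpoint placement,
# the WCET table, and the frequency levels are illustrative assumptions.

FREQS = [200e6, 400e6, 600e6, 800e6, 1000e6]    # available frequencies (Hz)
WCRC = {0: 8e6, 1: 5e6, 2: 1.5e6}               # worst-case remaining cycles at each checkpoint

def pick_frequency(checkpoint, time_left):
    """Lowest frequency that can still retire the remaining worst case in time."""
    needed = WCRC[checkpoint] / time_left
    for f in FREQS:
        if f >= needed:
            return f
    return FREQS[-1]

# Example: deadline 20 ms away at checkpoint 0, 14 ms at 1, 9 ms at 2.
for cp_id, t_left in [(0, 0.020), (1, 0.014), (2, 0.009)]:
    print(f"checkpoint {cp_id}: run at {pick_frequency(cp_id, t_left) / 1e6:.0f} MHz")
```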
This article examines two different mechanisms for saving power in battery-operated embedded systems. The first strategy is that the system can be placed in a sleep state if it is idle. However, a fixed amount of energy is required to bring the system back into an active state in which it can resume work. The second way in which power savings can be achieved is by varying the speed at which jobs are run. We utilize a power consumption curve P(s) which indicates the power consumption level given a particular speed s. We assume that P(s) is convex, nondecreasing, and nonnegative for s ≥ 0. The problem is to schedule arriving jobs in a way that minimizes total energy use and so that each job is completed after its release time and before its deadline. We assume that all jobs can be preempted and resumed at no cost. Although each problem has been considered separately, this is the first theoretical analysis of systems that can use both mechanisms. We give an offline algorithm that i...
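Two worked numbers, using an assumed convex curve P(s) and an assumed wake-up energy, illustrate the trade-offs behind both mechanisms described above; none of the values are taken from the article.

```python
# Worked numbers for the two mechanisms in the abstract, with an assumed convex
# power curve P(s) = 10 + 2*s**2 (idle power 10) and wake-up energy E_WAKE = 40.

def P(s):
    return 10 + 2 * s ** 2          # power at speed s (nonnegative, convex, nondecreasing)

E_WAKE = 40.0

# Mechanism 1: sleep during an idle gap of length t only if E_WAKE < P(0) * t.
for gap in (2.0, 10.0):
    sleep = E_WAKE < P(0) * gap
    print(f"idle gap {gap:>4}: {'sleep' if sleep else 'stay active'}")

# Mechanism 2: energy for one unit of work at speed s is P(s)/s; convexity of P
# makes this ratio minimal at an intermediate "critical" speed, not at s -> 0.
for s in (1.0, 2.0, 3.0, 4.0):
    print(f"speed {s}: energy per unit work = {P(s) / s:.2f}")
```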