The remarkable capabilities and intricate nature of Artificial Intelligence (AI) have dramatically escalated the imperative for specialized AI accelerators. Nonetheless, designing these accelerators for various AI workloads remains both labor- and time-intensive. While existing design exploration and automation tools can partially alleviate the need for extensive human involvement, they still demand substantial hardware expertise, posing a barrier to non-experts and stifling AI accelerator development. Motivated by the astonishing potential of large language models (LLMs) for generating high-quality content in response to human language instructions, we embark on this work to examine the possibility of harnessing LLMs to automate AI accelerator design. Through this endeavor, we develop GPT4AIGChip, a framework intended to democratize AI accelerator design by leveraging human natural languages instead of domain-specific languages. Specifically, we first perform an in-depth investigation into LLMs' limitations and capabilities for AI accelerator design, thus aiding our understanding of our current position and garnering insights into LLM-powered automated AI accelerator design. Drawing inspiration from these insights, GPT4AIGChip features an automated demo-augmented prompt-generation pipeline that utilizes in-context learning to guide LLMs towards creating high-quality AI accelerator designs. To our knowledge, this work is the first to demonstrate an effective pipeline for LLM-powered automated AI accelerator generation. Accordingly, we anticipate that our insights and framework can serve as a catalyst for innovations in next-generation LLM-powered design automation tools.
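As a rough illustration of the demo-augmented, in-context-learning prompt pipeline described above, the sketch below assembles a prompt from retrieved instruction/design demo pairs before querying an LLM. `retrieve_demos` and `llm_generate` are hypothetical placeholders, not the paper's actual interfaces.

```python
# Minimal sketch of a demo-augmented, in-context-learning prompt pipeline.
# `retrieve_demos` and `llm_generate` are hypothetical placeholders for
# illustration only; they are not the paper's actual API.

from typing import Callable


def build_prompt(task_description: str, demos: list[tuple[str, str]]) -> str:
    """Assemble a prompt from retrieved (instruction, design-snippet) demo pairs."""
    parts = ["You are an AI accelerator design assistant."]
    for instruction, design in demos:
        parts.append(f"### Instruction:\n{instruction}\n### Design:\n{design}")
    parts.append(f"### Instruction:\n{task_description}\n### Design:")
    return "\n\n".join(parts)


def generate_accelerator(task_description: str,
                         retrieve_demos: Callable[[str, int], list[tuple[str, str]]],
                         llm_generate: Callable[[str], str],
                         num_demos: int = 2) -> str:
    """Retrieve the most relevant demos, build the prompt, and query the LLM."""
    demos = retrieve_demos(task_description, num_demos)
    prompt = build_prompt(task_description, demos)
    return llm_generate(prompt)
```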
Driven by the explosive interest in applying deep reinforcement learning (DRL) agents to numerous real-time control and decision-making applications, there has been a growing demand to deploy DRL agents to empower daily-life intelligent devices, while the prohibitive complexity of DRL stands at odds with limited on-device resources. In this work, we propose an Automated Agent Accelerator Co-Search (A3C-S) framework, which to the best of our knowledge is the first to automatically co-search the optimally matched DRL agents and accelerators that maximize both test scores and hardware efficiency. Extensive experiments consistently validate the superiority of our A3C-S over state-of-the-art techniques.
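To make the idea of jointly searching the agent and its accelerator concrete, here is a minimal toy co-search loop. The search spaces, the scoring function, and the use of random search are illustrative assumptions, not the actual A3C-S algorithm.

```python
# Toy random co-search over paired (agent, accelerator) configurations.
import random

AGENT_SPACE = {"hidden_dim": [64, 128, 256], "num_layers": [2, 3, 4]}
ACCEL_SPACE = {"pe_array": [8, 16, 32], "buffer_kb": [64, 128, 256]}


def sample(space):
    return {k: random.choice(v) for k, v in space.items()}


def evaluate(agent, accel):
    # Placeholder objective: trade a (fake) test score against a (fake)
    # hardware cost; a real co-search would train the agent and measure hardware.
    score = agent["hidden_dim"] * agent["num_layers"] / 1024
    cost = (agent["hidden_dim"] * agent["num_layers"]) / (accel["pe_array"] * accel["buffer_kb"])
    return score - cost


best, best_val = None, float("-inf")
for _ in range(100):
    agent, accel = sample(AGENT_SPACE), sample(ACCEL_SPACE)
    val = evaluate(agent, accel)
    if val > best_val:
        best, best_val = (agent, accel), val
print("best agent/accelerator pair:", best)
```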
Tiny deep learning has attracted increasing attention, driven by the substantial demand for deploying deep learning on numerous intelligent Internet-of-Things devices. However, it is still challenging to unleash tiny deep learning's full potential on both large-scale datasets and downstream tasks due to the under-fitting issues caused by the limited model capacity of tiny neural networks (TNNs). To this end, we propose a framework called NetBooster to empower tiny deep learning by augmenting the architectures of TNNs via an expansion-then-contraction strategy. Extensive experiments show that NetBooster consistently outperforms state-of-the-art tiny deep learning solutions.
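For intuition about expansion-then-contraction, the sketch below expands a single linear layer into a wider two-layer factorization and later folds it back exactly. The purely linear expansion is a deliberate simplification so the contraction is an exact matrix merge; NetBooster's actual strategy operates on full TNN blocks and is not limited to this linear special case.

```python
# Minimal expansion-then-contraction sketch on one linear layer (simplified).
import torch
import torch.nn as nn

d_in, d_hidden, d_out = 16, 64, 16

# Expansion: train in a wider two-layer factorized form.
expanded = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Linear(d_hidden, d_out))

# Contraction: fold the two trained layers back into a single layer so the
# deployed model keeps its tiny footprint.
w1, b1 = expanded[0].weight, expanded[0].bias
w2, b2 = expanded[1].weight, expanded[1].bias
contracted = nn.Linear(d_in, d_out)
with torch.no_grad():
    contracted.weight.copy_(w2 @ w1)
    contracted.bias.copy_(w2 @ b1 + b2)

x = torch.randn(4, d_in)
assert torch.allclose(expanded(x), contracted(x), atol=1e-5)
```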
We introduce the concept of saturated absorption competition (SAC) microscopy as a means of providing sub-diffraction spatial resolution in fluorescence imaging. Unlike the post-competition process between stimulated and spontaneous emission that is used in stimulated emission depletion (STED) microscopy, SAC microscopy breaks the diffraction limit by emphasizing a pre-competition process that occurs in the fluorescence absorption stage, in a manner that shares similarities with ground-state depletion (GSD) microscopy. Moreover, unlike both STED and GSD microscopy, SAC microscopy offers a reduction in complexity and cost by utilizing only a single continuous-wave laser diode and an illumination intensity that is ~20× lower than that used in STED. Our approach can be physically implemented in a confocal microscope by dividing the input laser source into a time-modulated primary excitation beam and a doughnut-shaped saturation beam, and subsequently employing a homodyne detection scheme to select the modulated fluorescence signal. Herein, we provide both a physico-chemical model of SAC and a proof-of-concept experimental demonstration of a transverse spatial resolution of ~λ/6. Owing to the advantage of being non-invasive in observing living samples, far-field optical
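For intuition only, the expression below states the standard ground-state saturation form that saturation-based sub-diffraction methods build on; it is an assumed textbook relation, not the paper's full physico-chemical model.

```latex
% Assumed textbook saturation of the fluorescence signal with illumination
% intensity I, where I_sat denotes the saturation intensity:
F(I) \;\propto\; \frac{I/I_{\mathrm{sat}}}{1 + I/I_{\mathrm{sat}}}
```

Under such a model, regions covered by the strong doughnut-shaped saturation beam respond only weakly to the time-modulated excitation, so homodyne detection of the modulated fluorescence predominantly collects signal from the sub-diffraction region around the doughnut's central zero.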
The recent breakthroughs and prohibitive complexities of Deep Neural Networks (DNNs) have excited extensive interest in domain-specific DNN accelerators, among which optical DNN accelerators are particularly promising thanks to their unprecedented potential for superior performance-per-watt. However, the development of optical DNN accelerators is much slower than that of electrical DNN accelerators. One key challenge is that while many techniques have been developed to facilitate the development of electrical DNN accelerators, techniques that support or expedite optical DNN accelerator design remain much less explored, limiting both the achievable performance and the pace of innovation of optical DNN accelerators. To this end, we develop a first-of-its-kind framework dubbed O-HAS, which for the first time demonstrates automated Optical Hardware Accelerator Search for boosting both the acceleration efficiency and the development speed of optical DNN accelerators. Specifically, O-HAS consists of two integrated enablers: (1) an O-Cost Predictor, which can accurately yet efficiently predict an optical accelerator's energy and latency based on the DNN model parameters and the optical accelerator design; and (2) an O-Search Engine, which can automatically explore the large design space of optical DNN accelerators and identify the optimal accelerators (i.e., the micro-architectures and algorithm-to-accelerator mapping methods) that maximize the target acceleration efficiency. Extensive experiments and ablation studies consistently validate the effectiveness of both our O-Cost Predictor and O-Search Engine, as well as the excellent efficiency of O-HAS-generated optical accelerators.
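The sketch below shows the general shape of a predictor-driven accelerator search loop: an analytical cost proxy scores every candidate design and the search keeps the best one. The design space and the cost formula are illustrative stand-ins, not O-HAS's actual O-Cost Predictor or O-Search Engine.

```python
# Toy predictor-driven accelerator search loop (illustrative only).
import itertools

DESIGN_SPACE = {
    "num_rings": [64, 128, 256],     # assumed photonic compute units
    "wavelengths": [4, 8, 16],       # assumed WDM parallelism
    "dataflow": ["weight_stationary", "output_stationary"],
}


def predict_cost(design, macs=1e9):
    """Crude energy/latency proxy: more parallel hardware lowers latency but
    raises static energy. All constants are illustrative."""
    parallelism = design["num_rings"] * design["wavelengths"]
    latency = macs / parallelism
    energy = 0.5 * macs + 100.0 * parallelism
    if design["dataflow"] == "output_stationary":
        energy *= 0.9  # assumed reuse benefit
    return energy, latency


best, best_edp = None, float("inf")
for values in itertools.product(*DESIGN_SPACE.values()):
    design = dict(zip(DESIGN_SPACE.keys(), values))
    energy, latency = predict_cost(design)
    edp = energy * latency
    if edp < best_edp:
        best, best_edp = design, edp
print("best design by predicted energy-delay product:", best)
```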
This paper presents a novel network compression framework, Kernel Quantization (KQ), which aims to efficiently convert any pre-trained full-precision convolutional neural network (CNN) model into a low-precision version without significant performance loss. Unlike existing methods that struggle with weight bit-length, KQ has the potential to improve the compression ratio by considering the convolution kernel as the quantization unit. Inspired by the evolution from weight pruning to filter pruning, we propose to quantize at both the kernel and weight levels. Instead of representing each weight parameter with a low-bit index, we learn a kernel codebook and replace all kernels in the convolution layer with the corresponding low-bit indexes. Thus, KQ can represent the weight tensor in the convolution layer with low-bit indexes and a kernel codebook of limited size, which enables KQ to achieve a significant compression ratio. Then, we conduct a 6-bit parameter quantization on the kernel codebook to further reduce redundancy. Extensive experiments on the ImageNet classification task show that KQ needs only 1.05 and 1.62 bits on average to represent each parameter in the convolution layers of VGG and ResNet18, respectively, and achieves a state-of-the-art compression ratio with little accuracy loss.
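The sketch below illustrates kernel-level quantization on one convolution layer: all 3×3 kernels are clustered into a small shared codebook, and each kernel is stored as an index into it. The codebook size, the plain k-means clustering, and the bit-count estimate are illustrative choices, not the paper's exact training procedure.

```python
# Kernel-codebook quantization sketch for a single conv layer.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal((128, 64, 3, 3)).astype(np.float32)  # (out, in, kh, kw)
kernels = weights.reshape(-1, 9)                                   # one row per 3x3 kernel

K = 256                                          # codebook size -> 8-bit index per kernel
codebook = kernels[rng.choice(len(kernels), K, replace=False)].copy()

for _ in range(10):                              # plain k-means over kernels
    d = ((kernels[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    assign = d.argmin(1)
    for k in range(K):
        members = kernels[assign == k]
        if len(members):
            codebook[k] = members.mean(0)

# The layer is now stored as one index per kernel plus the shared codebook.
quantized = codebook[assign].reshape(weights.shape)
bits_per_weight = (len(kernels) * 8 + codebook.size * 32) / weights.size
print(f"approx. bits per conv weight: {bits_per_weight:.2f}")
```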
Proceedings of the ... AAAI Conference on Artificial Intelligence, Jun 28, 2022
Vision transformers (ViTs) have recently demonstrated great success in various computer vision tasks, motivating a tremendously increased interest in their deployment in many real-world IoT applications. However, powerful ViTs are often too computationally expensive to be fitted onto real-world resource-constrained devices, due to (1) their quadratically increasing complexity with the number of input tokens and (2) their overparameterized self-attention heads and model depth. In parallel, different images are of varying complexity and their different regions can contain various levels of visual information; e.g., a sky background is not as informative as a foreground object in object classification tasks. This indicates that treating all regions/tokens equally in terms of model complexity is unnecessary, while such opportunities for trimming down ViTs' complexity have not been fully explored. To this end, we propose a Multi-grained Input-Adaptive Vision TransFormer framework dubbed MIA-Former that can input-adaptively adjust the structure of ViTs at three coarse-to-fine granularities (i.e., model depth and the number of model heads/tokens). In particular, our MIA-Former adopts a low-cost network trained with a hybrid supervised and reinforcement training method to skip unnecessary layers, heads, and tokens in an input-adaptive manner, reducing the overall computational cost. Furthermore, an interesting side effect of MIA-Former is that its resulting ViTs are naturally equipped with improved robustness against adversarial attacks over their static counterparts, because MIA-Former's multi-grained dynamic control improves model diversity, similar to the effect of an ensemble, and thus increases the difficulty of adversarial attacks against all its sub-models. Extensive experiments and ablation studies validate that the proposed MIA-Former framework can (1) effectively allocate computation budgets adaptive to the difficulty of input images, achieving state-of-the-art (SOTA) accuracy-efficiency trade-offs, e.g., 20% computation savings with the same or even higher accuracy compared with SOTA dynamic transformer models, and (2) boost ViTs' robustness accuracy under various adversarial attacks over their vanilla counterparts by 2.4% and 3.0%, respectively. Our code is available at .
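To illustrate the token-level part of input-adaptive control, the toy sketch below uses a cheap controller to score tokens and forwards only the top-scoring ones through a transformer block. The controller architecture, the fixed keep ratio, and the single-granularity focus are illustrative simplifications of MIA-Former's multi-grained depth/head/token control.

```python
# Toy input-adaptive token selection before a transformer block.
import torch
import torch.nn as nn

dim, heads = 192, 3
block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
controller = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

x = torch.randn(2, 197, dim)                      # (batch, tokens, dim)
scores = controller(x).squeeze(-1)                # per-token importance
keep = int(0.7 * x.shape[1])                      # assumed keep ratio
idx = scores.topk(keep, dim=1).indices            # tokens to keep per image
kept = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, dim))
out = block(kept)                                 # compute only on kept tokens
print(out.shape)                                  # (2, keep, dim)
```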
Despite the impressive performance recently achieved by automatic speech recognition (ASR), we observe two primary challenges that hinder its broader applications: (1) the difficulty of introducing scalability into the model to support more languages with limited training, inference, and storage overhead, and (2) the need for low-resource adaptation ability that enables effective adaptation while avoiding over-fitting and catastrophic forgetting. Inspired by recent findings, we hypothesize that we can address the above challenges with modules widely shared across languages. To this end, we propose an ASR framework, dubbed Master-ASR, that, for the first time, simultaneously achieves strong multilingual scalability and low-resource adaptation ability thanks to its modularize-then-assemble strategy. Specifically, Master-ASR learns a small set of generalizable sub-modules and adaptively assembles them for different languages to reduce the multilingual overhead and enable effective knowledge transfer for low-resource adaptation. Extensive experiments and visualizations demonstrate that Master-ASR can effectively discover language similarity and improve multilingual and low-resource ASR performance over state-of-the-art (SOTA) methods, e.g., achieving a 0.13∼2.41 lower character error rate (CER) with 30% smaller inference overhead than SOTA solutions on multilingual ASR, and a comparable CER with nearly 50 times fewer trainable parameters than SOTA solutions on low-resource tuning.
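The toy layer below conveys the modularize-then-assemble idea: a small bank of shared sub-modules is combined with per-language mixing weights. The bank size, module shape, and softmax assembly are illustrative assumptions rather than Master-ASR's exact design.

```python
# Toy shared module bank assembled with per-language weights.
import torch
import torch.nn as nn


class AssembledAdapter(nn.Module):
    def __init__(self, dim=256, num_modules=8, num_languages=20):
        super().__init__()
        self.bank = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modules))
        # One learnable assembly vector per language over the shared bank.
        self.assembly = nn.Parameter(torch.zeros(num_languages, num_modules))

    def forward(self, x, lang_id):
        weights = torch.softmax(self.assembly[lang_id], dim=-1)      # (num_modules,)
        outputs = torch.stack([m(x) for m in self.bank], dim=0)      # (M, ..., dim)
        return (weights.view(-1, *([1] * (outputs.dim() - 1))) * outputs).sum(0)


adapter = AssembledAdapter()
features = torch.randn(4, 100, 256)           # (batch, frames, dim)
print(adapter(features, lang_id=3).shape)     # (4, 100, 256)
```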
In computer-aided diagnosis (CAD), diagnosing untrained diseases as known categories can cause serious medical accidents, which makes it crucial to distinguish new classes (open set) while preserving performance on the known classes (closed set) so as to enhance robustness. However, how to accurately define the decision boundary between known and unknown classes is still an open problem, as unknown classes are never seen during training, especially in the medical domain. Moreover, manipulating the latent distribution of known classes further influences that of the unknown classes and makes the problem even harder. In this paper, we propose the Centralized Space Learning (CSL) method to address the open-set recognition problem in CAD by learning a centralized space that separates the known and unknown classes with the assistance of proxy images generated by a generative adversarial network (GAN). With three steps, namely known-space initialization, unknown-anchor generation, and centralized-space refinement, CSL learns an optimized space distribution in which unknown samples cluster around the center while known samples spread away from it, achieving clear separation between the known and the unknown. Extensive experiments on multiple datasets and tasks illustrate the proposed CSL's practicability in CAD and its state-of-the-art open-set recognition performance.
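The objective sketched below captures the "centralized space" intuition: pull proxy/unknown features toward a center while pushing known-class features at least a margin away from it. The fixed origin center, margin value, and plain hinge form are illustrative assumptions, not the exact CSL formulation.

```python
# Toy centralized-space objective for open-set feature learning.
import torch


def centralized_space_loss(known_feats, proxy_feats, margin=5.0):
    center = torch.zeros_like(known_feats[0])                # assumed fixed center at the origin
    d_known = (known_feats - center).norm(dim=1)
    d_proxy = (proxy_feats - center).norm(dim=1)
    pull_unknown = (d_proxy ** 2).mean()                                 # proxies cluster at the center
    push_known = torch.clamp(margin - d_known, min=0).pow(2).mean()      # knowns stay outside the margin
    return pull_unknown + push_known


known = torch.randn(32, 128) * 3
proxy = torch.randn(32, 128)
print(centralized_space_loss(known, proxy))
```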
The promise of Deep Neural Network (DNN) powered Internet of Things (IoT) devices has motivated a tremendous demand for automated solutions to enable fast development and deployment of efficient (1) DNNs equipped with instantaneous accuracy-efficiency trade-off capability to accommodate the time-varying resources at IoT devices and (2) dataflows to optimize DNNs' execution efficiency on different devices. Therefore, we propose InstantNet to automatically generate and deploy instantaneously switchable-precision networks which operate at variable bit-widths. Extensive experiments show that the proposed InstantNet consistently outperforms state-of-the-art designs. Our codes are available at: .
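The sketch below shows what "switchable precision" means operationally: the same weights can be executed at any of several bit-widths chosen at inference time. The uniform symmetric quantizer and the bit-width list are illustrative assumptions, not InstantNet's actual training or quantization scheme.

```python
# Toy switchable-precision linear layer.
import torch
import torch.nn as nn


def quantize(w, bits):
    if bits >= 32:
        return w
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale


class SwitchableLinear(nn.Linear):
    def forward(self, x, bits=8):
        return nn.functional.linear(x, quantize(self.weight, bits), self.bias)


layer = SwitchableLinear(64, 64)
x = torch.randn(1, 64)
for bits in (8, 6, 4):            # instantly switch precision per deployment budget
    print(bits, layer(x, bits=bits).abs().mean().item())
```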
Despite the growing demand for tuning foundation vision transformers (FViTs) on downstream tasks, fully unleashing FViTs' potential under data-limited scenarios (e.g., few-shot tuning) remains a challenge due to FViTs' data-hungry nature. Common data augmentation techniques fall short in this context due to the limited features contained in the few-shot tuning data. To tackle this challenge, we first identify an opportunity for FViTs in few-shot tuning: pretrained FViTs themselves have already learned highly representative features from large-scale pretraining data, which are fully preserved during widely used parameter-efficient tuning. We thus hypothesize that leveraging those learned features to augment the tuning data can boost the effectiveness of few-shot FViT tuning. To this end, we propose a framework called Hint-based Data Augmentation (Hint-Aug), which aims to boost FViTs in few-shot tuning by augmenting the over-fitted parts of tuning samples with the learned features of pretrained FViTs. Specifically, Hint-Aug integrates two key enablers: (1) an Attentive Over-fitting Detector (AOD) to detect over-confident patches of foundation ViTs for potentially alleviating their over-fitting on the few-shot tuning data and (2) a Confusion-based Feature Infusion (CFI) module to infuse easy-to-confuse features from the pretrained FViTs into the over-confident patches detected by the AOD in order to enhance feature diversity during tuning. Extensive experiments and ablation studies on five datasets and three parameter-efficient tuning techniques consistently validate Hint-Aug's effectiveness: 0.04% ∼ 32.91% higher accuracy over the state-of-the-art (SOTA) data augmentation method under various low-shot settings. For example, on the Pet dataset, Hint-Aug achieves 2.22% higher accuracy with 50% less training data than SOTA data augmentation methods.
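The toy code below conveys the detect-then-infuse idea: flag patches that draw unusually high attention, then blend those patch embeddings with a prototype feature of an easy-to-confuse class. The attention threshold, blend ratio, prototype bank, and class count are illustrative assumptions, not Hint-Aug's exact AOD/CFI modules.

```python
# Toy over-confidence detection + feature infusion on patch embeddings.
import torch

B, N, D, C = 2, 196, 192, 37                                 # batch, patches, dim, classes (illustrative)
patch_embed = torch.randn(B, N, D)
attn_received = torch.softmax(torch.randn(B, N), dim=-1)     # attention mass received per patch
class_prototypes = torch.randn(C, D)                         # features distilled from a pretrained FViT
confusing_class = torch.randint(0, C, (B,))                  # class the model tends to confuse

over_conf = attn_received > (3.0 / N)                        # patches drawing well above average attention
hint = class_prototypes[confusing_class].unsqueeze(1)        # (B, 1, D)
alpha = 0.3                                                  # assumed blend ratio
augmented = torch.where(over_conf.unsqueeze(-1),
                        (1 - alpha) * patch_embed + alpha * hint,
                        patch_embed)
print(augmented.shape)                                       # (B, N, D)
```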
Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency and more extensive applications to resource-constrained platforms. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and Transformers for natural language processing (NLP) tasks: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns without severely hurting model accuracy (e.g., <=1.5% accuracy drop under a 90% pruning ratio), while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns, regularizing two levels of workloads without hurting accuracy; this largely reduces the attention computations while leaving room for alleviating the remaining dominant data movements. On top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator that simultaneously coordinates the aforementioned enforced denser and sparser workloads for boosted hardware utilization, while integrating on-chip encoder and decoder engines to leverage ViTCoD's algorithm pipeline for much reduced data movements. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3×, 142.9×, 86.0×, 10.1×, and 6.8× over general computing platforms (CPUs, EdgeGPUs, and GPUs) and prior-art Transformer accelerators (SpAtten and Sanger), respectively, under an attention sparsity of 90%. Our code implementation is available at .
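To give a feel for attention-map polarization, the sketch below scores each key column of an averaged attention map, keeps the top fraction as a "denser" region, and restricts the rest to a fixed sparse pattern. The split ratio, the random calibration stand-in, and the diagonal fallback pattern are illustrative assumptions, not ViTCoD's actual pruning procedure.

```python
# Toy polarization of a fixed ViT attention map into denser/sparser regions.
import torch

N = 197                                        # ViT tokens (fixed per model)
avg_attn = torch.rand(N, N)                    # stand-in for an attention map averaged over calibration data

dense_frac = 0.1
col_score = avg_attn.sum(0)                    # how much attention each key column receives
dense_cols = col_score.topk(int(dense_frac * N)).indices

mask = torch.zeros(N, N, dtype=torch.bool)
mask[:, dense_cols] = True                     # denser workload: globally attended keys
mask |= torch.eye(N, dtype=torch.bool)         # sparser workload: fixed (here diagonal) pattern
print(f"kept {mask.float().mean():.1%} of the attention map")
```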
Self-supervised learning (SSL) for rich speech representations has achieved empirical success in low-resource Automatic Speech Recognition (ASR) and other speech processing tasks, which can mitigate the need for a large amount of transcribed speech and thus has driven a growing demand for on-device ASR and other speech processing. However, advanced speech SSL models have become increasingly large, which contradicts the limited on-device resources. This gap could be more severe in multilingual/multitask scenarios requiring simultaneously recognizing multiple languages or executing multiple speech processing tasks. Additionally, strongly overparameterized speech SSL models tend to suffer from overfitting when being finetuned on low-resource speech corpora. This work aims to enhance the practical usage of speech SSL models towards a win-win in both enhanced efficiency and alleviated overfitting via our proposed S3-Router framework, which for the first time discovers that simply discarding no more than 10% of model weights via finetuning only the model connections of speech SSL models can achieve better accuracy than standard weight finetuning on downstream speech processing tasks. More importantly, S3-Router can serve as an all-in-one technique to enable (1) a new finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively analyze the learned speech representation. We believe S3-Router provides a new perspective for the practical deployment of speech SSL models. Our codes are available at: .
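The sketch below illustrates finetuning connections instead of weights: the weights stay frozen while a learnable score per weight selects the kept fraction via a straight-through top-k mask. The 90% keep ratio mirrors the abstract's "no more than 10% discarded"; the layer type and scoring scheme are illustrative.

```python
# Toy connection-only finetuning via a straight-through binary mask.
import torch
import torch.nn as nn


class MaskedLinear(nn.Module):
    def __init__(self, dim=256, keep_ratio=0.9):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) * 0.02, requires_grad=False)  # frozen
        self.score = nn.Parameter(torch.randn(dim, dim) * 0.01)                        # only this is finetuned
        self.keep_ratio = keep_ratio

    def forward(self, x):
        k = int(self.keep_ratio * self.score.numel())
        threshold = self.score.flatten().kthvalue(self.score.numel() - k + 1).values
        hard = (self.score >= threshold).float()
        mask = hard + self.score - self.score.detach()            # straight-through estimator
        return x @ (self.weight * mask).t()


layer = MaskedLinear()
out = layer(torch.randn(4, 256))
out.sum().backward()
print(layer.score.grad is not None)                               # True: gradients flow to the mask scores
```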
Proceedings of the 49th Annual International Symposium on Computer Architecture
Eye tracking has become an essential human-machine interaction modality for providing an immersive experience in numerous virtual and augmented reality (VR/AR) applications desiring high throughput (e.g., 240 FPS), small form factors, and enhanced visual privacy. However, existing eye tracking systems are still limited by their: (1) large form factor, largely due to the adopted bulky lens-based cameras; (2) high communication cost required between the camera and backend processor; and (3) potential visual privacy concerns, thus prohibiting their more extensive applications. To this end, we propose, develop, and validate a lensless FlatCam-based eye tracking algorithm and accelerator co-design framework dubbed EyeCoD to enable eye tracking systems with a much reduced form factor and boosted system efficiency without sacrificing tracking accuracy, paving the way for next-generation eye tracking solutions. On the system level, we advocate the use of lensless FlatCams instead of lens-based cameras to facilitate the small form factor needed in mobile eye tracking systems, which also leaves room for a dedicated sensing-processor co-design to reduce the required camera-processor communication latency. On the algorithm level, EyeCoD integrates a predict-then-focus pipeline that first predicts the region-of-interest (ROI) via segmentation and then focuses only on the ROI parts to estimate gaze directions, greatly reducing redundant computations and data movements. On the hardware level, we further develop a dedicated accelerator that (1) integrates a novel workload orchestration between the aforementioned segmentation and gaze estimation models, (2) leverages intra-channel reuse opportunities for depth-wise layers, (3) utilizes input feature-wise partition to save activation memory size, and (4) develops a sequential-write-parallel-read input buffer to alleviate the bandwidth requirement for the activation global buffer. On-silicon measurement and extensive experiments validate that our EyeCoD consistently reduces both the communication
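The sketch below conveys the predict-then-focus idea: a cheap segmentation stage predicts an eye ROI, and only the cropped ROI is passed to the gaze estimator. Both networks, the input resolution, and the crop size are illustrative stand-ins for EyeCoD's actual models.

```python
# Toy predict-then-focus pipeline: segment -> crop ROI -> estimate gaze.
import torch
import torch.nn as nn

segmenter = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 1, 3, padding=1))            # coarse ROI heatmap
gaze_head = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 128), nn.ReLU(),
                          nn.Linear(128, 2))                         # yaw/pitch

frame = torch.randn(1, 1, 256, 256)                                  # stand-in for a lensless measurement
heatmap = segmenter(frame)[0, 0]
cy, cx = divmod(int(heatmap.argmax()), heatmap.shape[1])             # ROI center
cy, cx = max(32, min(cy, 223)), max(32, min(cx, 223))                # keep the crop inside the frame
roi = frame[:, :, cy - 32:cy + 32, cx - 32:cx + 32]                  # focus: process only a 64x64 crop
gaze = gaze_head(roi)
print(gaze.shape)                                                    # (1, 2)
```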
Low-precision deep neural network (DNN) training is one of the most effective techniques for boosting DNNs' training efficiency, as it trims down the training cost from the finest bit level. While existing works mostly fix the model precision during the whole training process, a few pioneering works have shown that dynamic precision schedules help DNNs converge to a better accuracy while leading to a lower training cost than their static precision training counterparts. However, existing dynamic low-precision training methods rely on manually designed precision schedules to achieve advantageous efficiency and accuracy trade-offs, limiting their broader practical application and achievable performance. To this end, we propose LDP, a Learnable Dynamic Precision DNN training framework that can automatically learn a temporally and spatially dynamic precision schedule during training towards optimal accuracy and efficiency trade-offs. It is worth noting that LDP-trained DNNs are by nature efficient during inference. Furthermore, we visualize the resulting temporal and spatial precision schedules and distributions of LDP-trained DNNs on different tasks to better understand the corresponding DNNs' characteristics at different training stages and DNN layers, both during and after training, drawing insights for promoting further innovations. Extensive experiments and ablation studies (seven networks, five datasets, and three tasks) show that the proposed LDP consistently outperforms state-of-the-art (SOTA) low-precision DNN training techniques in terms of training efficiency and achieved accuracy trade-offs. For example, in addition to being automated, our LDP achieves a 0.31% higher accuracy with a 39.1% lower computational cost when training ResNet-20 on CIFAR-10 as compared with the best SOTA method.
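The sketch below shows one way a precision schedule can itself be made learnable: each layer owns a continuous bit-width parameter that is rounded in the forward pass with a straight-through estimator, so gradients reach the precision as well as the weights. The quantizer form and bit range are illustrative assumptions, not LDP's exact formulation.

```python
# Toy linear layer with a learnable, gradient-trained bit-width.
import torch
import torch.nn as nn


class LearnablePrecisionLinear(nn.Linear):
    def __init__(self, in_f, out_f):
        super().__init__(in_f, out_f)
        self.bit_param = nn.Parameter(torch.tensor(6.0))           # learnable precision

    def forward(self, x):
        bits = self.bit_param.clamp(2, 8)
        bits_int = bits.round().detach() + bits - bits.detach()    # straight-through on the bit-width
        qmax = 2 ** (bits_int - 1) - 1
        scale = self.weight.abs().max() / qmax
        w_int = torch.clamp((self.weight / scale).round(), -qmax, qmax)
        w_q = (w_int.detach() + self.weight / scale - (self.weight / scale).detach()) * scale
        return nn.functional.linear(x, w_q, self.bias)


layer = LearnablePrecisionLinear(32, 32)
loss = layer(torch.randn(8, 32)).pow(2).mean()
loss.backward()
print(layer.bit_param.grad)                                        # the precision itself receives gradients
```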