Youngmin Yi

Papers by Youngmin Yi

Towards Real-time CNN Inference from a Video Stream on a Mobile GPU (WiP Paper)

The 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, 2020

While several frameworks exist for CNN inference on mobile GPUs, they do not achieve real-time processing for most CNNs that aim at reasonable accuracy, since they all employ a kernel-by-kernel execution model and do not yet effectively support INT8 quantization. In this paper, we reveal that mobile GPUs, unlike server GPUs, suffer from large kernel launch overhead, and then propose an on-device deep learning inference framework that can achieve real-time inference of CNNs on mobile GPUs by removing kernel launch overhead and by effectively exploiting INT8 quantization. We evaluated the proposed framework with a state-of-the-art CNN-based face detector (RetinaFace) and observed up to 2.01× speedup compared to ARM Compute Library (ACL) on a commodity smartphone.

CPU-GPU 이기종 플랫폼에서 하둡 맵리듀스의 가속: CKY 파서 사례 분석 (Accelerating Hadoop MapReduce on CPU-GPU Heterogeneous Platforms: A Case Study with CKY Parser)

Big data computing is now prevalent, and the Hadoop MapReduce framework is widely used for its simple programming model. Meanwhile, General-Purpose Graphics Processing Units (GPGPUs) have become very popular, and applications in various domains have been successfully accelerated using GPUs. In this paper, we propose a method to use GPUs within the Hadoop MapReduce framework. We then propose a static partitioning method that accounts for the different capabilities of CPU mappers and GPU mappers, and a dynamic scheduling method that handles dynamic input sizes. Compared to execution on a single CPU, the CKY parser on a 14-node Hadoop cluster with 12 CPU cores and 1 GPU per node achieves a 245-times speedup. Compared to execution on a 14-node Hadoop cluster with 12 CPU cores and no GPU per node, it achieves a 2.5-times speedup. Our proposed approach for combined CPU and GPU mapper execution leads to an additional speedup, resulting in a total speedup of 2.8 times.
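The abstract does not spell out the static partitioning rule. One plausible reading is that input records are split in proportion to the measured throughput of each mapper type. The following is a minimal sketch under that assumption; the function name, its parameters, and the throughput figures are hypothetical and do not come from the paper.

```python
def static_partition(num_records, cpu_mappers, gpu_mappers, cpu_rate, gpu_rate):
    """Split records between CPU and GPU mappers in proportion to throughput.

    cpu_rate and gpu_rate are records/second for ONE mapper of each kind
    (hypothetical values, e.g. profiled offline).
    Returns (per-CPU-mapper record counts, per-GPU-mapper record counts).
    """
    total_rate = cpu_mappers * cpu_rate + gpu_mappers * gpu_rate
    if total_rate <= 0:
        raise ValueError("at least one mapper with positive throughput is required")

    # Ideal proportional share per mapper, rounded down to whole records.
    cpu_share = int(num_records * cpu_rate / total_rate)
    gpu_share = int(num_records * gpu_rate / total_rate)
    shares = [cpu_share] * cpu_mappers + [gpu_share] * gpu_mappers

    # Hand out records lost to rounding one by one, fastest mappers first.
    def rate_of(i):
        return cpu_rate if i < cpu_mappers else gpu_rate

    order = sorted(range(len(shares)), key=lambda i: -rate_of(i))
    leftover = num_records - sum(shares)
    for k in range(leftover):
        shares[order[k % len(order)]] += 1

    return shares[:cpu_mappers], shares[cpu_mappers:]


# One node with 12 CPU mappers and 1 GPU mapper that is assumed 8x faster.
cpu_parts, gpu_parts = static_partition(1000, 12, 1, cpu_rate=1.0, gpu_rate=8.0)
print(cpu_parts, gpu_parts)
```

With these assumed rates, every mapper finishes its share in about the same time, which is the goal of a capability-aware static split.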

CUDA를 이용한 PCA 기반 얼굴인식의 가속 (Acceleration of PCA based Face Recognition using CUDA)

Face recognition is important in many applications, including surveillance, biometrics, and other domains, and fast face recognition is required if one wants to train and test on more images or to increase the resolution of the input image for better recognition accuracy. Meanwhile, Graphics Processing Units (GPUs) have become widely available, offering the opportunity for real-time face recognition even for larger sets of high-resolution images. In this paper, we explore the design space of parallelizing a PCA (Principal Component Analysis) based face recognition algorithm and propose a fast face recognizer on GPUs that exploits the fine-grained data parallelism found in the algorithm. Our best results with the CUDA face recognizer show over 40-fold speedups compared to a sequential C implementation.

Understanding and bridging the gaps in current GNN performance optimizations

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

Graph Neural Networks (GNNs) have recently drawn rapidly increasing interest in many domains for their effectiveness in learning over graphs. Maximizing GNN performance is essential for many tasks, but how to do so is still only preliminarily understood. In this work, we provide an in-depth examination of state-of-the-art GNN frameworks, revealing five major gaps in how current frameworks optimize GNN performance, especially in handling the special complexities of GNNs compared with traditional graph or DNN operations. Based on these insights, we put together a set of optimizations to fill the gaps. These optimizations leverage state-of-the-art GPU optimization techniques and tailor them to the special properties of GNNs. Experimental results show that these optimizations achieve 1.37× to 15.5× performance improvement over state-of-the-art frameworks on various GNN models.

Performance Evaluation of INT8 Quantized Inference on Mobile GPUs

IEEE Access, 2021

Over the past several years, the need for on-device deep learning has rapidly increased, and the performance of mobile GPUs has improved significantly. As a viable approach to efficient on-device deep learning, INT8 quantized inference has been actively studied, but there are currently few frameworks that support INT8 quantization for mobile GPUs. This paper presents a unified framework that integrates various INT8 quantization methods, such as symmetric, asymmetric, per-layer, and per-channel, and discusses their impact on accuracy and efficiency on recent mobile GPUs. Moreover, we discuss the performance and accuracy of INT8 quantized Winograd convolution and propose INT8 Winograd convolution with F(2 × 2, 3 × 3), where weight tensors are quantized in INT4 and input tensors are quantized in INT6. We evaluated the performance of INT8 methods, including INT8 Winograd, for ResNet50, MobileNet-v1, and VGG16 on Mali G52, G72, and G76 GPUs on Odroid N2, Galaxy S9, and Galaxy Note 10+, respectively. INT8 quantized inference based on General Matrix Multiplication (GEMM) was 1.67× faster than FP32 GEMM for ResNet50 on Mali G52, and was further accelerated by batch normalization folding and by the proposed INT8 Winograd convolution, achieving a 2.45× speedup in total with an accuracy loss of only 0.31%p.
Index terms: On-device deep learning, INT8 quantization, INT8 Winograd convolution, mobile GPU.
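The quantization variants named in the abstract (symmetric, asymmetric, per-layer, per-channel) can be illustrated with a few lines of plain Python. This is a generic textbook sketch, not the paper's framework: the function names are hypothetical, tensors are represented as flat lists of floats, and the signed range [-128, 127] is an assumption.

```python
def symmetric_params(values, num_bits=8):
    """Symmetric quantization: real 0.0 maps to integer 0, range is [-max_abs, max_abs]."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return scale, 0


def asymmetric_params(values, num_bits=8):
    """Asymmetric quantization: the full [min, max] range is mapped via a zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo = min(min(values), 0.0)                # the range must contain 0.0
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Map floats to integers, clamping to the representable range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


def per_channel_symmetric(weights):
    """Per-channel: one scale per output channel instead of one per layer.

    weights is a list of channels, each a flat list of floats.
    """
    return [symmetric_params(channel)[0] for channel in weights]


# Per-layer symmetric quantization of a small tensor.
scale, zp = symmetric_params([-1.0, 0.5, 2.0])
print(quantize([-1.0, 0.5, 2.0], scale, zp))
```

Per-channel scales matter when channels have very different value ranges: a single per-layer scale would be dominated by the widest channel and waste integer resolution on the others.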

BPNet: Branch-pruned Conditional Neural Network for Systematic Time-accuracy Tradeoff

2020 57th ACM/IEEE Design Automation Conference (DAC), 2020

Recently, there have been attempts to execute neural networks conditionally, using auxiliary classifiers that allow early termination depending on the difficulty of the input. This can reduce execution time or energy consumption with no or negligible accuracy decrease. However, previous studies do not consider in a systematic fashion how many auxiliary classifiers, or branches, should be added, or where. In this paper, we propose the Branch-pruned Conditional Neural Network (BPNet) and a methodology by which the time-accuracy tradeoff for a conditional neural network can be found systematically. We applied BPNet to SqueezeNet, ResNet-20, and VGG-16 with CIFAR-10 and CIFAR-100. BPNet achieves up to 3.15× speedup without any accuracy drop compared to the base networks.
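Conditional execution with auxiliary classifiers can be sketched as follows. The exit rule used here (stop when the top softmax probability reaches a threshold) is a common choice and an assumption on our part; the abstract does not state which criterion BPNet uses, and all names below are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_inference(x, stages, branches, final_classifier, threshold=0.9):
    """Run stages in order and exit at the first sufficiently confident branch.

    stages: list of callables, feature -> feature.
    branches: dict {stage_index: callable feature -> logits}. Stage indices
        without an entry have no auxiliary classifier (a pruned position).
    Returns (predicted_class, exit_point), where exit_point is the stage
    index of the branch that fired, or "final".
    """
    feature = x
    for i, stage in enumerate(stages):
        feature = stage(feature)
        branch = branches.get(i)
        if branch is not None:
            probs = softmax(branch(feature))
            confidence = max(probs)
            if confidence >= threshold:
                return probs.index(confidence), i   # easy input: stop early
    probs = softmax(final_classifier(feature))
    return probs.index(max(probs)), "final"


# Toy network: two stages, one auxiliary classifier after stage 0.
stages = [lambda f: f + 1, lambda f: f * 2]
branches = {0: lambda f: [f * 10, 0.0]}
final = lambda f: [0.0, f]
print(conditional_inference(1.0, stages, branches, final))
print(conditional_inference(-0.9, stages, branches, final))
```

Which stage indices appear in `branches` is exactly the design decision the abstract says earlier work left unsystematic: each branch adds cost for hard inputs but saves the remaining stages for easy ones.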

NNSim: Fast Performance Estimation Based on Sampled Simulation of GPGPU Kernels for Neural Networks

2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018

Existing GPU simulators are too slow to use for neural networks implemented on GPUs. For fast performance estimation, we propose a novel hybrid method that combines analytical performance modeling with sampled simulation of GPUs. By taking full advantage of the repeated computation in neural networks, three sampling techniques are devised: Inter-Kernel sampling, Intra-Kernel sampling, and Streaming Multiprocessor sampling. The key technique is to estimate the average IPC through sampled simulation, considering the effect of the warp scheduler and memory access contention. Compared with GPGPU-Sim, the proposed technique reduces the simulation time by up to 450 times with less than 5.0% accuracy loss.
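The core arithmetic behind IPC-based estimation is that a kernel's cycle count is approximately its instruction count divided by its average IPC. The sketch below combines this with a simplified form of inter-kernel sampling, in which kernels with the same signature reuse one sampled IPC. This is our reading of the abstract rather than the paper's actual method, and the `simulate_ipc` callback and kernel signatures are hypothetical.

```python
def estimate_cycles(kernels, simulate_ipc):
    """Estimate total cycles for a sequence of kernel launches.

    kernels: list of (signature, instruction_count) tuples, in launch order.
    simulate_ipc: callable signature -> average IPC, standing in for an
        expensive sampled simulation; it is invoked once per distinct signature.
    Returns (estimated_cycles, number_of_simulations_run).
    """
    ipc_cache = {}
    total_cycles = 0.0
    for signature, instructions in kernels:
        if signature not in ipc_cache:
            # Only the first kernel with this signature is actually simulated.
            ipc_cache[signature] = simulate_ipc(signature)
        total_cycles += instructions / ipc_cache[signature]
    return total_cycles, len(ipc_cache)


# Hypothetical sampled IPC values for two kernel types.
sampled_ipc = {"conv3x3": 2.0, "fc": 0.5}
kernels = [("conv3x3", 1000), ("conv3x3", 1000), ("fc", 500)]
print(estimate_cycles(kernels, sampled_ipc.get))
```

Because neural networks launch many kernels with identical shapes, the number of simulations grows with the number of distinct layer configurations rather than with network depth.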

Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

The pipeline is an important programming pattern, but GPUs, designed mostly for data-level parallel execution, lack an efficient mechanism to support pipeline programming and execution. This paper provides a systematic examination of various existing pipeline execution models on GPUs and analyzes their strengths and weaknesses. To address their shortcomings, this paper then proposes three new execution models equipped with much improved controllability, including a hybrid model that is capable of combining the strengths of all. These insights ultimately lead to the development of a software programming framework named VersaPipe. With VersaPipe, users only need to write the operations for each pipeline stage. VersaPipe then automatically assembles the stages into a hybrid execution model and configures it to achieve the best performance. Experiments on a set of pipeline benchmarks and a real-world face detection application show that VersaPipe produces up to 6.90× (2.88× on average) speedups over the original manual implementations.
CCS Concepts: General and reference → Performance; Computing methodologies → Parallel computing methodologies; Computer systems organization → Heterogeneous (hybrid) systems.

A fully data parallel WFST-based large vocabulary continuous speech recognition on a graphics processing unit

Interspeech 2009, 2009

Tremendous compute throughput is becoming available in personal desktop and laptop systems through the use of graphics processing units (GPUs). However, exploiting this resource requires re-architecting an application to fit a data parallel programming model. The complex graph traversal routines in the inference process for large vocabulary continuous speech recognition (LVCSR) have been considered by many as unsuitable for extensive parallelization. We explore and demonstrate a fully data parallel implementation of a speech inference engine on NVIDIA's GTX280 GPU. Our implementation consists of two phases: a compute-intensive observation probability computation phase and a communication-intensive graph traversal phase. We take advantage of dynamic elimination of redundant computation in the compute-intensive phase while maintaining close-to-peak execution efficiency. We also demonstrate the importance of exploring application-level trade-offs in the communication-intensive graph traversal phase to adapt the algorithm to data parallel execution on GPUs. On a 3.1-hour speech data set, we achieve more than 11× speedup compared to a highly optimized sequential implementation on an Intel Core i7 without sacrificing accuracy.

Exploiting Activation Sparsity for Fast CNN Inference on Mobile GPUs

ACM Transactions on Embedded Computing Systems, 2021

Over the past several years, the need for on-device deep learning has been rapidly increasing, and efficient CNN inference on mobile platforms has been actively researched. Sparsity exploitation has been one of the most active research themes, but the studies mostly focus on weight sparsity obtained by weight pruning. Activation sparsity, in contrast, requires compression at runtime for every input tensor. Hence, research on activation sparsity mainly targets NPUs, which can process it efficiently with their own hardware logic. In this paper, we observe that it is difficult to accelerate CNN inference on mobile GPUs with natural activation sparsity and that the widely used CSR-based sparse convolution is not sufficiently effective due to the compression overhead. We propose several novel sparsification methods that can boost activation sparsity without harming accuracy. In particular, we selectively sparsify some layers with an extremely high sparsity and adopt sparse convolution or ...
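The runtime compression cost mentioned in the abstract comes from the fact that, unlike pruned weights, activations must be converted to a sparse format for every input. The sketch below shows CSR (compressed sparse row) compression of a small post-ReLU activation matrix and a CSR matrix-vector product, which stands in here for the sparse convolution. It is a generic illustration in plain Python with hypothetical function names, not the paper's GPU kernels.

```python
def relu(matrix):
    """Apply ReLU element-wise; negative activations become exact zeros."""
    return [[v if v > 0 else 0 for v in row] for row in matrix]


def sparsity(matrix):
    """Fraction of entries that are exactly zero."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for v in row if v == 0)
    return zeros / total


def to_csr(matrix):
    """Compress a dense matrix into CSR arrays (values, col_indices, row_ptr)."""
    values, col_indices, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(j)
        row_ptr.append(len(values))   # marks where the next row starts
    return values, col_indices, row_ptr


def csr_matvec(values, col_indices, row_ptr, vec):
    """Multiply a CSR matrix by a dense vector, touching only nonzeros."""
    out = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * vec[col_indices[k]]
        out.append(acc)
    return out


activations = relu([[-1, 2, -3], [-4, -5, -6], [1, -2, 3]])
print(sparsity(activations))
print(to_csr(activations))
```

The `to_csr` pass scans every element of the tensor, so its cost is paid on each inference regardless of how sparse the result is; the multiply only pays off when sparsity is high enough to offset that scan.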

GOPipe

Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 2019

Recent studies have shown promising performance benefits of pipelined stencil applications. An important factor in the computing efficiency of such pipelines is the granularity of a task. We present GOPipe, the first granularity-oblivious programming framework for efficient pipelined stencil executions. With GOPipe, programmers no longer need to specify the appropriate task granularity. GOPipe automatically finds it and schedules tasks of that granularity while observing all inter-task and inter-stage data dependencies. In our experiments on four real-life applications, GOPipe outperforms the state of the art by up to 4.57× with much better programming productivity.

Scheduling of Deep Learning Applications Onto Heterogeneous Processors in an Embedded Device

IEEE Access, 2020

As the need for on-device machine learning increases, embedded devices tend to be equipped with heterogeneous processors that include a multi-core CPU, a GPU, and/or a DNN accelerator called a Neural Processing Unit (NPU). Scheduling multiple deep learning (DL) applications on such embedded devices poses several technical challenges. First, a task can be mapped onto a single core or any number of available cores, so we need to consider various possible configurations of CPU cores. Second, embedded devices usually apply Dynamic Voltage and Frequency Scaling (DVFS) to reduce energy consumption at run-time, so we need to consider the effect of DVFS when profiling task execution times. Third, to avoid overheating, it is recommended to limit core utilization. Lastly, if the hot-plugging option is turned on, some cores will be shut down at run-time when core utilization is not high enough. In this paper, we propose a scheduling technique based on a Genetic Algorithm to run DL applications on heterogeneous processors, considering all of these issues. First, we aim to optimize the throughput of a single deep learning application. Next, we aim to find the Pareto-optimal scheduling of multiple DL applications in terms of the response time of each DL application and the overall energy consumption under the given throughput constraints of the DL applications. The proposed technique is verified with real DL networks running on two embedded devices, Galaxy S9 and HiKey970.
Index terms: Deep learning scheduling, genetic algorithm, heterogeneous processor, mobile device.

Real-Time and Energy-Efficient Face Detection on CPU-GPU Heterogeneous Embedded Platforms

IEICE Transactions on Information and Systems, 2018

As energy efficiency has become a major design constraint or objective, heterogeneous manycore architectures have emerged as mainstream target platforms not only in server systems but also in embedded systems. Manycore accelerators such as GPUs are also becoming popular in embedded domains, as are heterogeneous CPU cores. However, since an embedded GPU has far fewer cores than a server GPU, it is important to utilize both heterogeneous multi-core CPUs and GPUs to achieve the desired throughput with minimal energy consumption. In this paper, we present a case study of mapping LBP-based face detection onto a recent CPU-GPU heterogeneous embedded platform, exploiting both task parallelism and data parallelism to achieve maximal energy efficiency under a real-time constraint. We first present the parallelization technique of each task for GPU execution. We then propose performance and energy models for both task-parallel and data-parallel executions on heterogeneous processors, which are used in design space exploration for the optimal mapping. The design space is huge, since it covers not only processor heterogeneity such as CPU-GPU and big.LITTLE but also various data partitioning ratios for data-parallel execution on these heterogeneous processors. In our case study of LBP face detection on Exynos 5422, the estimation errors of the proposed performance and energy models were on average -2.19% and -3.67%, respectively. By systematically finding the optimal mappings with the proposed models, we achieved 28.6% less energy consumption compared to the manual mapping, while still meeting the real-time constraint.

Distributed Video Decoding on Hadoop

IEICE Transactions on Information and Systems, 2018

Video analytics is usually time-consuming, as it not only requires video decoding as a first step but also usually applies complex computer vision and machine learning algorithms to the decoded frames. To achieve high efficiency in video analytics with ever-increasing frame sizes, much research has been conducted on distributed video processing using Hadoop. However, most approaches focus on processing multiple video files on multiple nodes. Such approaches require a number of video files to achieve any speedup, and can easily result in load imbalance when the video files are reasonably long, since each video file is itself processed sequentially. In contrast, we propose a distributed video decoding method with an extended FFmpeg and VideoRecordReader, by which a single large video file can be processed in parallel across multiple nodes in Hadoop. The experimental results show that case studies of a face detection system and a SURF system achieve speedups of 40.6 times and 29.1 times, respectively, on a four-node cluster with 12 mappers in each node, showing good scalability.

Acceleration of Word2vec Using GPUs

Neural Information Processing, 2016

Word2vec is a widely used word embedding toolkit that generates word vectors by training on an input corpus. Since word vectors can represent an exponential number of word clusters and enable reasoning about words with simple algebraic operations, they have become a widely used representation for subsequent NLP tasks. In this paper, we present an efficient parallelization of word2vec using GPUs that preserves accuracy. With two K20 GPUs, the proposed acceleration technique achieves 1.7M words/sec, which corresponds to about a 20× speedup compared to single-threaded CPU execution.

Virtual synchronization technique with OS modeling for fast and time-accurate cosimulation

First IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)

Hardware/software cosimulation is a key process for shortening the design turnaround time. We have proposed a novel technique, called virtual synchronization, for fast and time-accurate cosimulation that involves interacting component simulators. In this paper, we further extend the virtual synchronization technique with OS modeling for the case where multiple software tasks are executed under the supervision of a real-time operating system. The OS modeler models the RTOS overheads of context switching and tick interrupt handling as well as preemption behavior. While keeping the timing error at an acceptable level of below a few percent, we could reduce the simulation time drastically compared with existing conservative approaches by removing the need for time synchronization between simulators. This is confirmed by a preliminary experiment with a multimedia example that consists of four real-life tasks.

Embedded software generation from system level specification for multi-tasking embedded systems

Proceedings of the ASP-DAC 2005. Asia and South Pacific Design Automation Conference, 2005.

In this paper, we present a new design flow in which embedded software code is generated from a system-level specification of a multi-tasking embedded system, both for simulation and for implementation. The generated software has a layered structure that uses virtual OS APIs and OS wrapper implementations to make it reconfigurable for multiple target platforms. The implementation of the OS wrapper is explained in detail. With a DivX player example, we show some experimental results comparing the real-time performance of two different platforms.

Fast PCA-based face recognition on GPUs

2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013

Face recognition is very important in many applications, including surveillance, biometrics, and other domains. Fast face recognition is required if one wants to train or test on more images or to increase the resolution of the input image for better recognition accuracy. Meanwhile, Graphics Processing Units (GPUs) have become widely available, offering the opportunity for real-time face recognition even for larger sets of high-resolution images. In this paper, we explore the design space of parallelizing a PCA (Principal Component Analysis) based face recognition algorithm and propose a fast face recognizer on GPUs that exploits the fine-grained data parallelism found in the algorithm. We accelerated the three major tasks by 120-fold, 70-fold, and 110-fold, respectively, compared to a sequential C implementation. In the end-to-end comparison, our CUDA face recognizer achieved a 30-fold speedup.

A cycle-level parallel simulation technique exploiting both space and time parallelism

2012 23rd IEEE International Symposium on Rapid System Prototyping (RSP), 2012

Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU Architectures

Lecture Notes in Computer Science, 2012

General-purpose GPU (GPGPU) programming has spread rapidly since CUDA was first introduced for writing parallel programs in high-level languages for NVIDIA GPUs. While a GPU exploits data parallelism very effectively, task-level parallelism is exploited by a multi-threaded program on a multicore CPU. For such a heterogeneous platform, consisting of a multicore CPU and a GPU, we propose an automatic code synthesis framework that takes a process network model specification as input and generates multithreaded CUDA code. With the model-based specification, one can explicitly specify both function-level and loop-level parallelism in an application and explore a wide design space in mapping function blocks and selecting the communication methods between CPU and GPU. The proposed technique is complementary to other high-level methods of CUDA programming. We have confirmed the viability of our approach with several examples.

Towards Real-time CNN Inference from a Video Stream on a Mobile GPU (WiP Paper)

The 21st ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools for Embedded Systems, 2020

While there are several frameworks for CNN inference on mobile GPUs, they do not achieve real-tim... more While there are several frameworks for CNN inference on mobile GPUs, they do not achieve real-time processing for the most of the CNNs that aim at reasonable accuracy since they all employ kernel-by-kernel execution model and do not effectively support INT8 quantization yet. In this paper, we reveal that mobile GPUs suffer from large kernel launch overhead unlike server GPUs, and then propose an on-device deep learning inference framework that can achieve real-time inference of CNNs on mobile GPUs by removing kernel launch overhead and by effectively exploiting INT8 quantization. We have evaluated the proposed framework with a state-of-the-art CNN based face detector (RetinaFace), and observed up to 2.01X of speedup compared to ARM Compute Library (ACL) on a commodity smartphone.

CPU-GPU 이기종 플랫폼에서 하둡 맵리듀스의 가속: CKY 파서 사례 분석 CPU-GPU 이기종 플랫폼에서 하둡 맵리듀스의 가속: CKY 파서 사례 분석 (Accelerating Hadoop MapReduce on CPU-GPU Heterogeneous Platforms: A Case Study with CKY Parser)

These days, big data computing is prevalent and Hadoop MapReduce framework is widely used for its... more These days, big data computing is prevalent and Hadoop MapReduce framework is widely used for its simple programming model. On the other hand, General-Purpose Graphics Processing Unit (GPGPU) has become very popular and various domains of applications have been successfully accelerated using GPUs. In this paper, we propose a method to use GPU within Hadoop MapReduce framework. Then, we propose a static partitioning method that considers different capability of CPU mappers and GPU mappers, and a dynamic scheduling method that deals with a dynamic input size. Compared to a single CPU execution time, the CKY parser on a 14-node Hadoop cluster with 12 CPU cores and 1 GPU per node achieves 245 times speedup. Compared to the execution time on a 14-node Hadoop cluster with 12 CPU cores and no GPU per node, it also achieves 2.5 times speedup. Our proposed approach for both CPU and GPU mapper execution leads to an additional speedup, resulting in total of 2.8 times speedup.

CUDA를 이용한 PCA 기반 얼굴인식의 가속 (Acceleration of PCA based Face Recognition using CUDA)

Face recognition is important in many applications including surveillance, biometrics, and other ... more Face recognition is important in many applications including surveillance, biometrics, and other domains and fast face recognition is required if she wants to train and test more images or to increase the resolution of an input image for better accuracy in recognition. Meanwhile, Graphics Processing Units (GPUs) have become widely available, offering the opportunity for real-time face recognition even for larger set of images with high resolution. In this paper, we explore the design space of parallelizing a PCA (Principal Components Analysis) based face recognition algorithm and propose a fast face recognizer on GPUs by exploiting the fine-grained data-parallelism found in the face recognition algorithm. Our best results with the CUDA face recognizer show over 40-fold speedups compared to a sequential C implementation.

Understanding and bridging the gaps in current GNN performance optimizations

Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

Graph Neural Network (GNN) has recently drawn a rapid increase of interest in many domains for it... more Graph Neural Network (GNN) has recently drawn a rapid increase of interest in many domains for its effectiveness in learning over graphs. Maximizing its performance is essential for many tasks, but remains preliminarily understood. In this work, we provide an in-depth examination of the state-of-the-art GNN frameworks, revealing five major gaps in the current frameworks in optimizing GNN performance, especially in handling the special complexities of GNN over traditional graph or DNN operations. Based on the insights, we put together a set of optimizations to fill the gaps. These optimizations leverage the state-of-the-art GPU optimization techniques and tailor them to the special properties of GNN. Experimental results show that these optimizations achieve 1.37×-15.5× performance improvement over the state-of-the-art frameworks on various GNN models.

Performance Evaluation of INT8 Quantized Inference on Mobile GPUs

IEEE Access, 2021

During the past several years, the need for on-device deep learning has rapidly increased, and th... more During the past several years, the need for on-device deep learning has rapidly increased, and the performance of mobile GPUs has significantly increased. As a viable approach for efficient on-device deep learning, INT8 quantized inference has been actively studied and proposed but there are currently few frameworks that support INT8 quantization for mobile GPUs. This paper presents a unified framework that integrates various INT8 quantization methods, such as symmetric, asymmetric, per-layer, and per-channel, and discusses their impact on accuracy and efficiency on recent mobile GPUs. Moreover, we discuss the performance and accuracy of INT8 quantized Winograd convolution and propose INT8 Winograd convolution with F(2 × 2, 3 × 3), where weight tensors are quantized in INT4 and input tensors are quantized in INT6. We evaluated the performance of INT8 methods, including INT8 Winograd, for ResNet50, MobileNet-v1, and VGG16 on Mali G52, G72, and G76 GPUs on Odroid N2, Galaxy S9, and Galaxy Note 10+, respectively. INT8 quantized inference based on General Matrix Multiplication (GEMM) was 1.67× faster than FP32 GEMM for ResNet50 on Mali G52, and was further accelerated by batch normalization folding and by the proposed INT8 Winograd convolution, achieving 2.45× speedup in total with an accuracy loss of only 0.31%p. INDEX TERMS On-device deep learning, INT8 quantization, INT8 Winograd convolution, mobile GPU.

BPNet: Branch-pruned Conditional Neural Network for Systematic Time-accuracy Tradeoff

2020 57th ACM/IEEE Design Automation Conference (DAC), 2020

Recently, there have been attempts to execute the neural network conditionally with auxiliary cla... more Recently, there have been attempts to execute the neural network conditionally with auxiliary classifiers allowing early termination depending on the difficulty of the input, which can reduce the execution time or energy consumption without any or with negligible accuracy decrease. However, previous studies do not consider how many or where the auxiliary classifiers, or branches, should be added in a systematic fashion. In this paper, we propose Branch-pruned Conditional Neural Network (BPNet) and its methodology in which the time-accuracy tradeoff for the conditional neural network can be found systematically. We applied BPNet to SqueezeNet, ResNet-20, and VGG-16 with CIFAR-10 and 100. BPNet achieves up to 3.15× of speedup without any accuracy drop compared to the base networks.

NNSim: Fast Performance Estimation Based on Sampled Simulation of GPGPU Kernels for Neural Networks

2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018

Existent GPU simulators are too slow to use for neural networks implemented in GPUs. For fast per... more Existent GPU simulators are too slow to use for neural networks implemented in GPUs. For fast performance estimation, we propose a novel hybrid method of analytical performance modeling and sampled simulation of GPUs. By taking full advantage of repeated computation of neural networks, three sampling techniques are devised: Inter-Kernel sampling, Intra-Kernel sampling, and Streaming Multiprocessor sampling. The key technique is to estimate the average IPC through sampled simulation, considering the effect of the warp scheduler and memory access contention. Compared with GPGPU-Sim, the proposed technique reduces the simulation time by up to 450 times with less than 5.0% of accuracy loss.

Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

Pipelining is an important programming pattern, but GPUs, designed mostly for data-level parallel execution, lack an efficient mechanism to support pipeline programming and execution. This paper provides a systematic examination of various existing pipeline execution models on GPUs, and analyzes their strengths and weaknesses. To address their shortcomings, this paper then proposes three new execution models with much improved controllability, including a hybrid model that combines the strengths of all of them. These insights ultimately lead to the development of a software programming framework named VersaPipe. With VersaPipe, users only need to write the operations for each pipeline stage. VersaPipe then automatically assembles the stages into a hybrid execution model and configures it to achieve the best performance. Experiments on a set of pipeline benchmarks and a real-world face detection application show that VersaPipe produces up to 6.90× (2.88× on average) speedups over the original manual implementations. CCS CONCEPTS • General and reference → Performance; • Computing methodologies → Parallel computing methodologies; • Computer systems organization → Heterogeneous (hybrid) systems;

A fully data parallel WFST-based large vocabulary continuous speech recognition on a graphics processing unit

Interspeech 2009, 2009

Tremendous compute throughput is becoming available in personal desktop and laptop systems through the use of graphics processing units (GPUs). However, exploiting this resource requires re-architecting an application to fit a data parallel programming model. The complex graph traversal routines in the inference process for large vocabulary continuous speech recognition (LVCSR) have been considered by many as unsuitable for extensive parallelization. We explore and demonstrate a fully data parallel implementation of a speech inference engine on NVIDIA’s GTX280 GPU. Our implementation consists of two phases: a compute-intensive observation probability computation phase and a communication-intensive graph traversal phase. We take advantage of dynamic elimination of redundant computation in the compute-intensive phase while maintaining close-to-peak execution efficiency. We also demonstrate the importance of exploring application-level trade-offs in the communication-intensive graph traversal phase to adapt the algorithm to data parallel execution on GPUs. On a 3.1-hour speech data set, we achieve more than 11× speedup compared to a highly optimized sequential implementation on an Intel Core i7 without sacrificing accuracy.

Exploiting Activation Sparsity for Fast CNN Inference on Mobile GPUs

ACM Transactions on Embedded Computing Systems, 2021

Over the past several years, the need for on-device deep learning has been rapidly increasing, and efficient CNN inference on mobile platforms has been actively researched. Sparsity exploitation has been one of the most active research themes, but the studies mostly focus on weight sparsity by weight pruning. Activation sparsity, in contrast, requires compression at runtime for every input tensor. Hence, the research on activation sparsity mainly targets NPUs that can efficiently process this with their own hardware logic. In this paper, we observe that it is difficult to accelerate CNN inference on mobile GPUs with natural activation sparsity and that the widely used CSR-based sparse convolution is not sufficiently effective due to the compression overhead. We propose several novel sparsification methods that can boost activation sparsity without harming accuracy. In particular, we selectively sparsify some layers with an extremely high sparsity and adopt sparse convolution or ...

GOPipe

Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 2019

Recent studies have shown promising performance benefits of pipelined stencil applications. An important factor for the computing efficiency of such pipelines is the granularity of a task. We present GOPipe, the first granularity-oblivious programming framework for efficient pipelined stencil executions. With GOPipe, programmers no longer need to specify the appropriate task granularity. GOPipe automatically finds it, and schedules tasks of that granularity while observing all inter-task and inter-stage data dependencies. In our experiments on four real-life applications, GOPipe outperforms the state of the art by up to 4.57× with much better programming productivity.

Scheduling of Deep Learning Applications Onto Heterogeneous Processors in an Embedded Device

IEEE Access, 2020

As the need for on-device machine learning increases, embedded devices tend to be equipped with heterogeneous processors that include a multi-core CPU, a GPU, and/or a DNN accelerator called a Neural Processing Unit (NPU). Scheduling multiple deep learning (DL) applications on such embedded devices poses several technical challenges. First, a task can be mapped onto a single core or any number of available cores, so we need to consider the various possible configurations of CPU cores. Second, embedded devices usually apply Dynamic Voltage and Frequency Scaling (DVFS) to reduce energy consumption at run-time, so we need to consider the effect of DVFS when profiling task execution times. Third, to avoid overheating, it is recommended to limit the core utilization. Lastly, if the hot-plugging option is turned on, some cores will be shut down at run-time when core utilization is not high enough. In this paper, we propose a scheduling technique based on a Genetic Algorithm to run DL applications on heterogeneous processors, considering all of these issues. First, we aim to optimize the throughput of a single deep learning application. Next, we aim to find the Pareto-optimal scheduling of multiple DL applications in terms of the response time of each DL application and the overall energy consumption under the given throughput constraints of the DL applications. The proposed technique is verified with real DL networks running on two embedded devices, Galaxy S9 and HiKey970. INDEX TERMS Deep learning scheduling, genetic algorithm, heterogeneous processor, mobile device.

Real-Time and Energy-Efficient Face Detection on CPU-GPU Heterogeneous Embedded Platforms

IEICE Transactions on Information and Systems, 2018

As energy efficiency has become a major design constraint or objective, heterogeneous manycore architectures have emerged as mainstream target platforms not only in server systems but also in embedded systems. Manycore accelerators such as GPUs are also becoming popular in embedded domains, as are heterogeneous CPU cores. However, as the number of cores in an embedded GPU is far smaller than that of a server GPU, it is important to utilize both heterogeneous multi-core CPUs and GPUs to achieve the desired throughput with minimal energy consumption. In this paper, we present a case study of mapping LBP-based face detection onto a recent CPU-GPU heterogeneous embedded platform, which exploits both task parallelism and data parallelism to achieve maximal energy efficiency under a real-time constraint. We first present the parallelization technique of each task for GPU execution, then we propose performance and energy models for both task-parallel and data-parallel executions on heterogeneous processors, which are used in design space exploration for the optimal mapping. The design space is huge, since it covers not only processor heterogeneity such as CPU-GPU and big.LITTLE, but also various data partitioning ratios for data-parallel execution on these heterogeneous processors. In our case study of LBP face detection on Exynos 5422, the estimation errors of the proposed performance and energy models were on average -2.19% and -3.67%, respectively. By systematically finding the optimal mappings with the proposed models, we could achieve 28.6% less energy consumption compared to the manual mapping, while still meeting the real-time constraint.

Distributed Video Decoding on Hadoop

IEICE Transactions on Information and Systems, 2018

Video analytics is usually time-consuming, as it not only requires video decoding as a first step but also usually applies complex computer vision and machine learning algorithms to the decoded frames. To achieve high efficiency in video analytics with ever-increasing frame sizes, much research has been conducted on distributed video processing using Hadoop. However, most approaches focus on processing multiple video files on multiple nodes. Such approaches require a number of video files to achieve any speedup, and can easily result in load imbalance when the video files are reasonably long, since each video file is itself processed sequentially. In contrast, we propose a distributed video decoding method with an extended FFmpeg and VideoRecordReader, by which a single large video file can be processed in parallel across multiple nodes in Hadoop. The experimental results show that case studies of a face detection system and a SURF system achieve speedups of 40.6 times and 29.1 times, respectively, on a four-node cluster with 12 mappers in each node, showing good scalability.

Acceleration of Word2vec Using GPUs

Neural Information Processing, 2016

Word2vec is a widely used word embedding toolkit which generates word vectors by training on an input corpus. Since word vectors can represent an exponential number of word clusters and enable reasoning about words with simple algebraic operations, they have become a widely used representation for subsequent NLP tasks. In this paper, we present an efficient parallelization of word2vec using GPUs that preserves the accuracy. With two K20 GPUs, the proposed acceleration technique achieves 1.7M words/sec, which corresponds to about a 20× speedup compared to a single-threaded CPU execution.

Virtual synchronization technique with OS modeling for fast and time-accurate cosimulation

First IEEE/ACM/IFIP International Conference on Hardware/ Software Codesign and Systems Synthesis (IEEE Cat. No.03TH8721)

Hardware/software cosimulation is a key process for shortening the design turnaround time. We have proposed a novel technique, called virtual synchronization, for fast and time-accurate cosimulation that involves interacting component simulators. In this paper, we further extend the virtual synchronization technique with OS modeling for the case where multiple software tasks are executed under the supervision of a real-time operating system. The OS modeler models the RTOS overheads of context switching and tick interrupt handling as well as preemption behavior. While keeping the timing error at an acceptable level of below a few percent, we could reduce the simulation time drastically compared with the existing conservative approach by removing the need for time synchronization between simulators. This is confirmed by a preliminary experiment with a multimedia example that consists of four real-life tasks.

Embedded software generation from system level specification for multi-tasking embedded systems

Proceedings of the ASP-DAC 2005. Asia and South Pacific Design Automation Conference, 2005.

In this paper we present a new design flow in which embedded software code is generated from a system-level specification of a multi-tasking embedded system, both for simulation and implementation. The generated software has a layered structure using virtual OS APIs and OS wrapper implementations to make it reconfigurable for multiple target platforms. Implementation of the OS wrapper is explained in detail. With a DivX player example, we show some experimental results comparing the real-time performance of two different platforms.

Fast PCA-based face recognition on GPUs

2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013

Face recognition is very important in many applications including surveillance, biometrics, and other domains. Fast face recognition is required if one wants to train or test on more images or to increase the resolution of an input image for better recognition accuracy. Meanwhile, Graphics Processing Units (GPUs) have become widely available, offering the opportunity for real-time face recognition even for larger sets of high-resolution images. In this paper, we explore the design space of parallelizing a PCA (Principal Component Analysis) based face recognition algorithm and propose a fast face recognizer on GPUs that exploits the fine-grained data parallelism found in the face recognition algorithm. We successfully accelerated the three major tasks by 120-fold, 70-fold, and 110-fold, compared to a sequential C implementation. For the end-to-end comparison, our CUDA face recognizer achieved a 30-fold speedup.

A cycle-level parallel simulation technique exploiting both space and time parallelism

2012 23rd IEEE International Symposium on Rapid System Prototyping (RSP), 2012

Automatic CUDA Code Synthesis Framework for Multicore CPU and GPU Architectures

Lecture Notes in Computer Science, 2012

General-purpose GPU (GPGPU) programming has spread rapidly since CUDA was first introduced for writing parallel programs in high-level languages for NVIDIA GPUs. While a GPU exploits data parallelism very effectively, task-level parallelism is exploited as a multi-threaded program on a multicore CPU. For such a heterogeneous platform that consists of a multicore CPU and a GPU, we propose in this paper an automatic code synthesis framework that takes a process network model specification as input and generates multithreaded CUDA code. With the model-based specification, one can explicitly specify both function-level and loop-level parallelism in an application and explore a wide design space in mapping function blocks and selecting the communication methods between CPU and GPU. The proposed technique is complementary to other high-level methods of CUDA programming. We have confirmed the viability of our approach with several examples.