Papers by Jose Nunez-yanez
ArXiv, 2019
We present a flexible and self-contained platform for acoustic levitation research based on the Xilinx Zynq SoC using an array of ultrasonic emitters. The platform employs an inexpensive ZedBoard and provides fast movement of the levitated objects as well as object detection based on the produced echo. Several features available in the Zynq device benefit this platform: hardware acceleration for the phase calculations, a large number of parallel I/Os connected through the FPGA Mezzanine connector (FMC), integrated ADC capabilities to capture echo signals, and ease of programmability due to a C-based design flow for both CPU and FPGA. Planar and spherical-cap phased arrays are created, and we investigate the capabilities and limitations of the different designs to improve the stability of the levitation process.
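The phase calculation mentioned in this abstract can be illustrated with the textbook focusing rule for a phased array: each emitter is given a phase offset that cancels the phase accumulated on its path to a focal point, so all waves arrive in phase there. This is a generic sketch, not the platform's implementation; the 40 kHz frequency, the speed of sound, the emitter coordinates, and the name `focus_phases` are assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air (assumed)
FREQUENCY = 40_000.0    # Hz (assumed; a common ultrasonic emitter frequency)

def focus_phases(emitters, focal_point):
    """Return one phase (radians, wrapped by 2*pi) per emitter for a single focus."""
    wavelength = SPEED_OF_SOUND / FREQUENCY
    k = 2.0 * math.pi / wavelength  # wavenumber
    phases = []
    for position in emitters:
        distance = math.dist(position, focal_point)
        # Cancel the propagation phase k*d so every wave is aligned at the focus.
        phases.append((-k * distance) % (2.0 * math.pi))
    return phases

# Hypothetical two-emitter layout (metres), focus 5 cm above the array centre.
emitters = [(-0.01, 0.0, 0.0), (0.01, 0.0, 0.0)]
phases = focus_phases(emitters, (0.0, 0.0, 0.05))
```

Emitters at equal distance from the focus receive equal phases; moving the focal point changes the phases, which is what moves the levitated object.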
This paper presents a reconfigurable motion estimation processor suitable for high definition video coding. A toolset for the design of a video coding system is presented as well. The presented tools can be used in the design and configuration of the reconfigurable processor itself. They can also be used to design user-defined block-matching motion estimation algorithms. Using the tools, the processor’s design space may be explored in order to find configurations suitable for high definition video. The experiments presented show the effect of modifying the processor configuration on the performance obtained when coding high definition video sequences, and the results indicate that for high definition video, supporting sub-partitioning offers no gain for the increase in complexity.

Thread-Parallel MPEG-2 and MPEG-4 Encoders for Shared-Memory System-On-Chip Multiprocessors
International Journal of Computers and Applications, 2007
This work focuses on speeding up MPEG-2 and MPEG-4 encoding by using thread-parallelism for shared-memory, System-On-Chip multiprocessors. The performance improvement of the MPEG encoders is demonstrated by reducing the dynamic instruction count across multiple processor contexts and then mapping the encoders onto a configurable SoC multiprocessor. The resulting reduction in the dynamic instruction count of the parallelized MPEG-2 TM5 encoder for 32 processor contexts reaches a maximum of 95% and that of the MPEG-4 XViD a maximum of 83% for 16 processor contexts, both compared to the sequential encoder. To realize the parallelized encoders we present a configurable, N-way, extensible, bus-based, cache-coherent SoC multiprocessor, augmented with data-parallel coprocessors, and we give the VLSI implementation for the 2-way and 4-way configurations.
2009 International Conference on Field Programmable Logic and Applications, 2009
This paper presents a reconfigurable processor designed to execute user-defined block-matching motion estimation algorithms, and a toolset for the design of such algorithms and for the configuration of the processor. The toolset enables the exploration of the processor's design space in order to find an optimal configuration depending on the target application. The use of the toolset to test different configurations for different kinds of video sequences is illustrated. Experimental results show the benefits and cost of certain optimizations in the motion estimation process, and that fast block-matching search algorithms can outperform full search algorithms commonly used in hardware implementations. The usefulness of the toolset in exploring the configuration space is also shown.

Integration, the VLSI Journal, 2008
This work presents a detailed case study in customizing a configurable, extensible, 32-bit RISC processor with vector/SIMD instruction extensions for the efficient execution of block-based video-coding algorithms utilizing a proprietary co-design environment. In addition to the default Full-Search motion estimation of the MPEG-2 Test Model 5, fourteen fast ME algorithms were implemented in both scalar and vector form. Results demonstrate a reduction of up to 68% in the dynamic instruction count of the full search-based encoder whereas the fast motion estimation algorithms achieved a reduction in instruction count of nearly 90%, both accelerated via three 128-bit vector/SIMD instructions when compared to the scalar, reference implementation of the standard. We address in detail the profiling, vectorization and the development of these vector instruction set extensions, discuss in depth the implementation of a parametric vector accelerator that implements these instructions and show the introduction of that accelerator into a 32-bit RISC processor pipeline, in a closely-coupled configuration.
A Novel ΔΣ Control System Processor and Its VLSI Implementation
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2008
This paper describes a novel control system processor architecture based on ΔΣ modulation known as the ΔΣ-CSP. The ΔΣ-CSP utilizes 1-bit processing which is a new concept in digital control applications with the direct benefit of making multi-bit multiplication operations redundant. A simple conditional-negate-and-add (CNA) unit is instead used for operations in control law implementations. For this reason,
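The idea of replacing multiplication with a conditional-negate-and-add step can be sketched as follows. When one operand is a 1-bit sample, a product with a coefficient reduces to adding either the coefficient or its negation. This is an illustrative sketch only; the mapping of bit 1 to +1 and bit 0 to -1, and the function name `cna_accumulate`, are assumptions rather than details taken from the paper.

```python
def cna_accumulate(coefficients, bits):
    """Accumulate coefficient * (+1 or -1) using only negation and addition."""
    acc = 0
    for coeff, bit in zip(coefficients, bits):
        # Assumed encoding: bit 1 means +1, bit 0 means -1.
        # No multiplier is needed: conditionally negate, then add.
        acc += coeff if bit else -coeff
    return acc

# Hypothetical coefficients and 1-bit samples: 3 - 5 + 2
result = cna_accumulate([3, 5, 2], [1, 0, 1])
```

The result matches the multiply-accumulate over the equivalent +1/-1 sequence, which is why the multi-bit multiplier becomes redundant.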

ACM Transactions on Embedded Computing Systems, 2018
In this article, we investigate how to utilise a Field-Programmable Gate Array (FPGA) in an embedded system to save energy. For this purpose, we study the energy efficiency of a hybrid FPGA-CPU device that can switch task execution between hardware and software with a focus on periodic tasks. To increase the applicability of this task switching, we also consider the voltage and frequency scaling (VFS) applied to the FPGA to reduce the system energy consumption. We show that in some cases, if the task’s period is longer than a specific threshold, the FPGA accelerator cannot reduce the energy consumption associated with the task and the software version is the most energy efficient option. We have applied the proposed techniques to a robot map creation algorithm as a case study which shows up to 38% energy reduction compared to the FPGA implementation. Overall, experimental results show up to 48% energy reduction by applying the proposed techniques at runtime on 13 individual tasks.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2012

IEEE Transactions on Computers, 2005
This paper presents a practical realisation in hardware of the concepts of variable order Markov modelling using multi-symbol alphabets and arithmetic coding for lossless compression of universal data. This type of statistical coding algorithm has long been regarded as being able to deliver very high compression ratios close to the information content of the source data. However, the high computational complexity of these algorithms has limited their practical application in embedded environments such as in mobile computing and wireless communications. In this work a hardware amenable algorithm named PPMH and based on these principles has been developed and its architecture and implementation detailed. This novel lossless compression core offers innovative solutions to the computational issues in both stages of modelling and coding and delivers high compression efficiency and throughput. The configurability features of the core allow efficient use of the embedded SRAM present in modern FPGA technologies where memory resources range from a few kilobits to several megabits per device family. The core has been targeted to the Altera Stratix FPGA family and performance, coding efficiency and complexity measured for different memory configurations. Index Terms: Markov modelling, statistical compression, lossless compression, arithmetic coding.
Ultrasonic Levitation with Software-Defined FPGAs and Electronically Phased Arrays
2019 NASA/ESA Conference on Adaptive Hardware and Systems (AHS)
Sparse and dense matrix multiplication hardware for heterogeneous multi-precision neural networks
Array
Workload Partitioning Strategy for Improved Parallelism on FPGA-CPU Heterogeneous Chips
2018 28th International Conference on Field Programmable Logic and Applications (FPL)

2018 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Fully binarised convolutional neural networks (CNNs) deliver very high inference performance using single-bit weights and activations, together with XNOR type operators for the kernel convolutions. Current research shows that full binarisation results in a degradation of accuracy and different approaches to tackle this issue are being investigated such as using more complex models as accuracy reduces. This paper proposes an alternative based on a multi-precision CNN framework that combines a binarised and a floating point CNN in a pipeline configuration deployed on heterogeneous hardware. The binarised CNN is mapped onto an FPGA device and used to perform inference over the whole input set while the floating point network is mapped onto a CPU device and performs re-inference only when the classification confidence level is low. A lightweight confidence mechanism enables a flexible trade-off between accuracy and throughput. To demonstrate the concept, we choose a Zynq 7020 device as the hardware target and show that the multi-precision network is able to increase the BNN accuracy from 78.5% to 82.5% and the CPU inference speed from 29.68 to 90.82 images/sec.
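The pipeline described in this abstract, where a fast network classifies every input and a slower network re-classifies only low-confidence cases, can be sketched as a simple confidence-gated cascade. This is a minimal illustration, not the paper's confidence mechanism; the toy models, the threshold value, and the name `cascade_predict` are all hypothetical.

```python
def cascade_predict(x, fast_model, slow_model, threshold):
    """Use the fast model's label unless its confidence falls below threshold."""
    label, confidence = fast_model(x)
    if confidence >= threshold:
        return label, "fast"
    # Low confidence: fall back to the slower, more accurate model.
    label, _ = slow_model(x)
    return label, "slow"

# Hypothetical stand-ins for the binarised (fast) and floating point (slow) networks.
def toy_fast(x):
    return ("cat", 0.9) if x > 0 else ("dog", 0.55)

def toy_slow(x):
    return ("bird", 0.99)

confident = cascade_predict(1, toy_fast, toy_slow, threshold=0.8)
uncertain = cascade_predict(-1, toy_fast, toy_slow, threshold=0.8)
```

Raising the threshold sends more inputs to the slow model, trading throughput for accuracy; lowering it does the opposite.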

2017 25th European Signal Processing Conference (EUSIPCO)
This paper proposes an adaptive vibration signal compression scheme composed of a lifting discrete wavelet transform (LDWT) with set-partitioning embedded blocks (SPECK) that efficiently sorts the wavelet coefficients by significance. The output of the SPECK module is input to an optimized context-based arithmetic coder that generates the compressed bitstream. The algorithm is deployed as part of a reliable and effective health monitoring technology for machines and civil constructions (e.g. power generation systems). This application area relies on the collection of large quantities of high quality vibration sensor data that need to be compressed before storage and transmission. Experimental results indicate that the proposed method outperforms state-of-the-art coders, while retaining the characteristics in the compressed vibration signals to ensure accurate event analysis. For the same quality level, up to 59.41% bitrate reduction is achieved by the proposed method compared to JPEG2000.

Energies
This work proposes a methodology to find performance and energy trade-offs for parallel applications running on Heterogeneous Multi-Processing systems with a single instruction-set architecture. These offer flexibility in the form of different core types and voltage and frequency pairings, defining a vast design space to explore. Therefore, for a given application, choosing a configuration that optimizes the performance and energy consumption is not straightforward. Our method proposes novel analytical models for performance and power consumption whose parameters can be fitted using only a few strategically sampled offline measurements. These models are then used to estimate an application’s performance and energy consumption for the whole configuration space. In turn, these offline predictions define the choice of estimated Pareto-optimal configurations of the model, which are used to inform the selection of the configuration that the application should be executed on. The methodol...
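Selecting Pareto-optimal configurations from predicted performance and energy values can be sketched with a simple dominance filter. This is a generic illustration, not the paper's models; the configuration names, the numbers, and the function name `pareto_front` are hypothetical, and both execution time and energy are assumed to be minimised.

```python
def pareto_front(configs):
    """configs: list of (name, time, energy). Return names not dominated by any other."""
    front = []
    for name, time, energy in configs:
        # A config is dominated if another is no worse in both metrics
        # and strictly better in at least one.
        dominated = any(
            t2 <= time and e2 <= energy and (t2 < time or e2 < energy)
            for _, t2, e2 in configs
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical predictions: (configuration, seconds, joules)
predicted = [("A", 1.0, 5.0), ("B", 2.0, 3.0), ("C", 2.5, 3.5), ("D", 4.0, 1.0)]
best = pareto_front(predicted)
```

Here configuration C is discarded because B is both faster and more energy efficient, while A, B and D each represent a different trade-off.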

IEEE Access
Estimation of Remaining Useful Life (RUL) is a crucial task in Prognostics and Health Management (PHM) for condition-based maintenance of machinery. In order to transmit and store the sensor data for archiving and long term analysis, data compression techniques are regularly used to reduce the requirements of bandwidth, energy and storage in modern remote PHM systems. In these systems the challenge arises of how the compressed sensor data affects the RUL estimation algorithms. A main drawback of conventional statistical modeling approaches is that they require expert prior knowledge and a significant number of assumptions. Alternative regression based approaches and deep neural networks are known to have issues when modeling long-term dependencies in the sequential data. Recently Long Short-Term Memory (LSTM) neural networks have been proposed to overcome these issues and in this paper we create an LSTM network and data fusion approach that can estimate the RUL with compressed (distorted) data. The experimental results indicate that the proposed method is able to estimate RUL reliably with narrower error bands compared to other state-of-the-art approaches. Moreover, the proposed method is able to predict RUL from both the raw and compressed datasets with comparable accuracy. Index Terms: machine health monitoring, remaining useful life (RUL), long short-term memory, recurrent neural network, data compression
Exploring Heterogeneous Scheduling for Edge Computing with CPU and FPGA MPSoCs
Journal of Systems Architecture
Intra- and inter-core power modelling for single-ISA heterogeneous processors
International Journal of Embedded Systems
The Journal of Supercomputing
Heterogeneous chips that combine CPUs and FPGAs can distribute processing so that the algorithm tasks are mapped onto the most suitable processing element. New software-defined high-level design environments for these chips use general purpose languages such as C++ and OpenCL for hardware and interface generation without the need for register transfer language expertise. These advances in hardware compilers have resulted in significant increases in FPGA design productivity. In this paper, we investigate how to enhance an existing software-defined framework

Mechanical Systems and Signal Processing
Anomaly detection is a crucial task in Prognostics and Condition Monitoring (PCM) of machinery. In modern remote PCM systems, data compression techniques are regularly used to reduce the need for bandwidth and storage. In these systems the challenge arises of how the compressed (distorted) vibration data affects the condition monitoring algorithms. This paper introduces a novel algorithm that can adaptively establish normal bounds of operation from continuous noisy vibration profiles working with compressed vibration data. The proposed technique is based on four modules, including feature extraction, feature fusion, extreme value vibration modeling and adaptive thresholding for anomaly detection. The proposed method has been validated with experiments using three time-series datasets. The experimental results indicate that the proposed algorithm is able to perform detection of malfunctions in rotating machines effectively without faulty reference data. Moreover, the proposed method is able to produce accurate early warning and alarm indications from both the raw and compressed (distorted) datasets with equal veracity.