Papers by Dimitrios Soudris
2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No.00CH36353)
Recent advances in electronic technology integration, coupled with increasing needs for more services in portable communications, favor the development of high-performance dual-mode terminals. We present the complete architectural implementation of the GMSK/GFSK modulator/demodulator, including the FIR filter design. The main features of the modulator/demodulator and the architectural implementation of the FIR filters are described. The interface with ASPIS processor and

arXiv (Cornell University), Mar 11, 2022
Printed electronics (PE) feature low non-recurring engineering costs and low per-unit-area fabrication costs, thus enabling extremely low-cost and on-demand hardware. Such low-cost fabrication allows for high customization that would be infeasible in silicon, and bespoke architectures prevail to improve the efficiency of emerging PE machine learning (ML) applications. However, even with bespoke architectures, the large feature sizes in PE constrain the complexity of the ML models that can be implemented. In this work, we bring together, for the first time, approximate computing and PE design, aiming to enable complex ML models, such as Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs), in PE. To this end, we propose and implement a cross-layer approximation tailored for bespoke ML architectures. At the algorithmic level we apply a hardware-driven coefficient approximation of the ML model, and at the circuit level we apply netlist pruning through a full-search exploration. In our extensive experimental evaluation we consider 14 MLPs and SVMs and evaluate more than 4300 approximate and exact designs. Our results demonstrate that our cross approximation delivers Pareto-optimal designs that, compared to the state-of-the-art exact designs, feature 47% and 44% average area and power reduction, respectively, and less than 1% accuracy loss.
Electronics, Dec 23, 2021
... Alienor Richard, Dragomir Milojevic, Frederic Robert Bio Electro and Mechanical Systems, Université Libre de Bruxelles, Belgium ... Journal of Signal Processing Systems, 57(2):139-156, Nov 2009. [9] Y. Moullec, J.-P. Diguet, N. Ben Amor, T. Gourdeaux, and J.-L. Philippe. ...
Journal of Real-time Image Processing, Oct 7, 2008

Diabetes & Metabolism, Mar 1, 2014
Introduction: Telemedicine is expanding into every area of healthcare, including the healing of chronic wounds. Diabetic foot wounds and venous ulcers account for a large share of these wounds. Materials and methods: This poster presents a preliminary description of the European project FP7-ICT-2011-8, which aims to develop a care system for chronic wounds called SWAN-iCare. Results: From September 2012 to September 2016, a consortium of clinicians and scientists will develop a new system for managing chronic wounds: the SWAN-iCare system, dedicated mainly to diabetic foot wounds and venous ulcers. The device will rely on negative pressure therapy that can be delivered at the patient's home, coupled with remote monitoring of healing based on (i) sensors located near the wound that measure certain physico-chemical characteristics of the exudate affecting healing (pH, protease levels…), and (ii) measurement of plantar pressure for diabetic foot wounds or of dorsiflexion for venous ulcers, or monitoring of wound oxygenation or capillary blood glucose. The aim is to offer, for so-called "hard-to-heal" wounds, a monitoring tool for the wound-healing center in charge of the patient, in order to improve the overall management of these wounds of uncertain prognosis. The results presented here outline the project, within the limits imposed by intellectual property. Conclusion: In the near future, we can expect negative pressure therapy for chronic wounds to develop further, combined with remote monitoring systems that rely on sensors for various characteristics of the wounds concerned.

Ledger
In the past several years, there has been an increased usage of smart, always-connected devices at the edge of the network, which provide real-time contextual information with low overhead to optimize processes and improve how companies and individuals interact, work, and live. The efficient management of this huge pool of devices requires runtime monitoring to identify potential performance bottlenecks and physical defects. Typical solutions, where monitoring data are aggregated in a centralized manner, soon become inefficient, as they are unable to handle the increased load and become single points of failure. In addition, the resource-constrained nature of edge devices calls for low-overhead monitoring systems. In this paper, we propose HLF-Kubed, a blockchain-based, highly available framework for monitoring edge devices, leveraging distributed ledger technology. HLF-Kubed builds upon Kubernetes container orchestrator and HyperLedger Fabric frameworks and implements a smart co...

Metals
The quality control of discretely manufactured parts typically involves defect recognition activities, which are time-consuming, repetitive tasks that must be performed by highly trained and/or experienced personnel. However, in the context of the fourth industrial revolution, the pertinent goal is to automate such procedures in order to improve their accuracy and consistency, while at the same time enabling their application in near real-time. In this light, the present paper examines the applicability of popular deep neural network types, which are widely employed for object detection tasks, in recognizing surface defects of parts that are produced through a die-casting process. The data used to train the networks belong to two different datasets consisting of images that contain various types of surface defects and for two different types of parts. The first dataset is freely available and concerns pump impellers, while the second dataset has been created during the present study...

Proceedings of the 2023 ACM SIGPLAN International Symposium on Memory Management
Two-dimensional rectangular bin packing (2DBP) is a known abstraction of dynamic storage allocation (DSA). We argue that such abstractions can aid practical purposes. 2DBP algorithms optimize their placements' makespan, i.e., the size of the used address range. Demand paging-enabled virtual memory systems render makespan irrelevant: allocators commonly employ sparse addressing and need worry only about fragmentation caused within page boundaries. But in the embedded domain, where portions of memory are statically pre-allocated, makespan remains a reasonable metric. Recent work has shown that viewing allocators as black-box 2DBP solvers bears meaning. There exists a 2DBP-based fragmentation metric which often correlates monotonically with maximum resident set size (RSS). Given the field's indeterminacy with respect to fragmentation definitions, as well as the immense value of physical memory savings, we are motivated to set allocator-generated placements against their 2DBP-devised, makespan-optimizing counterparts. Of course, allocators must operate online while 2DBP algorithms work on complete request traces; but since both sides aim for minimum memory wastage, the idea of studying their relationship preserves its intellectual, and practical, interest.
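To make the 2DBP view of DSA concrete, the following Python sketch treats each allocation as a rectangle spanning a lifetime on the time axis and an address range on the memory axis, then computes the makespan of a placement and the peak live load that any valid placement must accommodate. The `Placement` type, the function names, and the trace layout are illustrative assumptions and are not taken from the paper; in particular, this is not the paper's fragmentation metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Placement:
    """One allocation viewed as a rectangle (hypothetical layout)."""
    start: int   # time step at which the block is allocated (inclusive)
    end: int     # time step at which the block is freed (exclusive)
    size: int    # block size in bytes
    offset: int  # address offset chosen by the allocator


def makespan(placements):
    """Size of the used address range: the highest end address of any block."""
    return max((p.offset + p.size for p in placements), default=0)


def max_load(placements):
    """Peak number of simultaneously live bytes, a lower bound on makespan."""
    events = []
    for p in placements:
        events.append((p.start, p.size))
        events.append((p.end, -p.size))
    # At equal times, negative deltas sort first, so frees precede allocations.
    events.sort()
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


def overlaps(a, b):
    """True if two rectangles intersect in both time and address space."""
    in_time = a.start < b.end and b.start < a.end
    in_addr = a.offset < b.offset + b.size and b.offset < a.offset + a.size
    return in_time and in_addr


def is_valid(placements):
    """A placement is valid when no two live blocks share an address."""
    return not any(
        overlaps(a, b)
        for i, a in enumerate(placements)
        for b in placements[i + 1:]
    )
```

Comparing `makespan` against `max_load` for the same trace gives a rough sense of how much address space a placement spends beyond what the live data strictly requires.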

Proceedings of the 7th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures and the 5th Workshop on Design Tools and Architectures For Multicore Embedded Computing Platforms
Modern data analytics applications exhibit scale-out characteristics, requiring large amounts of computational power. Recent research has shown that modern manycore architectures form a promising platform solution for this emerging type of workload. In this paper, we present a framework for the deployment, monitoring and automated exploration of Hadoop MapReduce clusters implementing data analytics applications on the Intel SCC manycore platform. We provide an in-depth analysis of the performance and energy characteristics of Hadoop MapReduce workloads on the Intel SCC, i.e., on a real-silicon manycore that differs markedly from typical server and accelerator architectures. Through extensive experimentation, we show that there is a trade-off between the number of worker nodes and the per-node available I/O bandwidth, and that intelligently scaling the frequency of data-nodes yields energy savings with minimal impact on performance.

2019 29th International Conference on Field Programmable Logic and Applications (FPL)
Process variability is a challenging fabrication issue impacting, mainly, the reliability and performance of chips. Variability is already present in current technology nodes and is expected to become even more significant in the future. In this work, we focus on the study of performance variation in 16nm FinFET FPGAs. We devise a comprehensive assessment methodology based on multiple programmable sensors with diverse resource and delay characteristics. Additionally, we consider various voltage and temperature conditions and decouple variability into systematic and stochastic components. The experimental results on Zynq XCZU7EV show up to 7.3% intra-die variation, increasing to 9.9% for certain operating conditions. Our approach demonstrates that logic and interconnect resources present different variability, slightly uncorrelated, which highlights the necessity of, and the way towards, more sophisticated mitigation methods/tools.

2022 Design, Automation & Test in Europe Conference & Exhibition (DATE)
Printed electronics (PE) feature low non-recurring engineering costs and low per-unit-area fabrication costs, thus enabling extremely low-cost and on-demand hardware. Such low-cost fabrication allows for high customization that would be infeasible in silicon, and bespoke architectures prevail to improve the efficiency of emerging PE machine learning (ML) applications. However, even with bespoke architectures, the large feature sizes in PE constrain the complexity of the ML models that can be implemented. In this work, we bring together, for the first time, approximate computing and PE design, aiming to enable complex ML models, such as Multi-Layer Perceptrons (MLPs) and Support Vector Machines (SVMs), in PE. To this end, we propose and implement a cross-layer approximation tailored for bespoke ML architectures. At the algorithmic level we apply a hardware-driven coefficient approximation of the ML model, and at the circuit level we apply netlist pruning through a full-search exploration. In our extensive experimental evaluation we consider 14 MLPs and SVMs and evaluate more than 4300 approximate and exact designs. Our results demonstrate that our cross approximation delivers Pareto-optimal designs that, compared to the state-of-the-art exact designs, feature 47% and 44% average area and power reduction, respectively, and less than 1% accuracy loss.

Proceedings of the 2018 on Great Lakes Symposium on VLSI, 2018
As technology nodes scale down and process variability increases, vendors impose even more conservative guard-bands to prevent potential malfunction of their microchips. However, this approach leaves considerable performance unexploited in individual chips, which can be harvested by developing novel customization tools. In the current work, we focus on the exploitation of process variability in modern FPGA chips to provide more energy-efficient solutions. We propose a framework that i) generates variability maps characterizing the energy efficiency of commercial chips and ii) combines voltage and frequency scaling to limit the power dissipation of any given design for a given set of performance constraints. Experimental results on Zynq XC7Z020 28nm FPGAs show that the developed framework achieves up to 28.3% power reduction while maintaining the performance and functional integrity of realistic benchmarks. Moreover, by selecting the most efficient chip, we achieve up to 5.1% additional power savings.

Proceedings of the 23th International Workshop on Software and Compilers for Embedded Systems, 2020
Lately, cloud computing has seen explosive growth, due to the flexibility and scalability it offers. The ever-increasing computational demands, especially from the machine learning domain, have forced cloud operators to enhance their infrastructure with acceleration devices, such as General-Purpose (GP)GPUs or FPGAs. Even though multi-tenancy has been widely examined for conventional CPUs, this is not the case for accelerators. Current solutions support "one accelerator per user" schemes, which can lead to both under-utilization and starvation of available resources. In this work, we analyze the potential of GPU sharing inside data-center environments. We investigate how several architectural features affect the performance of GPUs under different multi-tenant stressing scenarios. We compare CUDA MPS with the native, default CUDA scheduler and also with Vinetalk, a research framework providing GPU sharing capabilities. Experimental results show that NVIDIA's MPS achieves the best performance in multi-application scenarios, specifically up to 4.5x and 11.2x compared to the native CUDA scheduler and Vinetalk, respectively.

2021 10th International Conference on Modern Circuits and Systems Technologies (MOCAST), 2021
Modern mobile communication systems utilize increased bandwidth to provide advanced network performance and connectivity, all while their most computationally intensive functions must be accelerated within the limited power envelope of embedded devices. In this paper, we reduce the circuit complexity and improve the throughput of a key digital function in the baseband processing chain, namely high-order QAM demodulation. In particular, we explore 4 different demodulation algorithms, we employ both floating- and fixed-point arithmetic, and we insert approximations in the arithmetic units. In terms of accuracy of our most prominent implementations, i.e., for 64-QAM, our designs deliver BER values ranging from 10^-1 to 10^-4 for SNR of 0 to 14 dB. In terms of FPGA resources on Xilinx ZCU106, these 64-QAM designs achieve up to 98% reduction in LUT utilization compared to the accurate floating-point model of the same algorithm, and up to 122% increase in operating frequency. When targeting demodulation with high levels of accuracy, i.e., almost zero BER degradation with respect to that of the original floating-point model, the prevailing solution is the Approximate LLR algorithm configured with fixed-point arithmetic and 8-bit truncation, providing 81% decrease in LUTs and 13% increase in frequency to sustain a throughput of 323 Msamples/second.
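As a rough illustration of the kind of computation involved, the Python sketch below implements the textbook max-log (approximate) LLR for a Gray-coded square QAM constellation, together with a simple truncation helper that mimics fixed-point inputs. It is a floating-point reference sketch, not the paper's hardware design: the constellation labeling, the function names, the sign convention (positive LLR favors bit 0), and the choice to truncate fractional bits of the received sample are all assumptions made for this example.

```python
import math


def gray_pam_levels(bits_per_axis):
    """Map each Gray-coded bit pattern to an odd-integer PAM level."""
    n = 1 << bits_per_axis
    levels = {}
    for i in range(n):
        gray = i ^ (i >> 1)
        levels[gray] = 2 * i - (n - 1)  # -(n-1), ..., -1, +1, ..., +(n-1)
    return levels


def qam_constellation(bits_per_symbol):
    """Build a square QAM constellation as {bit label: complex point}."""
    half = bits_per_symbol // 2
    pam = gray_pam_levels(half)
    points = {}
    for gray_i, level_i in pam.items():
        for gray_q, level_q in pam.items():
            points[(gray_i << half) | gray_q] = complex(level_i, level_q)
    return points


def approx_llr(y, points, bits_per_symbol, noise_var):
    """Max-log LLR per bit, MSB first: (d1 - d0) / noise_var.

    d0 and d1 are the squared distances from y to the nearest point whose
    bit is 0 or 1, so a positive value favors bit 0 (assumed convention).
    """
    llrs = []
    for k in range(bits_per_symbol - 1, -1, -1):
        d0 = min(abs(y - s) ** 2 for lab, s in points.items() if not (lab >> k) & 1)
        d1 = min(abs(y - s) ** 2 for lab, s in points.items() if (lab >> k) & 1)
        llrs.append((d1 - d0) / noise_var)
    return llrs


def truncate_fixed_point(x, frac_bits):
    """Drop everything below 2**-frac_bits, rounding toward minus infinity."""
    scale = 1 << frac_bits
    return math.floor(x * scale) / scale


def hard_decisions(llrs):
    """Recover the bit label from LLR signs (negative means bit 1)."""
    label = 0
    for llr in llrs:
        label = (label << 1) | (1 if llr < 0 else 0)
    return label
```

A quick sanity check is to pass each constellation point through `approx_llr` and confirm that `hard_decisions` returns its own label; sweeping `frac_bits` in `truncate_fixed_point` then shows how coarser inputs perturb the soft values.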