Papers by Nectarios Koziris
2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Proceedings of the VLDB Endowment
The ever-increasing demand for high-performance Big Data analytics and data processing has paved the way for heterogeneous hardware accelerators, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), to be integrated into modern Big Data platforms. Currently, this integration comes at the cost of programmability, since end-user Application Programming Interfaces (APIs) must be altered to access the underlying heterogeneous hardware. For example, current Big Data frameworks, such as Apache Spark, provide a new API that combines the existing Spark programming model with GPUs. For other Big Data frameworks, such as Flink, the integration of GPUs and FPGAs is achieved via external API calls that bypass their execution models completely. In this paper, we rethink current Big Data frameworks from a systems and programming language perspective, and introduce a novel co-designed approach for integrating hardware acceleration into their execution models. The...

2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS)
Despite their proliferation as a dominant computing paradigm, cloud computing systems lack effective mechanisms to manage their vast amounts of resources efficiently. Resources are stranded and fragmented, ultimately limiting cloud systems' applicability to large classes of critical applications that pose non-moderate resource demands. Eliminating current technological barriers to actual fluidity and scalability of cloud resources is essential to strengthen cloud computing's role as a critical cornerstone for the digital economy. ACTiCLOUD proposes a novel cloud architecture that breaks the existing scale-up and share-nothing barriers and enables the holistic management of physical resources both at the local cloud site and at distributed levels. Specifically, it advances the cloud resource management stack by extending state-of-the-art hypervisor technology beyond the physical server boundary and the localized cloud management system, to provide holistic resource management within a rack, within a site, and across distributed cloud sites. On top of this, ACTiCLOUD will adapt and optimize system libraries and runtimes (e.g., the JVM) as well as ACTiCLOUD-native applications: extremely demanding, critical classes of applications that currently face severe difficulties in matching their resource requirements to state-of-the-art cloud offerings.

2020 IEEE International Conference on Big Data (Big Data), 2020
In this paper we present the SELIS Big Data Analytics and Machine Learning System (BDA), an open-source, cloud-enabled, elastic system designed and implemented to address data-related issues from the logistics domain. Taking into consideration real-life data analytics needs from more than 40 EU logistics providers, we present the detailed SELIS BDA architecture along with the generic data and execution model devised to accommodate their diverse needs. We describe the main technologies we have utilized to realize the respective offering and justify our choices from the wider open-source Big Data systems community. We experimentally test our offering under various workloads and show that it can scale to serve a large number of concurrent requests, while its abstraction/orchestration layer poses a very small overhead compared to the stand-alone Big Data systems. We believe that the SELIS BDA can be an easy-to-use entry point into the big data analytics world for any logistics company, especially SMEs.

2018 IEEE International Conference on Cloud Computing Technology and Science (CloudCom), 2018
The use of accelerators in computing facilities that employ heterogeneity to achieve higher performance has become prominent in the past years. Accelerators lie at the heart of modern data centers and computing facilities, powering the majority of the top ten supercomputers in the world. They are essential to the modern computing landscape, implementing custom architectures to provide efficient, scalable processing power targeting a wide range of scientific domains. In this paper, we address the challenge of making accelerator resources remotely accessible. We present RACCEX, a middleware framework that enables efficient Remote ACCelerator EXecution. For our proof of concept, we target the Intel Xeon Phi coprocessor. Our proposed solution allows users of lightweight nodes to offload applications remotely onto Xeon Phi accelerators. RACCEX intercepts SCIF transport-layer calls and forwards them to a remote server equipped with one or more accelerators. Preliminary evaluation of our prototype exhibits promising results, with RACCEX keeping the virtualization latency overhead within 10% of native execution for large messages.

Can a stable matching that achieves high equity among the two sides of a market be reached in quadratic time? The Deferred Acceptance (DA) algorithm finds a stable matching that is biased in favor of one side; optimizing apt equity measures is strongly NP-hard. A proposed approximation algorithm offers a guarantee only with respect to the DA solutions. Recent work introduced Deferred Acceptance with Compensation Chains (DACC), a class of algorithms that can reach any stable matching in O(n^4) time, but did not propose a way to achieve good equity. In this paper, we propose an alternative that is computationally simpler and achieves high equity too. We introduce Monotonic Deferred Acceptance (MDA), a class of algorithms that progresses monotonically towards a stable matching; we couple MDA with a mechanism we call Strongly Deferred Acceptance (SDA), to build an algorithm that reaches an equitable stable matching in quadratic time; we amend this algorithm with a few low-cost local sea...

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID), 2017
In this work, we propose AURA, a cloud deployment tool used to deploy applications over providers that tend to present transient failures. The complexity of modern cloud environments imparts an error-prone behavior on the deployment phase of an application, something that hinders automation and magnifies costs both in terms of time and money. To overcome this challenge, AURA formulates an application deployment as a Directed Acyclic Graph traversal and re-executes the parts of the graph that failed. AURA can execute any deployment script that updates filesystem-related resources in an idempotent manner, through the adoption of a layered filesystem technique. In our demonstration, we allow users to describe, deploy and monitor applications through a comprehensive UI and showcase AURA's ability to overcome transient failures, even in the most unstable environments.
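The DAG-traversal-with-retry idea described above can be sketched as follows. This is a hypothetical simplified model (the task names and retry policy are invented); the real AURA additionally relies on a layered filesystem to keep deployment scripts idempotent across re-executions:

```python
# Sketch: deployment as a DAG traversal that re-executes only failed parts.
def deploy(dag, run_task, max_retries=3):
    """dag maps each task to the set of tasks it depends on.
    run_task(task) returns True on success, False on transient failure."""
    done = set()
    attempts = {t: 0 for t in dag}
    while len(done) < len(dag):
        # Tasks whose dependencies are all satisfied and are not yet done.
        ready = [t for t, deps in dag.items() if t not in done and deps <= done]
        if not ready:
            raise ValueError("dependency cycle in deployment graph")
        for task in ready:
            if run_task(task):
                done.add(task)
            else:
                attempts[task] += 1           # transient failure: retry later
                if attempts[task] >= max_retries:
                    raise RuntimeError(f"{task} failed {max_retries} times")
    return done

# A flaky provider: "db" fails once, then succeeds on re-execution.
fails = {"db": 1}
def run(task):
    if fails.get(task, 0) > 0:
        fails[task] -= 1
        return False
    return True

dag = {"vm": set(), "db": {"vm"}, "app": {"vm", "db"}}
print(sorted(deploy(dag, run)))  # ['app', 'db', 'vm']
```

Only the failed vertex is retried; already-completed parts of the graph are never re-run, which is what makes idempotence of each script important.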

We present IReS, the Intelligent Resource Scheduler, which is able to abstractly describe, optimize and execute any batch analytics workflow with respect to a multi-objective policy. Relying on cost and performance models of the required tasks over the available platforms, IReS allocates distinct workflow parts to the most advantageous execution and/or storage engine among the available ones and decides on the exact amount of resources provisioned. Moreover, IReS efficiently adapts to the current cluster/engine conditions and recovers from failures by effectively monitoring the workflow execution in real time. Our current prototype has been tested on a plethora of business-driven and synthetic workflows, proving its potential to yield significant gains in cost and performance compared to statically scheduled, single-engine executions. IReS incurs only marginal overhead to the workflow execution performance, managing to discover an approximate Pareto-optimal set of execution plans w...

Big data analytics have become a necessity for businesses worldwide. The complexity of the tasks they execute is ever increasing due to the surge in data and task heterogeneity. Current analytics platforms, while successful in harnessing multiple aspects of this "data deluge", bind their efficacy to a single data and compute model and often depend on proprietary systems. However, no single execution engine is suitable for all types of computation and no single data store is suitable for all types of data. To this end, we present and demonstrate a platform that designs, optimizes, plans and executes complex analytics workflows over multiple engines. Our system enables users to create workflows of variable detail concerning the execution semantics, depending on their level of expertise and interest. The workflows are then analysed in order to determine missing execution semantics. Through the modelling of the cost and performance of the required tasks over the available platforms, the syst...
2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020

Proceedings of the AAAI Conference on Artificial Intelligence, 2020
Given a two-sided market where each agent ranks those on the other side by preference, the stable marriage problem calls for finding a perfect matching such that no pair of agents prefer each other to their matches. Recent studies show that the number of stable solutions can be large in practice. Yet the classical solution to the problem, the Gale-Shapley (GS) algorithm, assigns an optimal match to each agent on one side and a pessimal one to each on the other side; such a solution may fare well in terms of equity only in highly asymmetric markets. Finding a stable matching that minimizes the sex equality cost, an equity measure expressing the discrepancy of mean happiness among the two sides, is strongly NP-hard. Extant heuristics either (a) oblige some agents to involuntarily abandon their matches, or (b) bias the outcome in favor of some agents, or (c) need high-polynomial or unbounded time. We provide the first procedurally fair algorithms that output equitable stable marriages ...

ACM Transactions on Mathematical Software, 2018
The Sparse Matrix-Vector Multiplication (SpMV) kernel ranks among the most important and thoroughly studied linear algebra operations, as it lies at the heart of many iterative methods for the solution of sparse linear systems, and often constitutes a severe performance bottleneck. Its optimization, which is intimately associated with the data structures used to store the sparse matrix, has always been of particular interest to the applied mathematics and computer science communities and has attracted further attention since the advent of multicore architectures. In this article, we present SparseX, an open source software package for SpMV targeting multicore platforms, that employs the state-of-the-art Compressed Sparse eXtended (CSX) sparse matrix storage format to deliver high efficiency through a highly usable “BLAS-like” interface that requires limited or no tuning. Performance results indicate that our library achieves superior performance over competitive libraries on large-s...
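As a point of reference for the kernel the library optimizes, a minimal SpMV over the common Compressed Sparse Row (CSR) format can be sketched as follows. This is the baseline format, not the CSX format SparseX actually uses:

```python
# Minimal CSR sparse matrix-vector multiply: y = A @ x.
# values  : nonzero entries, row by row
# col_idx : column index of each nonzero
# row_ptr : row_ptr[i]..row_ptr[i+1] delimit row i's nonzeros
def csr_spmv(values, col_idx, row_ptr, x):
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

# The 3x3 matrix [[4, 0, 1], [0, 2, 0], [3, 0, 5]] in CSR form:
values  = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(csr_spmv(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [5.0, 2.0, 8.0]
```

The irregular, indirect access to `x` through `col_idx` is what makes SpMV memory-bound and motivates compressed formats such as CSX.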

2016 IEEE International Conference on Big Data (Big Data), 2016
Current platforms fail to efficiently cope with the data and task heterogeneity of modern analytics workflows due to their adhesion to a single data and/or compute model. As a remedy, we present IReS, the Intelligent Resource Scheduler for complex analytics workflows executed over multi-engine environments. IReS is able to optimize a workflow with respect to a user-defined policy relying on cost and performance models of the required tasks over the available platforms. This optimization consists in allocating distinct workflow parts to the most advantageous execution and/or storage engine among the available ones and deciding on the exact amount of resources provisioned. Our current prototype supports 5 compute and 3 data engines, yet new ones can effortlessly be added to IReS by virtue of its engine-agnostic mechanisms. Our extensive experimental evaluation confirms that IReS speeds up diverse and realistic workflows by up to 30% compared to their optimal single-engine plan by automatically scattering parts of them to different execution engines and datastores. Its optimizer incurs only marginal overhead to the workflow execution performance, managing to discover the optimal execution plan within a few seconds, even for large-scale workflow instances.
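The per-task, cost-model-driven engine selection described above can be illustrated with a hypothetical minimal sketch. All names and model values here are invented for illustration; the actual IReS optimizer is considerably more elaborate (it also handles data placement and resource sizing):

```python
# Sketch: choose the most advantageous engine per workflow task by
# scoring each candidate against a user-defined multi-objective policy.
def pick_engines(workflow, models, policy):
    """workflow : list of task names
    models[task][engine] = {'cost': monetary cost, 'time': seconds}
    policy     : linear weights over the objectives, e.g. cost vs. time."""
    plan = {}
    for task in workflow:
        candidates = models[task]
        def score(engine):
            m = candidates[engine]
            return policy['cost'] * m['cost'] + policy['time'] * m['time']
        plan[task] = min(candidates, key=score)  # lowest weighted score wins
    return plan

# Invented model values for two tasks over two engines:
models = {
    'join':  {'spark': {'cost': 4, 'time': 10}, 'postgres': {'cost': 1, 'time': 30}},
    'train': {'spark': {'cost': 6, 'time': 20}, 'postgres': {'cost': 9, 'time': 90}},
}
plan = pick_engines(['join', 'train'], models, {'cost': 1.0, 'time': 0.1})
print(plan)  # {'join': 'postgres', 'train': 'spark'}
```

With a cost-heavy policy the cheap engine wins the join, while the compute-heavy training task still lands on the faster engine, which is the kind of per-part split the abstract refers to.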

2016 IEEE International Conference on Big Data (Big Data), 2016
Multi-engine analytics has been gaining an increasing amount of attention from both the academic and the industrial community, as it can successfully cope with the heterogeneity and complexity that the plethora of frameworks, technologies and requirements have brought forth. It is now common for a data analyst to combine data that resides on multiple and totally independent engines and perform complex analytics queries. Multi-engine solutions based on SQL can facilitate such efforts, as SQL is a popular standard that the majority of data scientists understand. Existing solutions propose a middleware that centrally optimizes query execution for multiple engines. Yet this approach requires manual integration of every primitive engine operator along with its cost model, making the addition of new operators or engines highly cumbersome. To address this issue we present MuSQLE, a system for SQL-based analytics over multi-engine environments. MuSQLE can efficiently utilize external SQL engines, allowing for both intra- and inter-engine optimizations. Our framework adopts a novel API-based strategy: instead of manual integration, MuSQLE specifies a generic API, used for cost estimation and query execution, that needs to be implemented for each SQL engine endpoint. Our engine API is integrated with a state-of-the-art query optimizer, adding support for location-based, multi-engine query optimization and letting individual runtimes perform sub-query physical optimization. The derived multi-engine plans are executed using the Spark distributed execution framework. Our detailed experimental evaluation, integrating PostgreSQL, MemSQL and SparkSQL under MuSQLE, demonstrates its ability to accurately decide on the most suitable execution engine. MuSQLE can provide speedups of up to one order of magnitude for TPC-H queries, leveraging different engines for the execution of individual query parts.
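The API-based strategy described above can be sketched with a hypothetical minimal engine interface. The class and method names below are our own invention, not MuSQLE's actual API; the point is that the optimizer only needs cost estimation and execution from each endpoint:

```python
# Sketch: a generic per-engine API so the optimizer can compare engines
# without engine-specific integration of operators and cost models.
from abc import ABC, abstractmethod

class SQLEngine(ABC):
    @abstractmethod
    def estimate_cost(self, sql: str) -> float: ...
    @abstractmethod
    def execute(self, sql: str): ...

class ToyEngine(SQLEngine):
    """Stand-in endpoint; a real one would call the engine's planner."""
    def __init__(self, name, cost_factor):
        self.name, self.cost_factor = name, cost_factor
    def estimate_cost(self, sql):
        return self.cost_factor * len(sql)   # toy stand-in for a cost model
    def execute(self, sql):
        return f"{self.name} ran: {sql}"

def cheapest(engines, sql):
    """The optimizer's core decision: route the sub-query to the engine
    that reports the lowest estimated cost."""
    return min(engines, key=lambda e: e.estimate_cost(sql))

engines = [ToyEngine("postgres", 2.0), ToyEngine("spark", 3.5)]
best = cheapest(engines, "SELECT * FROM lineitem")
print(best.name)  # postgres
```

Adding a new engine then means implementing two methods rather than integrating every operator and cost model into a central middleware.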

2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP), 2016
Hardware Transactional Memory (HTM) is nowadays available in several commercial and HPC-targeted processors, and in the future it will likely be available on systems that can accommodate a very large number of threads. Thus, it is essential for the research community to evaluate HTM on as many cores as possible in order to understand the virtues and limitations that come with it. In this paper we utilize HTM to parallelize accesses to a classic data structure, the red-black tree. With minimal programming effort, we implement a red-black tree by enclosing each operation in a single HTM transaction and evaluate it on two servers equipped with Intel Haswell-EP and IBM Power8 processors, supporting a large number of hardware threads, namely 56 and 160 respectively. Our evaluation reveals that applying HTM in such a simplistic manner scales only up to a limited number of hardware threads. To fully utilize the underlying hardware, we apply different optimizations on each platform.

2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), 2015
A stable marriage problem (SMP) of size n involves n men and n women, each of whom has ordered the members of the opposite gender by descending preferability. A solution is a perfect matching among men and women such that there exists no pair who prefer each other to their current spouses. The problem was formulated in 1962 by Gale and Shapley, who showed that any instance can be solved in polynomial time, and has attracted interest due to its application to any two-sided market. Still, the solution obtained by the Gale-Shapley algorithm is favorable to one side. Gusfield and Irving introduced the equitable stable marriage problem (ESMP), which calls for finding a stable matching that minimizes the distance between men's and women's sums of rankings of their spouses. Unfortunately, ESMP is strongly NP-hard; approximation algorithms for it are impractical, while even the proposed heuristics may run for an unpredictable number of iterations. We propose a novel, deterministic approach that treats both genders equally, while eschewing an exhaustive exploration of the space of all stable matchings. Our thorough experimental study shows that, in contrast to previous proposals, our method not only achieves high-quality solutions, but also terminates efficiently and repeatably on all tested large problem instances.
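For reference, the one-side-favoring Gale-Shapley algorithm that these papers improve upon, together with the sum-of-rankings equity measure, can be sketched as follows. This is an illustration of the classical baseline, not the paper's equitable method:

```python
# Classic man-proposing Gale-Shapley: stable, but optimal for proposers
# and pessimal for the other side.
def gale_shapley(men_pref, women_pref):
    """Preferences are lists of partner indices, best first.
    Returns wife[m] for each man m."""
    n = len(men_pref)
    # rank[w][m] = position of man m in woman w's list (lower is better)
    rank = [[0] * n for _ in range(n)]
    for w in range(n):
        for pos, m in enumerate(women_pref[w]):
            rank[w][m] = pos
    next_prop = [0] * n          # next choice each man will propose to
    husband = [None] * n
    free = list(range(n))
    while free:
        m = free.pop()
        w = men_pref[m][next_prop[m]]
        next_prop[m] += 1
        if husband[w] is None:
            husband[w] = m
        elif rank[w][m] < rank[w][husband[w]]:
            free.append(husband[w])          # w trades up; old husband is free
            husband[w] = m
        else:
            free.append(m)                   # w rejects m
    wife = [None] * n
    for w, m in enumerate(husband):
        wife[m] = w
    return wife

def sex_equality_cost(wife, men_pref, women_pref):
    """Discrepancy between the two sides' sums of partner rankings."""
    mc = sum(men_pref[m].index(wife[m]) for m in range(len(wife)))
    wc = sum(women_pref[wife[m]].index(m) for m in range(len(wife)))
    return abs(mc - wc)

men_pref   = [[0, 1, 2], [1, 0, 2], [0, 1, 2]]
women_pref = [[1, 0, 2], [0, 1, 2], [0, 1, 2]]
wife = gale_shapley(men_pref, women_pref)
print(wife, sex_equality_cost(wife, men_pref, women_pref))  # [0, 1, 2] 2
```

Even on this tiny instance the matching leaves the women worse off in aggregate than the men, which is exactly the asymmetry the equitable variants target.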

2015 IEEE International Conference on Big Data (Big Data), 2015
Among the privacy-preserving approaches known in the literature, k-anonymity remains the basis of more advanced models while still being useful as a stand-alone solution. Applying k-anonymity in practice, though, incurs severe loss of data utility, thus limiting its effectiveness and reliability in real-life applications and systems. However, such loss in utility does not necessarily arise from an inherent drawback of the model itself, but rather from the deficiencies of the algorithms used to implement the model. Conventional approaches rely on a methodology that publishes data in homogeneous generalized groups. An alternative modern data publishing scheme focuses on publishing the data in heterogeneous groups and achieves higher utility, while ensuring the same privacy guarantees. As conventional approaches cannot anonymize data following this heterogeneous scheme, innovative solutions are required for this purpose. Following this approach, in this paper we provide a set of algorithms that ensure high-utility k-anonymity by solving an equivalent graph processing problem.
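The conventional homogeneous-generalization baseline that the paper improves upon can be sketched as follows, on an invented toy table of (age, ZIP-prefix) quasi-identifiers. The generalization hierarchy here (widening age buckets) is an assumption for illustration; it also makes the utility loss visible, since everyone's age ends up coarsened by the same amount:

```python
# Conventional homogeneous k-anonymity: coarsen the age attribute until
# every quasi-identifier combination appears at least k times.
from collections import Counter

def generalize_age(age, width):
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def k_anonymize(records, k, widths=(1, 5, 10, 20, 50)):
    """records: list of (age, zip_prefix) tuples.
    Tries progressively wider age buckets; returns the first k-anonymous table."""
    for width in widths:
        table = [(generalize_age(age, width), z) for age, z in records]
        if min(Counter(table).values()) >= k:
            return table
    raise ValueError("cannot reach k-anonymity with the given hierarchy")

records = [(23, "113"), (27, "113"), (24, "113"), (52, "104"), (55, "104")]
print(k_anonymize(records, 2))
```

Here width 10 is the first that makes every group of identical rows have size at least 2; heterogeneous schemes achieve the same guarantee while generalizing less.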
2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021
Abstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems
Traditional data centers include monolithic servers that tightly integrate CPU, memory and disk (Figure 1a). Instead, Disaggregated Systems (DSs) [8, 13, 18, 27] organize multiple compute (CC), memory (MC) and storage devices as independent, failure-isolated components interconnected over a high-bandwidth network (Figure 1b). DSs can greatly reduce data center costs by providing improved resource utilization, resource scaling, failure-handling and elasticity in modern data centers [5,