Papers by George Markomanolis

Lecture Notes in Computer Science, 2022
It is common in the HPC community that the achieved performance with just CPUs is limited for many computational cases. The EuroHPC pre-exascale and the coming exascale systems are mainly focused on accelerators, and some of the largest upcoming supercomputers such as LUMI and Frontier will be powered by AMD Instinct™ accelerators. However, these new systems create many challenges for developers who are not familiar with the new ecosystem or with the required programming models that can be used to program for heterogeneous architectures. In this paper, we present some of the more well-known programming models to program for current and future GPU systems. We then measure the performance of each approach using a benchmark and a mini-app, test with various compilers, and tune the codes where necessary. Finally, we compare the performance, where possible, between the NVIDIA Volta (V100), Ampere (A100) GPUs, and the AMD MI100 GPU.
Zenodo (CERN European Organization for Nuclear Research), Jan 12, 2023
This is the first issue of the journal of High-Performance Storage.

Simulation is a popular approach to obtain objective performance indicators on platforms that are not at one's disposal. It may help the dimensioning of compute clusters in large computing centers. In this work we present a framework for the offline simulation of MPI applications. Its main originality with regard to the literature is to rely on time-independent execution traces. This allows us to completely decouple the acquisition process from the actual replay of the traces in a simulation context. Then we are able to acquire traces for large application instances without being limited to an execution on a single compute cluster. Finally, our framework is built on top of a scalable, fast, and validated simulation kernel. In this paper, we present the used time-independent trace format, investigate several acquisition strategies, detail the developed trace replay tool, and assess the quality of our simulation framework in terms of accuracy, acquisition time, simulation time, and trace size.

Supercomputing Frontiers
It is common in the HPC community that the achieved performance with just CPUs is limited for many computational cases. The EuroHPC pre-exascale and the coming exascale systems are mainly focused on accelerators, and some of the largest upcoming supercomputers such as LUMI and Frontier will be powered by AMD Instinct™ accelerators. However, these new systems create many challenges for developers who are not familiar with the new ecosystem or with the required programming models that can be used to program for heterogeneous architectures. In this paper, we present some of the more well-known programming models to program for current and future GPU systems. We then measure the performance of each approach using a benchmark and a mini-app, test with various compilers, and tune the codes where necessary. Finally, we compare the performance, where possible, between the NVIDIA Volta (V100), Ampere (A100) GPUs, and the AMD MI100 GPU.
cschpc/epmhpcgpu: Dataset and scripts for the paper with title Evaluating Programming Models for the HPC GPU Ecosystem
Dataset and scripts for the paper with title Evaluating Programming Models for the HPC GPU Ecosystem

Performance analysis of an online atmospheric-chemistry global model with Paraver: Identification of scaling limitations
2014 International Conference on High Performance Computing & Simulation (HPCS), 2014
The performance analysis of a parallel application can be a difficult task. Especially when the application is an operational atmospheric-chemistry model, there can be multiple performance bottlenecks caused by different factors. Although the exascale era is coming, applications are not ready to take advantage of all the new technologies and programming models. We need to improve our model in order to simulate higher resolutions and scale more efficiently. In this article we describe the approaches that we follow for the performance analysis of an atmospheric-chemistry global model called the NMMB/BSC chemical transport model and the identification of various bottlenecks using the Paraver tool. We present the differences between some model configurations depending on the usage of extra modules, and we study eight different topics that limit the scalability of the model. These topics include categories that require no code modification, such as mapping and processor affinity, as well as more in-depth analysis with hardware counters and load imbalance issues. The final results show the directions that we should follow in order to improve our model.
The purpose of the current work is to show a beginner how to use some functionalities of the Burst Buffer. The guide includes an introduction to the Burst Buffer, instructions on how to port a SLURM script to DataWarp, and some insights about the MPI I/O aggregators and how to modify the default values. Moreover, it provides a basic methodology for using the Burst Buffer, proposes some useful environment variables, and presents some results with the NAS BT benchmark.
IO500 ISC21 Lists
These are the IO500 lists published during ISC21.
Vi4Io/Io-500-Dev: Zenodo Citation Release
IO-500 benchmark
This is issue 2 of the journal of High-Performance Storage.
IO500 Ranked List ISC19
The IO500 Benchmark publishes twice-annual ranked lists of all submissions. This list was first published at the 2019 ISC conference.
IEEE Transactions on Parallel and Distributed Systems, 2017
This article summarizes our recent work and developments on SMPI, a flexible simulator of MPI applications. In this tool, we took particular care to ensure our simulator could be used to produce fast and accurate predictions in a wide variety of situations. Although we did build SMPI on SimGrid, whose speed and accuracy had already been assessed in other contexts, moving such techniques to an HPC workload required significant additional effort. Obviously, an accurate modeling of communications and network topology was one of the keys to such achievements. Another, less obvious key was the choice to combine in a single tool the possibility to do both offline and online simulation.

Lecture Notes in Computer Science, 2014
Simulation and modeling for performance prediction and profiling is essential for developing and maintaining HPC code that is expected to scale for next-generation exascale systems, and correctly modeling network behavior is essential for creating realistic simulations. In this article we describe an implementation of a flow-based hybrid network model that accounts for factors such as network topology and contention, which are commonly ignored by other approaches. We focus on large-scale, Ethernet-connected systems, as these currently compose 37.8% of the TOP500 index, and this share is expected to increase as higher-speed 10 and 100GbE become more available. The European Mont-Blanc project to study exascale computing by developing prototype systems with low-power embedded devices will also use Ethernet-based interconnect. Our model is implemented within SMPI, an open-source MPI implementation that connects real applications to the SimGrid simulation framework. SMPI provides implementations of collective communications based on current versions of both OpenMPI and MPICH. SMPI and SimGrid also provide methods for easing the simulation of large-scale systems, including shadow execution, memory folding, and support for both online and offline (i.e., post-mortem) simulation. We validate our proposed model by comparing traces produced by SMPI with those from real world experiments, as well as with those obtained using other established network models. Our study shows that SMPI has a consistently better predictive power than classical LogP-based models for a wide range of scenarios including both established HPC benchmarks and real applications.

2012 SC Companion: High Performance Computing, Networking Storage and Analysis, 2012
Simulation is a popular approach to obtain objective performance indicators on platforms that are not at one's disposal. It may help the dimensioning of compute clusters in large computing centers. In a previous work, we proposed a framework for the off-line simulation of MPI applications. Its main originality with regard to the literature is to rely on time-independent execution traces. This allows us to completely decouple the acquisition process from the actual replay of the traces in a simulation context. Then we are able to acquire traces for large application instances without being limited to an execution on a single compute cluster. Finally, our framework is built on top of a scalable, fast, and validated simulation kernel. In this paper, we detail the performance issues that we encountered with the first implementation of our trace replay framework. We propose several modifications to address these issues and analyze their impact. Results show a clear improvement in accuracy and efficiency with regard to the initial implementation.
IO500 Full Ranked List, Supercomputing 2018 (Corrected)
The IO500 Benchmark publishes twice-annual ranked lists of all submissions. This list was first published at the 2018 SC conference but has been corrected to fix a reporting bug in some of the metadata scores.
SParse Optimization Research COde (SPORCO) is an open-source Python package for solving optimization problems with sparsity-inducing regularization, consisting primarily of sparse coding and dictionary learning, for both standard and convolutional forms of sparse representation. In the current version, all optimization problems are solved within the Alternating Direction Method of Multipliers (ADMM) framework. SPORCO was developed for applications in signal and image processing, but is also expected to be useful for problems in computer vision, statistics, and machine learning.

Abstract: In a previous work, we proposed a framework for the off-line simulation of MPI applications. Its main originality with regard to the literature is to rely on time-independent execution traces. Time-independent traces are an original way to estimate the performance of parallel applications. To acquire time-independent traces of the execution of MPI applications, we have to instrument them to log the necessary information. There exist many profiling tools which can instrument an application. In this report we propose a scoring system that corresponds to our framework's specific requirements and evaluate the most well-known open-source profiling tools according to it. Furthermore, we introduce an original tool called Minimal Instrumentation that was designed to fulfill the requirements of our framework. Key-words: MPI, Profiling tools, Traces, Performance Analysis, off-line simulation

Proper modeling of collective communications is essential for understanding the behavior of medium-to-large scale parallel applications, and even minor deviations in implementation can adversely affect the prediction of real-world performance. We propose a hybrid network model extending LogP-based approaches to account for topology and contention in high-speed TCP networks. This model is validated within SMPI, an MPI implementation provided by the SimGrid simulation toolkit. With SMPI, standard MPI applications can be compiled and run in a simulated network environment, and traces can be captured without incurring errors from tracing overheads or poor clock synchronization as in physical experiments. SMPI provides features for simulating applications that require large amounts of time or resources, including selective execution, RAM folding, and off-line replay of execution traces. We validate our model by comparing traces produced by SMPI with those from other simulation platforms, as well as real world environments.