Papers by Diego R. Llanos
The Journal of Supercomputing, Apr 15, 2024
There are many works devoted to improving the matrix product computation, as it is used in a wide variety of scientific applications arising from many different fields. In this work, we propose alternative data distribution policies and communication patterns to reduce the elapsed time when computing triangular matrix products in distributed-memory environments. In particular, we focus on commodity clusters, where the number of nodes is limited, proposing alternatives to traditional approaches in order to improve the performance of this operation. Our proposal outperforms state-of-the-art libraries such as ScaLAPACK and SLATE, offering execution times that are up to 30% faster. Inmaculada Santamaria-Valenzuela and Rocío Carratalá-Sáez contributed equally to this work.
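As a point of reference for the operation discussed above, the following C sketch computes the triangular product C = A * B for a lower-triangular A on a single node. It is only a naive serial baseline showing where the work savings of triangularity come from; it is not the distributed data distribution policies or communication patterns proposed in the paper.

/* Naive serial baseline for C = A * B with A lower triangular
 * (row-major, n x n). Illustrative only; not the paper's algorithm. */
void trmm_lower(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double acc = 0.0;
            /* A[i][k] is zero for k > i, so the reduction stops at k = i */
            for (int k = 0; k <= i; k++)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}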

Abstract: The use of high-performance hardware accelerators, such as graphics processing units (GPUs), has been steadily increasing in supercomputing systems. This trend is easily visible in the list of machines ranked by the TOP500 classification. Programming this kind of device is a costly task that requires deep knowledge of the architecture of each accelerator. This difficulty grows when the goal is to exploit the different hardware resources of a device efficiently. This work proposes a programming model that allows communication and computation tasks to be overlapped on GPU devices, thereby improving application performance. Our experimental study shows that this model transparently hides the communication latencies when there is enough communication load, obtaining performance improvements of up to 61.10% compared with our synchronous implementation. Keywords: High-performance computing, Concurrency, CUDA, GP-GPU, Parallel programming models, Task overlapping.
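The paper above describes a model that overlaps GPU communication and computation; a common way to obtain this kind of overlap from C host code is CUDA streams. The sketch below only illustrates that general idea under the assumption of pinned host buffers; process_chunk_on_gpu is a hypothetical wrapper around the actual kernel launch and is not part of the CUDA API or of the model proposed in the paper.

/* Illustrative C host code: each chunk's copy-in, compute, and copy-out
 * are queued on their own CUDA stream, so the transfers of one chunk can
 * overlap the computation of another. h_in and h_out are assumed to be
 * pinned (allocated with cudaMallocHost); otherwise the async copies
 * will not overlap. */
#include <cuda_runtime.h>

#define CHUNKS 4

/* Hypothetical wrapper around the real kernel launch (not a CUDA API call). */
void process_chunk_on_gpu(float *d_buf, size_t n, cudaStream_t s);

void pipeline(const float *h_in, float *h_out, float *d_buf, size_t chunk)
{
    cudaStream_t stream[CHUNKS];
    for (int i = 0; i < CHUNKS; i++)
        cudaStreamCreate(&stream[i]);

    for (int i = 0; i < CHUNKS; i++) {
        size_t off = (size_t)i * chunk;
        cudaMemcpyAsync(d_buf + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        process_chunk_on_gpu(d_buf + off, chunk, stream[i]);
        cudaMemcpyAsync(h_out + off, d_buf + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < CHUNKS; i++)
        cudaStreamDestroy(stream[i]);
}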

Transactional Memory (TM) is a technique that aims to mitigate the performance losses that are inherent to the serialization of accesses in critical sections. Some studies have shown that the use of TM may lead to performance improvements, despite the existence of management overheads. However, the relative performance of TM with respect to classical critical-section management depends greatly on the actual percentage of times that the same data is handled simultaneously by two transactions. In this paper, we compare the relative performance of the critical sections provided by OpenMP with respect to two Software Transactional Memory (STM) implementations. These three methods are used to manage concurrent data accesses in ATLaS, a software-based Thread-Level Speculation (TLS) system. The complexity of this application makes it extremely difficult to predict whether two transactions may conflict or not, and how many times the transactions will be executed. Our experimental results show that the STM solutions only deliver a performance comparable to OpenMP when there are almost no conflicts. In any other case, their performance losses make OpenMP the best alternative to manage critical sections.
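To make the two styles being compared concrete, here is a minimal C sketch of the same shared update protected first with an OpenMP critical section and then with a transactional block. It assumes GCC's -fgnu-tm transactional-memory support rather than the STM libraries evaluated in the paper, and it is not the ATLaS code.

#include <omp.h>

#define N 1000000
static long histogram[256];

/* Every update is serialized, even when threads touch different bins. */
void update_with_critical(const unsigned char *data)
{
    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        #pragma omp critical
        histogram[data[i]]++;
    }
}

/* Conflicting transactions are detected and retried; non-conflicting
 * updates on different bins can proceed concurrently.
 * Compile with -fopenmp -fgnu-tm (GCC). */
void update_with_tm(const unsigned char *data)
{
    #pragma omp parallel for
    for (long i = 0; i < N; i++) {
        __transaction_atomic {
            histogram[data[i]]++;
        }
    }
}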
Abstract: OpenACC has existed for some years now and, during that time, a number of compilers have appeared both in academia and in industry. Given the novelty of the OpenACC standard, as well as the continuous development of the existing compilers, a benchmark suite specifically created to analyze the behavior of the code generated by these compilers on different machines becomes particularly useful. In this article we present TORMENT OpenACC, a benchmark suite prepared to be compiled by different compilers that offers a summary of the obtained results. Alongside this tool, we have also developed a metric suitable for scoring compiler-machine pairs, which we have named the TORMENT ACC Score.

Currently, the generation of parallel codes which are portable to different kinds of parallel computers is a challenge. Many approaches have been proposed during the last years, following two different paths: programming from scratch using new programming languages and models that deal with parallelism explicitly, or automatically generating parallel codes from already existing sequential programs. Using the current main-trend parallel languages, the programmer deals with mapping and optimization details, which forces them to take into account details of the execution platform to obtain good performance. In code generators from sequential programs, programmers cannot control basic mapping decisions, and many times the programmer needs to transform the code to expose to the compiler the information needed to leverage important optimizations. This paper presents a new high-level parallel programming language named CMAPS, designed to be used with the Trasgo parallel programming framework. This language provides a simple and explicit way to express parallelism at a highly abstract level. The programmer does not face decisions about granularity, thread management, or interprocess communication. Thus, the programmer can express different parallel paradigms in an easy, unified, abstract, and portable form. The language supports the features required by transformation models such as Trasgo to generate parallel codes that adapt their communication and synchronization structures to target machines composed of mixed distributed- and shared-memory parallel multicomputers.

Parallel Computing, Nov 1, 2017
Current High Performance Computing (HPC) systems are typically built as interconnected clusters of shared-memory multicore computers. Several techniques to automatically generate parallel programs from high-level parallel languages or sequential codes have been proposed. To properly exploit the scalability of HPC clusters, these techniques should take into account the combination of data communication across distributed memory and the exploitation of shared-memory models. In this paper, we present a new communication calculation technique to be applied across different SPMD (Single Program Multiple Data) code blocks containing several uniform data access expressions. We have implemented this technique in Trasgo, a programming model and compilation framework that transforms parallel programs from a high-level parallel specification that deals with parallelism in a unified, abstract, and portable way. The proposed technique computes at runtime exact coarse-grained communications for distributed message-passing processes. Applying this technique at runtime has the advantage of being independent of compile-time decisions, such as the tile size chosen for each process. Our approach allows the automatic generation of pre-compiled multi-level parallel routines, libraries, or programs that can adapt their communication, synchronization, and optimization structures to the target system, even when computing nodes have different capabilities. Our experimental results show that, despite our runtime calculation, our approach can automatically produce efficient programs compared with MPI reference codes and with codes generated with auto-parallelizing compilers.
The threadblock size and shape choice is one of the most important user decisions when a parallel problem is coded to run on GPU architectures. In fact, threadblock configuration has a significant...
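Since the entry above concerns how the threadblock configuration shapes the execution, the following small host-side C program is a trivial illustration of the size/shape trade-off: for a fixed 2-D problem it prints the grid that each candidate block shape induces. The problem size and the candidate shapes are made up for the example; the snippet is not taken from the paper.

#include <stdio.h>

int main(void)
{
    int width = 4096, height = 4096;                     /* example 2-D problem size */
    int shapes[][2] = { {32, 32}, {128, 8}, {256, 4} };  /* candidate block shapes */

    for (int i = 0; i < 3; i++) {
        int bx = shapes[i][0], by = shapes[i][1];
        int gx = (width  + bx - 1) / bx;                 /* blocks along x */
        int gy = (height + by - 1) / by;                 /* blocks along y */
        printf("block %3dx%-3d -> grid %4dx%-4d (%d threads per block)\n",
               bx, by, gx, gy, bx * by);
    }
    return 0;
}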

Dataflow programming consists in developing a program by describing its sequential stages and the interactions between them. The runtimes supporting this kind of programming are responsible for exploiting the parallelism by concurrently executing the different stages when their dependencies have been met. In this paper we introduce a new parallel programming model and framework based on the dataflow paradigm. Its features are: it is a unique one-tier model that supports hybrid shared- and distributed-memory systems; it can express activities arbitrarily linked, including cycles; it uses a distributed work-stealing mechanism to allow Multiple-Producer/Multiple-Consumer configurations; and it has a run-time mechanism for the reconfiguration of the dependences network which also allows the creation of task-to-task affinities. We present an evaluation using examples of different classes of applications. Experimental results show that programs generated using this framework deliver good performance, and that the new abstractions introduce minimal overheads.

Intel Xeon Phi accelerators are one of the newest devices used in the field of parallel computing. However, there are comparatively few studies concerning their performance when using most of the existing parallelization techniques. One of them is thread-level speculation, a technique that optimistically tries to extract parallelism from loops without the need for a compile-time analysis that guarantees that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a software, state-of-the-art thread-level speculative parallelization library in the execution of well-known benchmarks. Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the low computing power of its computational units when specific vectorization and SIMD instructions are not exploited indicates that further development of new specific techniq...
OpenACC is a parallel programming model for automatic parallelization of sequential code using compiler directives or pragmas. OpenACC is intended to be used with accelerators such as GPUs and Xeon Phi. The different implementations of the standard, although still in early development, are primarily focused on GPU execution. In this study, we analyze how the different OpenACC compilers available under certain premises behave when the clauses affecting the underlying block geometry implementation are modified. These clauses are the gang number, worker number, and vector size defined by the standard.
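For readers unfamiliar with the clauses mentioned above, here is a minimal hand-written OpenACC example in C showing where num_gangs, num_workers, and vector_length appear; the specific values are arbitrary and the kernel is not one of the codes studied in the paper.

/* Minimal OpenACC SAXPY; the num_gangs / num_workers / vector_length
 * clauses control the block geometry the compiler generates for the
 * accelerator. Values here are arbitrary examples. */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop num_gangs(256) num_workers(4) vector_length(128) \
        copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}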

Lecture Notes in Computer Science, 2017
Supercomputers are becoming more heterogeneous. They are composed of several machines with different computation capabilities and different kinds and families of accelerators, such as GPUs or Intel Xeon Phi coprocessors. Programming these machines is a hard task that requires a deep study of the architectural details in order to exploit each computational unit efficiently. In this paper, we present an extension of a GPU-CPU heterogeneous programming model to include support for Intel Xeon Phi coprocessors. This contribution extends the previous model and its implementation by taking advantage of both the GPU communication model and the CPU execution model of the original approach, to derive a new approach for the Xeon Phi. Our experimental results show that, using our approach, the programming effort needed for changing the kind of target device is greatly reduced for several study cases. For example, using our model to program a Mandelbrot benchmark, 97% of the application code is reused between a GPU implementation and a Xeon Phi implementation.

International Journal of Parallel Programming, 2018
Dataflow programming consists in developing a program by describing its sequential stages and the interactions between them. The runtimes supporting this kind of programming are responsible for exploiting the parallelism by concurrently executing the different stages as soon as their dependencies are met. In this paper we introduce a new parallel programming model and framework based on the dataflow paradigm. It presents a new combination of features that makes it easy to map programs to shared or distributed memory, exploiting data locality and affinity to obtain the same performance as optimized coarse-grain MPI programs. These features include: it is a unique one-tier model that supports hybrid shared- and distributed-memory systems with the same abstractions; it can express activities arbitrarily linked, including non-nested cycles; it uses internally a distributed work-stealing mechanism to allow Multiple-Producer/Multiple-Consumer configurations; and it has a runtime mechanism for the reconfiguration of the dependences and communication channels which also allows the creation of task-to-task data affinities. We present an evaluation using examples of different classes of applications. Experimental results show that programs generated using this framework deliver good performance in hybrid distributed- and shared-memory environments, with a development effort similar to that of other dataflow programming models oriented to shared memory.
The Journal of Supercomputing, 2018
Concurrency and Computation: Practice and Experience, 2018

2017 25th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), 2017
OpenACC is a parallel programming model for hardware accelerators, such as GPUs or Xeon Phi, which has been in development for several years now. During this time, different compilers have appeared, both commercial and open source, which are still under development. Because both the OpenACC standard and its implementations are relatively recent, we propose a benchmark suite specifically designed to check the performance of the OpenACC features in the code generated by different compilers on different architectures. Our benchmark suite is named TORMENT OpenACC2016. Along with this tool we have developed an adequate metric for the comparison of performance among different machine-compiler pairs, which we have named the TORMENT ACC2016 Score. Version 1 of TORMENT OpenACC2016, presented in this paper, contains six benchmarks and is available online.

Lecture Notes in Computer Science, 2016
OpenACC has been under development for a few years now. The OpenACC 2.5 specification was recently made public, and there are some initiatives for developing full implementations of the standard to make use of accelerator capabilities. There is much to be done yet but, currently, OpenACC for GPUs is reaching a good maturity level in various implementations of the standard, using CUDA and OpenCL as backends. Nvidia is investing in this project and has released an OpenACC Toolkit, including the PGI Compiler. There are, however, more developments out there. In this work, we analyze different available OpenACC compilers that have been developed by companies or universities during the last years. We check their performance and maturity, keeping in mind that OpenACC is designed to be used without extensive knowledge about parallel programming. Our results show that the compilers are on their way to a reasonable maturity, presenting different strengths and weaknesses.

The Journal of Supercomputing, 2016
Parallelization of sequential applications requires extracting information about the loops and how their variables are accessed, and afterwards augmenting the source code with extra code depending on such information. In this paper we propose a framework that avoids such an error-prone, time-consuming task. Our solution leverages the compile-time information extracted from the source code to classify all variables used inside each loop according to their accesses. Then, our system, called BFCA+, automatically instruments the source code with the necessary OpenMP directives and clauses to allow its parallel execution, using the standard shared and private clauses for variable classification. The framework is also capable of instrumenting loops for speculative parallelization with the help of the ATLaS runtime system, which defines a new speculative clause to point out those variables that may lead to a dependency violation. As a result, the target loop is guaranteed to run correctly in parallel, ensuring that its execution follows sequential semantics even in the presence of dependency violations. Our experimental evaluation shows that the framework not only saves development time, but also leads to faster code than the manually parallelized one.
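As a hand-written illustration of the kind of annotation described above (classification of loop variables into shared and private clauses), consider the small loop below. The ATLaS speculative clause mentioned in the abstract is the authors' extension and is not standard OpenMP, so it is not reproduced here, and the loop itself is not taken from the paper.

/* Hand-written example of the shared/private classification that a tool
 * like BFCA+ would emit automatically for a simple loop nest. */
void scale_rows(int n, int m, double *a, const double *w)
{
    int i, j;
    double tmp;   /* defined and used only within an iteration -> private */

    #pragma omp parallel for default(none) shared(n, m, a, w) private(i, j, tmp)
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            tmp = a[i * m + j] * w[i];   /* a and w are shared across threads */
            a[i * m + j] = tmp;
        }
    }
}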

This paper presents an extension that adds XML capabilities to Cetus, a source-to-source compiler developed by Purdue University. In this work, the Cetus Intermediate Representation is converted into an XML DOM tree that, in turn, enables XML capabilities, such as searching for specific code features through XPath expressions. As an example, we write XPath code to find private and shared variables for parallel execution in C source code. Loopest is a Java program with embedded XPath expressions. While Cetus needs 2573 lines of internal Java code to locate private variables in an input code, Loopest needs a total of only 425 lines of code to determine the same private variables in the equivalent XML representation. Using XPath as the search method provides a second advantage over Cetus: extensibility. Changes in Cetus require a deep knowledge of Java, Cetus internal structure, and its Intermediate Representation. Moreover, changes in Loopest are easier because it only depends on XPath to ...