Papers by Enrique S. Quintana-Ortí
Lecture Notes in Computer Science, 2011
We propose an efficient implementation of the Balanced Truncation (BT) method for model order reduction when the state-space matrix is symmetric positive definite. Most of the computational effort required by this method is due to the computation of matrix inverses. Two alternatives for the inversion of a symmetric positive definite matrix on multi-core platforms are studied and evaluated: the traditional approach based on the Cholesky factorization and the Gauss-Jordan elimination algorithm. Implementations of both methods have been developed and tested. Numerical experiments show the efficiency attained by the proposed implementations on the target architecture.
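The Cholesky-based alternative compared in this abstract can be sketched in a few lines of NumPy. This is only an illustrative serial version under stated assumptions, not the paper's multi-core implementation; the function name `spd_inverse_cholesky` is invented here:

```python
import numpy as np

def spd_inverse_cholesky(A):
    """Invert a symmetric positive definite matrix via its Cholesky factor.

    A = L L^T, hence A^{-1} = L^{-T} L^{-1}: form L^{-1} by a triangular
    solve against the identity, then multiply the two factors.
    (Illustrative sketch; not the paper's blocked multi-core code.)
    """
    n = A.shape[0]
    L = np.linalg.cholesky(A)              # A = L L^T, L lower triangular
    Linv = np.linalg.solve(L, np.eye(n))   # L^{-1}
    return Linv.T @ Linv                   # A^{-1} = L^{-T} L^{-1}

# Example: build a small SPD matrix as B^T B + n I and invert it
rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B.T @ B + 4.0 * np.eye(4)
Ainv = spd_inverse_cholesky(A)
```

In production libraries this step maps to the LAPACK routine pair potrf/potri rather than an explicit triangular solve against the identity.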
Message from the PDSEC-10 workshop chairs
2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010
Welcome to the 11th IEEE International Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC-10), held on April 23, 2010 in Atlanta, Georgia, USA, in conjunction with the 24th IEEE Int. Parallel and Distributed Processing Symposium (IPDPS 2010).
Fast and Reliable Noise Estimation for Hyperspectral Subspace Identification
IEEE Geoscience and Remote Sensing Letters, 2015
In this letter, we introduce an efficient algorithm to estimate the noise correlation matrix in the initial stage of the hyperspectral signal identification by minimum error (HySime) method, commonly used for signal subspace identification in remotely sensed hyperspectral images. Compared with the current implementations of this stage, the new algorithm for noise estimation relies on the numerically reliable QR factorization, producing correct results even when operating with single-precision arithmetic. Additionally, our algorithm exhibits a lower computational cost and is highly parallel. Experiments on a multicore server, using two real hyperspectral scenes, show that these theoretical advantages carry over to the practical results.
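The numerical point made here, namely that a QR-based least-squares residual stays accurate where forming the normal equations may not, can be illustrated with a small generic sketch. This is not HySime's per-band regression itself; `qr_residual` and the synthetic data are assumptions made for illustration:

```python
import numpy as np

def qr_residual(Z, y):
    """Least-squares residual y - Z beta computed via the QR factorization
    of Z, avoiding the squared condition number incurred by forming the
    normal equations Z^T Z beta = Z^T y.
    (Generic illustration, not the HySime noise-estimation stage itself.)
    """
    Q, R = np.linalg.qr(Z)              # reduced QR: Z = Q R
    beta = np.linalg.solve(R, Q.T @ y)  # back-substitution on R
    return y - Z @ beta                 # the residual estimates the noise

# Synthetic regression: y is a linear combination of Z's columns plus noise
rng = np.random.default_rng(2)
Z = rng.standard_normal((100, 5))
y = Z @ rng.standard_normal(5) + 0.01 * rng.standard_normal(100)
r = qr_residual(Z, y)
```

The residual is orthogonal to the column space of Z, which is exactly the property that survives single-precision arithmetic better through QR than through the normal equations.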

16th Euromicro Conference on Parallel, Distributed and Network-Based Processing (PDP 2008), 2008
This paper examines the scalable parallel implementation of QR factorization of a general matrix, targeting SMP and multi-core architectures. Two implementations of algorithms-by-blocks are presented. Each implementation views a block of a matrix as the fundamental unit of data, and likewise, operations over these blocks as the primary unit of computation. The first is a conventional blocked algorithm similar to those included in libFLAME and LAPACK but expressed in a way that allows operations in the so-called critical path of execution to be computed as soon as their dependencies are satisfied. The second algorithm captures a higher degree of parallelism with an approach based on Givens rotations while preserving the performance benefits of algorithms based on blocked Householder transformations. We show that the implementation effort is greatly simplified by expressing the algorithms in code with the FLAME/FLASH API, which allows matrices stored by blocks to be viewed and managed as matrices of matrix blocks. The SuperMatrix run-time system utilizes FLASH to assemble and represent matrices but also provides out-of-order scheduling of operations that is transparent to the programmer. Scalability of the solution is demonstrated on a ccNUMA platform with 16 processors.
Lecture Notes in Computer Science, 2007
llc is a language based on C where parallelism is expressed using compiler directives. The llc compiler produces MPI code which can be ported to both shared and distributed memory systems. In this work we focus our attention on the llc implementation of the Workqueuing Model. This model is an extension of the OpenMP standard that allows an elegant implementation of irregular parallelism. We evaluate our approach by comparing the OpenMP and llc parallelizations of the symmetric rank-k update operation on shared and distributed memory parallel platforms.
Lecture Notes in Computer Science, 2010
We investigate the performance of two approaches for matrix inversion based on Gaussian (LU factorization) and Gauss-Jordan eliminations. The target architecture is a current general-purpose multicore processor connected to a graphics processor (GPU). Parallelism is extracted in both processors by linking sequential versions of the codes with multi-threaded implementations of BLAS. Our results on a system with two Intel QuadCore processors and a Tesla C1060 GPU illustrate the performance and scalability attained by the codes on this system.

High Performance Matrix Inversion on a Multi-core Platform with Several GPUs
2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing, 2011
Inversion of large-scale matrices appears in a few scientific applications, such as model reduction or optimal control. Matrix inversion requires an important computational effort and, therefore, the application of high-performance computing techniques and architectures for matrices with dimension in the order of thousands. Following the recent rise of graphics processors (GPUs), we present and evaluate high-performance codes for matrix inversion, based on Gauss-Jordan elimination with partial pivoting, which off-load the main computational kernels to one or more GPUs while performing fine-grain operations on the general-purpose processor. The target architecture consists of a multi-core processor connected to several GPUs. Parallelism is extracted from parallel implementations of BLAS and from the concurrent execution of operations in the available computational units. Numerical experiments on a system with two Intel QuadCore processors and four NVIDIA C1060 GPUs illustrate the efficiency and scalability of the different implementations, which deliver over 1.2 × 10^12 floating-point operations per second.
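The method named above, Gauss-Jordan elimination with partial pivoting applied to matrix inversion, can be sketched in an unblocked, single-threaded form. The paper's codes are blocked and off-load the update kernels to GPUs; this sketch only shows the underlying arithmetic:

```python
import numpy as np

def gauss_jordan_inverse(A):
    """Gauss-Jordan inversion with partial pivoting.

    The augmented system [A | I] is reduced until the left half becomes
    the identity, at which point the right half holds A^{-1}.
    (Unblocked illustration; the paper uses blocked GPU kernels.)
    """
    n = A.shape[0]
    M = np.hstack([A.astype(float), np.eye(n)])
    for k in range(n):
        p = k + np.argmax(np.abs(M[k:, k]))  # partial pivoting
        M[[k, p]] = M[[p, k]]                # swap pivot row into place
        M[k] /= M[k, k]                      # scale pivot row
        for i in range(n):
            if i != k:
                M[i] -= M[i, k] * M[k]       # eliminate column k elsewhere
    return M[:, n:]

A = np.array([[4.0, 2.0], [1.0, 3.0]])
Ainv = gauss_jordan_inverse(A)
```

Unlike LU-based inversion, every step updates the whole trailing matrix, which is exactly what makes the method attractive on GPUs: the bulk of the work is one large, regular rank-1 (or blocked rank-b) update per step.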
2011 International Conference on High Performance Computing & Simulation, 2011
We introduce high-performance implementations of the inversion of a symmetric positive definite matrix. Two alternatives are studied and evaluated: the traditional approach based on the Cholesky factorization and the Gauss-Jordan elimination algorithm. Several implementations of the two algorithms are developed on a hybrid architecture equipped with a general-purpose multi-core processor and a graphics processor. Numerical experiments show the efficiency attained by the proposed implementations on the target architecture.
Out-of-core solution of linear systems on graphics processors
International Journal of Parallel, Emergent and Distributed Systems, 2009
We combine two high-level application programming interfaces to solve large-scale linear systems with the data stored on disk using current graphics processors. The result is a simple yet powerful tool that enables the fast development of object-oriented codes, implemented as MATLAB M-scripts, for linear algebra operations. The approach enhances the programmability of the solutions in this problem domain while unleashing the high performance of graphics processors. Experimental …
Lecture Notes in Computer Science, 2004
In this paper we present our efforts towards the design and development of a parallel version of the Scientific Library from GNU using MPI and OpenMP. Two well-known operations arising in discrete mathematics and sparse linear algebra illustrate the architecture and interfaces of the system. Our approach, though being a general high-level proposal, achieves for these two particular examples a performance close to that obtained by an ad hoc parallel programming implementation.

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2014
Wider coverage of observation missions will increase onboard power restrictions while, at the same time, posing higher demands from the perspective of processing time, thus calling for the exploration of novel high-performance and low-power processing architectures. In this paper, we analyze the acceleration of spectral unmixing, a key technique to process hyperspectral images, on multicore architectures. To meet onboard processing restrictions, we employ a low-power Digital Signal Processor (DSP), comparing processing time and energy consumption with those of a representative set of commodity architectures. We demonstrate that DSPs offer a fair balance between ease of programming, performance, and energy consumption, resulting in a highly appealing platform to meet the restrictions of current missions if onboard processing is required.
Evaluation of the Energy Performance of Dense Linear Algebra Kernels on Multi-core and Many-Core Processors
2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, 2011
Maribel Castillo, Manel Dolz, Juan C. Fernández, Rafael Mayo, Enrique S. Quintana-Ortí, Vicente Roca. In the evaluation, power measurements are obtained at two different places: a commercial (external) AC power meter (Watts Up Pro .net) samples power …
Lecture Notes in Computer Science, 2006
We present a systematic methodology for deriving and implementing linear algebra libraries. It is quite common that an application requires a library of routines for the computation of linear algebra operations that are not (exactly) supported by commonly used libraries like LAPACK. In this situation, the application developer has the option of casting the operation into one supported by an existing library, often at the expense of performance, or implementing a custom library, often requiring considerable effort. Our recent discovery of a methodology based on the formal derivation of algorithms allows such a user to quickly derive proven correct algorithms. Furthermore, it provides an API that allows the so-derived algorithms to be quickly translated into high-performance implementations.

Analysis of Strategies to Save Energy for Message-Passing Dense Linear Algebra Kernels
2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2012
In this paper we analyze the impact that energy-saving strategies, like the application of DVFS via Linux governors and the MPI communication mode, have on the performance and energy consumption of message-passing dense linear algebra operations. In the study, we employ codes from ScaLAPACK for three matrix kernels, the matrix-matrix and matrix-vector products and the Cholesky factorization, which exhibit different levels of concurrency and CPU/memory activity. Following a recent trend, we also include an accelerated version of the matrix-matrix product that off-loads all computation to a graphics processor and study the energy gains of this hybrid solver when the general-purpose cores of the system are promoted to a low consuming mode. Experimental results on a cluster equipped with state-of-the-art computation and communication hardware illustrate the results of this study.
2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum, 2011
This paper presents a detailed analysis of a mixed precision iterative refinement solver applied to a linear system obtained from the 2D discretization of a fluid flow problem. The total execution time and energy consumption of different software and hardware implementations are measured and compared with those of a plain GMRES-based solver in double precision. The time and energy consumption of individual parts of the algorithm are monitored as well, enabling a deeper insight and the possibility of optimizing the energy consumption of the code on a general-purpose multi-core architecture and on systems accelerated by a graphics processor.

2011 International Green Computing Conference and Workshops, 2011
Energy efficiency is a major concern in modern high-performance computing. Still, few studies provide a deep insight into the power consumption of scientific applications. Especially for algorithms running on hybrid platforms equipped with hardware accelerators, like graphics processors, a detailed energy analysis is essential to identify the most costly parts and to evaluate possible improvement strategies. In this paper we analyze the computational and power performance of iterative linear solvers applied to sparse systems arising in several scientific applications. We also study the gains yielded by dynamic voltage/frequency scaling (DVFS), and illustrate that this technique alone cannot reduce the energy cost by a considerable amount for iterative linear solvers. We then apply techniques that set the (multi-core processor in the) host system to a low-consuming state for the time that the GPU is executing. Our experiments conclusively reveal how the combination of these two techniques delivers a notable reduction of energy consumption without a noticeable impact on computational performance.
Parallel model reduction of large-scale unstable systems
Advances in Parallel Computing, 2004
Lecture Notes in Computer Science, 2008
We present several algorithms to compute the solution of a linear system of equations on a GPU, as well as general techniques to improve their performance, such as padding and hybrid GPU-CPU computation. We also show how iterative refinement with mixed precision can be used to regain full accuracy in the solution of linear systems. Experimental results on a G80 using CUBLAS 1.0, the implementation of BLAS for NVIDIA GPUs with unified architecture, illustrate the performance of the different algorithms and techniques proposed.

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2009
While general-purpose homogeneous multi-core architectures are becoming ubiquitous, there are clear indications that, for a number of important applications, a better performance/power ratio can be attained using specialized hardware accelerators. These accelerators require specific SDKs or programming languages which are not always easy to program. Thus, the impact of the new programming paradigms on the programmer's productivity will determine their success in the high-performance computing arena. In this paper we present GPU Superscalar (GPUSs), an extension of the Star Superscalar programming model that targets the parallelization of applications on platforms consisting of a general-purpose processor connected with multiple graphics processors. GPUSs deals with architecture heterogeneity and separate memory address spaces, while preserving simplicity and portability. Preliminary experimental results for a well-known operation in numerical linear algebra illustrate the correct adaptation of the runtime to a multi-GPU system, attaining notable performance results.
Journal of Supercomputing, 2000
In this paper we present parallel algorithms for stabilizing large linear control systems on multicomputers. Our algorithms first separate the stable part of the linear control system and then compute a stabilizing feedback for the unstable part. Both stages are solved by means of the matrix sign function, which presents a high degree of parallelism and scalability.
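The matrix sign function used in both stages is classically computed with the Newton iteration X_{k+1} = (X_k + X_k^{-1})/2, which is dominated by matrix inversions and therefore parallelizes well. A minimal serial sketch, not the paper's multicomputer implementation:

```python
import numpy as np

def matrix_sign(A, tol=1e-12, max_iter=100):
    """Newton iteration for the matrix sign function:
    X_{k+1} = (X_k + X_k^{-1}) / 2, starting from X_0 = A.

    Converges quadratically when A has no eigenvalues on the imaginary
    axis; sign(A) has eigenvalue -1 (+1) where the corresponding
    eigenvalue of A has negative (positive) real part, which is what
    separates the stable from the unstable part of a system.
    """
    X = A.astype(float)
    for _ in range(max_iter):
        Xnew = 0.5 * (X + np.linalg.inv(X))
        if np.linalg.norm(Xnew - X, 1) <= tol * np.linalg.norm(Xnew, 1):
            return Xnew
        X = Xnew
    return X

# Eigenvalues -2 and 3 map to sign eigenvalues -1 and +1
A = np.diag([-2.0, 3.0])
S = matrix_sign(A)
```

Because each step is essentially one matrix inversion, the same distributed inversion kernels discussed elsewhere on this page supply the parallelism.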