Skip to main content

Renato Ferreira

Followers

0

Following

1

Public Views

Leipzig University (Universität Leipzig)

The University of Sheffield

Sebastian Kvist

University of Toronto

Mohammad Hossein Karimi-Jafari

University of Tehran

IISER, Pune

Ege University

Wichita State University

Institut Studi Islam Fahmina

Banda University of Agriculture and Technology

Rosario Mata López

Universidad Nacional Autónoma de México

Interests

Uploads

Papers by Renato Ferreira

The Virtual Microscope

Proceedings : a conference of the American Medical Informatics Association / ... AMIA Annual Fall Symposium. AMIA Fall Symposium, 1997

We present the design of the Virtual Microscope, a software system employing a client/server arch... more We present the design of the Virtual Microscope, a software system employing a client/server architecture to provide a realistic emulation of a high power light microscope. We discuss several technical challenges related to providing the performance necessary to achieve rapid response time, mainly in dealing with the enormous amounts of data (tens to hundreds of gigabytes per slide) that must be retrieved from secondary storage and processed. To effectively implement the data server, the system design relies on the computational power and high I/O throughput available from an appropriately configured parallel computer.

Economically-efficient sentiment stream analysis

Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014

Text-based social media channels, such as Twitter, produce torrents of opinionated data about the... more Text-based social media channels, such as Twitter, produce torrents of opinionated data about the most diverse topics and entities. The analysis of such data (aka. sentiment analysis) is quickly becoming a key feature in recommender systems and search engines. A prominent approach to sentiment analysis is based on the application of classification techniques, that is, content is classified according to the attitude of the writer. A major challenge, however, is that Twitter follows the data stream model, and thus classifiers must operate with limited resources, including labeled data and time for building classification models. Also challenging is the fact that sentiment distribution may change as the stream evolves. In this paper we address these challenges by proposing algorithms that select relevant training instances at each time step, so that training sets are kept small while providing to the classifier the capabilities to suit itself to, and to recover itself from, different types of sentiment drifts. Simultaneously providing capabilities to the classifier, however, is a conflicting-objective problem, and our proposed algorithms employ basic notions of Economics in order to balance both capabilities. We performed the analysis of events that reverberated on Twitter, and the comparison against the state-of-the-art reveals improvements both in terms of error reduction (up to 14%) and reduction of training resources (by orders of magnitude).

Profiling general purpose GPU applications

Proceedings - Symposium on Computer Architecture and High Performance Computing, 2009

Coordinating the use of GPU and CPU for improving performance of compute intensive applications

Proceedings - IEEE International Conference on Cluster Computing, ICCC, 2009

Smart surveillance framework: A versatile tool for video analysis

IEEE Winter Conference on Applications of Computer Vision, 2014

A Scalable Parallel Deduplication Algorithm

19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07), 2007

Speeding Up Learning in Real-Time Search through Parallel Computing

2011 23rd International Symposium on Computer Architecture and High Performance Computing, 2011

Real-time search algorithms solve the problem of path planning, regardless the size and complexit... more Real-time search algorithms solve the problem of path planning, regardless the size and complexity of the maps, and the massive presence of entities in the same environment. In such methods, the learning step aims to avoid local minima and improve the results for future searches, ensuring the convergence to the optimal path when the same planning task is solved repeatedly. However, performing search in a limited area due to real-time constraints makes the run to convergence a lengthy process. In this work, we present a parallelization strategy that aims to reduce the time to convergence, maintaining the real-time properties of the search. The parallelization technique consists on using auxiliary searches without the real-time restrictions present in the main search. In addition, the same learning is shared by all searches. The empirical evaluation shows that even with the additional cost required to coordinate the auxiliary searches, the reduction in time to convergence is significant, showing gains from searches occurring in environments with fewer local minima to larger searches on complex maps, where performance improvement is even better.

The Supramap project: linking pathogen genomes with geography to fight emergent infectious diseases

Cladistics, 2011

Novel pathogens have the potential to become critical issues of national security, public health ... more Novel pathogens have the potential to become critical issues of national security, public health and economic welfare. As demonstrated by the response to Severe Acute Respiratory Syndrome (SARS) and influenza, genomic sequencing has become an important method for diagnosing agents of infectious disease. Despite the value of genomic sequences in characterizing novel pathogens, raw data on their own do not provide the information needed by public health officials and researchers. One must integrate knowledge of the genomes of pathogens with host biology and geography to understand the etiology of epidemics. To these ends, we have created an application called Supramap (http://supramap.osu.edu) to put information on the spread of pathogens and key mutations across time, space and various hosts into a geographic information system (GIS). To build this application, we created a web service for integrated sequence alignment and phylogenetic analysis as well as methods to describe the tree, mutations, and host shifts in Keyhole Markup Language (KML). We apply the application to 239 sequences of the polymerase basic 2 (PB2) gene of recent isolates of avian influenza (H5N1). We map a mutation, glutamic acid to lysine at position 627 in the PB2 protein (E627K), in H5N1 influenza that allows for increased replication of the virus in mammals. We use a statistical test to support the hypothesis of a correlation of E627K mutations with avian-mammalian host shifts but reject the hypothesis that lineages with E627K are moving westward. Data, instructions for use, and visualizations are included as supplemental materials at:

Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads

Parallel Computing, 2014

Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same ... more Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same processing unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is one architecture recently proposed that shares instruction decoding and execution between threads running the same program in an SMT processor, thereby generalizing the approach followed by Graphics Processing Units to general-purpose processors. In this paper we propose new ways to expose redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristic that handles function calls better than previous approaches. Our heuristic only inspects the program counter and the stack frame to reconverge threads; hence, it is amenable to efficient and inexpensive hardware implementation. Second, we demonstrate that this heuristic is able to reveal the existence of substantial regularity in interthread memory access patterns. We validate our results on data-parallel applications from the PARSEC and SPLASH suites. Our new reconvergence heuristic increases the throughput of our MMT model by 7%, when compared to a previous, and substantially more complex approach, due to Long et al. Moreover, it gives us an effective way to increase regularity in memory accesses. We have observed that over 70% of simultaneous memory accesses are either the same for all the threads, or are affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses.

Performance evaluation of client-server architectures for large-scale image-processing applications

The goal of this study is to conduct a performance evaluation of the Virtual Microscope, a softwa... more

Mapeamento de Programas I3 para Aplicações Anthill Paralelas de Fluxos de Dados baseadas em Filtros

Anais do VI Workshop em Sistemas Computacionais de Alto Desempenho (WSCAD 2005)

Aplicações atuais de mineração de dados, simulação e visualização científica oferecem várias opor... more Aplicações atuais de mineração de dados, simulação e visualização científica oferecem várias oportunidades de paralelismo por serem iterativas, irregulares e intensivas em termos de EIS (programas l3). O mapeamento e escalonamento de programas l3 para aplicações paralelas de fluxos de dados baseadas em filtros é bastante complexo, pois eles devem considerar aspectos de localidade, dependência de dados e tarefas. A platafoma Anthill provê um modelo de programação adequado para implementação e execução de aplicações paralelas baseadas em filtros. Portanto, neste trabalho, nossos objetivos principais são: a proposta e implementação do algoritmo AnthillPart para o mapeamento de um grafo de tarefas de um programa l3 em filtros; a análise do desempenho das aplicações mapeadas pelo AnthillPart e escalonadas pelo AnthillSched.

Application Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs: A Case Study with Microscopy Image Analysis

The international journal of high performance computing applications, 2017

We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many In... more We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many Integrated Core-MIC) with a microscopy image analysis application. We experimentally evaluate the performance of computing devices on core operations of the application. We correlate the observed performance with the characteristics of computing devices and data access patterns, computation complexities, and parallelization forms of the operations. The results show a significant variability in the performance of operations with respect to the device used. The performances of operations with regular data access are comparable or sometimes better on a MIC than that on a GPU. GPUs are more efficient than MICs for operations that access data irregularly, because of the lower bandwidth of the MIC for random data accesses. We propose new performance-aware scheduling strategies that consider variabilities in operation speedups. Our scheduling strategies significantly improve application performan...

Hierarchical Density-Based Clustering Based on GPU Accelerated Data Indexing Strategy

Procedia Computer Science, 2016

Due the recent increase of the volume of data that has been generated, organizing this data has b... more Due the recent increase of the volume of data that has been generated, organizing this data has become one of the biggest problems in Computer Science. Among the different strategies propose to deal efficiently and effectively for this purpose, we highlight those related to clustering, more specifically, density-based clustering strategies, which stands out for its ability to define clusters of arbitrary shape and the robustness to deal with the presence of data noise, such as DBSCAN and OPTICS. However, these algorithms are still a computational challenge since they are distance-based proposals. In this work we present a new approach to make OPTICS feasible based on data indexing strategy. Although the simplicity with which the data are indexed, using graphs, it allows explore various parallelization opportunities, which were explored using graphic processing unit (GPU). Based on this structure, the complexity of OPTICS is reduced to O(E * logV) in the worst case, becoming itself very fast. In our evaluation we show that our proposal can be over 200x faster than its sequential version using CPU.

Executing Multiple Pipelined Data Analysis Operations in the Grid

ACM/IEEE SC 2002 Conference (SC'02), 2002

Processing of data in many data analysis applications can be represented as an acyclic, coarse gr... more Processing of data in many data analysis applications can be represented as an acyclic, coarse grain data flow, from data sources to the client. This paper is concerned with scheduling of multiple data analysis operations, each of which is represented as a pipelined chain of processing on data. We define the scheduling problem for effectively placing components onto Grid resources, and propose two scheduling algorithms. Experimental results are presented using a visualization application.

A caGRID-ENABLED, LEARNING BASED IMAGE SEGMENTATION METHOD FOR HISTOPATHOLOGY SPECIMENS

Proceedings / IEEE International Symposium on Biomedical Imaging: from nano to macro. IEEE International Symposium on Biomedical Imaging, 2009

Accurate segmentation of tissue microarrays is a challenging topic because of some of the similar... more Accurate segmentation of tissue microarrays is a challenging topic because of some of the similarities exhibited by normal tissue and tumor regions. Processing speed is another consideration when dealing with imaged tissue microarrays as each microscopic slide may contain hundreds of digitized tissue discs. In this paper, a fast and accurate image segmentation algorithm is presented. Both a whole disc delineation algorithm and a learning based tumor region segmentation approach which utilizes multiple scale texton histograms are introduced. The algorithm is completely automatic and computationally efficient. The mean pixel-wise segmentation accuracy is about 90%. It requires about 1 second for whole disc (1024×1024 pixels) segmentation and less than 5 seconds for segmenting tumor regions. In order to enable remote access to the algorithm and collaborative studies, an analytical service is implemented using the caGrid infrastructure. This service wraps the algorithm and provides inte...

Data and Instruction Uniformity in Minimal Multi-threading

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012

Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same ... more Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same instruction fetching unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is a technique recently proposed to share instructions and execution between threads in a SMT machine. In this paper we propose new ways to explore redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristics that handles function calls better than previous approaches. Second, we demonstrate the existence of substantial regularity in inter-thread memory access patterns. We validate our results on the four data-parallel applications present in the PARSEC benchmark suite. The new thread reconvergence heuristics is, on the average, 82% more efficient than MMT's original reconvergence method. Furthermore, about 69% to 87% of all the memory addresses are either the same for all the threads, or are affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses.

Optimizing reduction computations in a distributed environment

We investigate runtime strategies for data-intensive applications that involve generalized reduct... more We investigate runtime strategies for data-intensive applications that involve generalized reductions on large, distributed datasets. Our set of strategies includes replicated filter state, partitioned filter state, and hybrid options between these two extremes. We evaluate these strategies using emulators of three real applications, different query and output sizes, and a number of configurations. We consider execution in a homogeneous cluster and in a distributed environment where only a subset of nodes host the data. Our results show replicating the filter state scales well and outperforms other schemes, if sufficient memory is available and sufficient computation is involved to offset the cost of global merge step. In other cases, hybrid is usually the best. Moreover, in almost all cases, the performance of the hybrid strategy is quite close to the best strategy. Thus, we believe that hybrid is an attractive approach when the relative performance of different schemes cannot be predicted.

Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems

2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing, 2014

High performance computing is experiencing a major paradigm shift with the introduction of accele... more High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of its resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finergrain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperforms other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time-HEFT, in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales.

High Level Programming Methodologies for Data Intensive Computations

Lecture Notes in Computer Science, 2000

Querying very large multi-dimensional datasets in ADR

Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '99, 1999

Applications that make use of very large scientific datasets have become an increasingly importan... more Applications that make use of very large scientific datasets have become an increasingly important subset of scientific applications. In these applications, datasets are often multi-dimensional, i.e., data items are associated with points in a multi-dimensional attribute space, and access to data items is described by range queries. The basic processing involves mapping input data items to output data items, and some form of aggregation of all the input data items that project to the each output data item. We have developed an infrastructure, called the Active Data Repository (ADR), that integrates storage, retrieval and processing of multi-dimensional datasets on distributed-memory parallel architectures with multiple disks attached to each node. In this paper we address efficient execution of range queries on distributed memory parallel machines within ADR framework. We present three potential strategies, and evaluate them under different application scenarios and machine configurations. We present experimental results on the scalability and performance of the strategies on a 128-node IBM SP.

The Virtual Microscope

Proceedings : a conference of the American Medical Informatics Association / ... AMIA Annual Fall Symposium. AMIA Fall Symposium, 1997

We present the design of the Virtual Microscope, a software system employing a client/server arch... more We present the design of the Virtual Microscope, a software system employing a client/server architecture to provide a realistic emulation of a high power light microscope. We discuss several technical challenges related to providing the performance necessary to achieve rapid response time, mainly in dealing with the enormous amounts of data (tens to hundreds of gigabytes per slide) that must be retrieved from secondary storage and processed. To effectively implement the data server, the system design relies on the computational power and high I/O throughput available from an appropriately configured parallel computer.

Economically-efficient sentiment stream analysis

Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014

Text-based social media channels, such as Twitter, produce torrents of opinionated data about the... more Text-based social media channels, such as Twitter, produce torrents of opinionated data about the most diverse topics and entities. The analysis of such data (aka. sentiment analysis) is quickly becoming a key feature in recommender systems and search engines. A prominent approach to sentiment analysis is based on the application of classification techniques, that is, content is classified according to the attitude of the writer. A major challenge, however, is that Twitter follows the data stream model, and thus classifiers must operate with limited resources, including labeled data and time for building classification models. Also challenging is the fact that sentiment distribution may change as the stream evolves. In this paper we address these challenges by proposing algorithms that select relevant training instances at each time step, so that training sets are kept small while providing to the classifier the capabilities to suit itself to, and to recover itself from, different types of sentiment drifts. Simultaneously providing capabilities to the classifier, however, is a conflicting-objective problem, and our proposed algorithms employ basic notions of Economics in order to balance both capabilities. We performed the analysis of events that reverberated on Twitter, and the comparison against the state-of-the-art reveals improvements both in terms of error reduction (up to 14%) and reduction of training resources (by orders of magnitude).

Profiling general purpose GPU applications

Proceedings - Symposium on Computer Architecture and High Performance Computing, 2009

Coordinating the use of GPU and CPU for improving performance of compute intensive applications

Proceedings - IEEE International Conference on Cluster Computing, ICCC, 2009

Smart surveillance framework: A versatile tool for video analysis

IEEE Winter Conference on Applications of Computer Vision, 2014

A Scalable Parallel Deduplication Algorithm

19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD'07), 2007

Speeding Up Learning in Real-Time Search through Parallel Computing

2011 23rd International Symposium on Computer Architecture and High Performance Computing, 2011

Real-time search algorithms solve the problem of path planning, regardless the size and complexit... more Real-time search algorithms solve the problem of path planning, regardless the size and complexity of the maps, and the massive presence of entities in the same environment. In such methods, the learning step aims to avoid local minima and improve the results for future searches, ensuring the convergence to the optimal path when the same planning task is solved repeatedly. However, performing search in a limited area due to real-time constraints makes the run to convergence a lengthy process. In this work, we present a parallelization strategy that aims to reduce the time to convergence, maintaining the real-time properties of the search. The parallelization technique consists on using auxiliary searches without the real-time restrictions present in the main search. In addition, the same learning is shared by all searches. The empirical evaluation shows that even with the additional cost required to coordinate the auxiliary searches, the reduction in time to convergence is significant, showing gains from searches occurring in environments with fewer local minima to larger searches on complex maps, where performance improvement is even better.

The Supramap project: linking pathogen genomes with geography to fight emergent infectious diseases

Cladistics, 2011

Novel pathogens have the potential to become critical issues of national security, public health ... more Novel pathogens have the potential to become critical issues of national security, public health and economic welfare. As demonstrated by the response to Severe Acute Respiratory Syndrome (SARS) and influenza, genomic sequencing has become an important method for diagnosing agents of infectious disease. Despite the value of genomic sequences in characterizing novel pathogens, raw data on their own do not provide the information needed by public health officials and researchers. One must integrate knowledge of the genomes of pathogens with host biology and geography to understand the etiology of epidemics. To these ends, we have created an application called Supramap (http://supramap.osu.edu) to put information on the spread of pathogens and key mutations across time, space and various hosts into a geographic information system (GIS). To build this application, we created a web service for integrated sequence alignment and phylogenetic analysis as well as methods to describe the tree, mutations, and host shifts in Keyhole Markup Language (KML). We apply the application to 239 sequences of the polymerase basic 2 (PB2) gene of recent isolates of avian influenza (H5N1). We map a mutation, glutamic acid to lysine at position 627 in the PB2 protein (E627K), in H5N1 influenza that allows for increased replication of the virus in mammals. We use a statistical test to support the hypothesis of a correlation of E627K mutations with avian-mammalian host shifts but reject the hypothesis that lineages with E627K are moving westward. Data, instructions for use, and visualizations are included as supplemental materials at:

Thread scheduling and memory coalescing for dynamic vectorization of SPMD workloads

Parallel Computing, 2014

Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same ... more Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same processing unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is one architecture recently proposed that shares instruction decoding and execution between threads running the same program in an SMT processor, thereby generalizing the approach followed by Graphics Processing Units to general-purpose processors. In this paper we propose new ways to expose redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristic that handles function calls better than previous approaches. Our heuristic only inspects the program counter and the stack frame to reconverge threads; hence, it is amenable to efficient and inexpensive hardware implementation. Second, we demonstrate that this heuristic is able to reveal the existence of substantial regularity in interthread memory access patterns. We validate our results on data-parallel applications from the PARSEC and SPLASH suites. Our new reconvergence heuristic increases the throughput of our MMT model by 7%, when compared to a previous, and substantially more complex approach, due to Long et al. Moreover, it gives us an effective way to increase regularity in memory accesses. We have observed that over 70% of simultaneous memory accesses are either the same for all the threads, or are affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses.

Performance evaluation of client-server architectures for large-scale image-processing applications

The goal of this study is to conduct a performance evaluation of the Virtual Microscope, a softwa... more

Mapeamento de Programas I3 para Aplicações Anthill Paralelas de Fluxos de Dados baseadas em Filtros

Anais do VI Workshop em Sistemas Computacionais de Alto Desempenho (WSCAD 2005)

Aplicações atuais de mineração de dados, simulação e visualização científica oferecem várias opor... more Aplicações atuais de mineração de dados, simulação e visualização científica oferecem várias oportunidades de paralelismo por serem iterativas, irregulares e intensivas em termos de EIS (programas l3). O mapeamento e escalonamento de programas l3 para aplicações paralelas de fluxos de dados baseadas em filtros é bastante complexo, pois eles devem considerar aspectos de localidade, dependência de dados e tarefas. A platafoma Anthill provê um modelo de programação adequado para implementação e execução de aplicações paralelas baseadas em filtros. Portanto, neste trabalho, nossos objetivos principais são: a proposta e implementação do algoritmo AnthillPart para o mapeamento de um grafo de tarefas de um programa l3 em filtros; a análise do desempenho das aplicações mapeadas pelo AnthillPart e escalonadas pelo AnthillSched.

Application Performance Analysis and Efficient Execution on Systems with multi-core CPUs, GPUs and MICs: A Case Study with Microscopy Image Analysis

The international journal of high performance computing applications, 2017

We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many In... more We carry out a comparative performance study of multi-core CPUs, GPUs and Intel Xeon Phi (Many Integrated Core-MIC) with a microscopy image analysis application. We experimentally evaluate the performance of computing devices on core operations of the application. We correlate the observed performance with the characteristics of computing devices and data access patterns, computation complexities, and parallelization forms of the operations. The results show a significant variability in the performance of operations with respect to the device used. The performances of operations with regular data access are comparable or sometimes better on a MIC than that on a GPU. GPUs are more efficient than MICs for operations that access data irregularly, because of the lower bandwidth of the MIC for random data accesses. We propose new performance-aware scheduling strategies that consider variabilities in operation speedups. Our scheduling strategies significantly improve application performan...

Hierarchical Density-Based Clustering Based on GPU Accelerated Data Indexing Strategy

Procedia Computer Science, 2016

Due the recent increase of the volume of data that has been generated, organizing this data has b... more Due the recent increase of the volume of data that has been generated, organizing this data has become one of the biggest problems in Computer Science. Among the different strategies propose to deal efficiently and effectively for this purpose, we highlight those related to clustering, more specifically, density-based clustering strategies, which stands out for its ability to define clusters of arbitrary shape and the robustness to deal with the presence of data noise, such as DBSCAN and OPTICS. However, these algorithms are still a computational challenge since they are distance-based proposals. In this work we present a new approach to make OPTICS feasible based on data indexing strategy. Although the simplicity with which the data are indexed, using graphs, it allows explore various parallelization opportunities, which were explored using graphic processing unit (GPU). Based on this structure, the complexity of OPTICS is reduced to O(E * logV) in the worst case, becoming itself very fast. In our evaluation we show that our proposal can be over 200x faster than its sequential version using CPU.

Executing Multiple Pipelined Data Analysis Operations in the Grid

ACM/IEEE SC 2002 Conference (SC'02), 2002

Processing of data in many data analysis applications can be represented as an acyclic, coarse gr... more Processing of data in many data analysis applications can be represented as an acyclic, coarse grain data flow, from data sources to the client. This paper is concerned with scheduling of multiple data analysis operations, each of which is represented as a pipelined chain of processing on data. We define the scheduling problem for effectively placing components onto Grid resources, and propose two scheduling algorithms. Experimental results are presented using a visualization application.

A caGRID-ENABLED, LEARNING BASED IMAGE SEGMENTATION METHOD FOR HISTOPATHOLOGY SPECIMENS

Proceedings / IEEE International Symposium on Biomedical Imaging: from nano to macro. IEEE International Symposium on Biomedical Imaging, 2009

Accurate segmentation of tissue microarrays is a challenging topic because of some of the similar... more Accurate segmentation of tissue microarrays is a challenging topic because of some of the similarities exhibited by normal tissue and tumor regions. Processing speed is another consideration when dealing with imaged tissue microarrays as each microscopic slide may contain hundreds of digitized tissue discs. In this paper, a fast and accurate image segmentation algorithm is presented. Both a whole disc delineation algorithm and a learning based tumor region segmentation approach which utilizes multiple scale texton histograms are introduced. The algorithm is completely automatic and computationally efficient. The mean pixel-wise segmentation accuracy is about 90%. It requires about 1 second for whole disc (1024×1024 pixels) segmentation and less than 5 seconds for segmenting tumor regions. In order to enable remote access to the algorithm and collaborative studies, an analytical service is implemented using the caGrid infrastructure. This service wraps the algorithm and provides inte...

Data and Instruction Uniformity in Minimal Multi-threading

2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing, 2012

Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same ... more Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same instruction fetching unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is a technique recently proposed to share instructions and execution between threads in a SMT machine. In this paper we propose new ways to explore redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristics that handles function calls better than previous approaches. Second, we demonstrate the existence of substantial regularity in inter-thread memory access patterns. We validate our results on the four data-parallel applications present in the PARSEC benchmark suite. The new thread reconvergence heuristics is, on the average, 82% more efficient than MMT's original reconvergence method. Furthermore, about 69% to 87% of all the memory addresses are either the same for all the threads, or are affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses.

Optimizing reduction computations in a distributed environment

We investigate runtime strategies for data-intensive applications that involve generalized reduct... more We investigate runtime strategies for data-intensive applications that involve generalized reductions on large, distributed datasets. Our set of strategies includes replicated filter state, partitioned filter state, and hybrid options between these two extremes. We evaluate these strategies using emulators of three real applications, different query and output sizes, and a number of configurations. We consider execution in a homogeneous cluster and in a distributed environment where only a subset of nodes host the data. Our results show replicating the filter state scales well and outperforms other schemes, if sufficient memory is available and sufficient computation is involved to offset the cost of global merge step. In other cases, hybrid is usually the best. Moreover, in almost all cases, the performance of the hybrid strategy is quite close to the best strategy. Thus, we believe that hybrid is an attractive approach when the relative performance of different schemes cannot be predicted.

Efficient Execution of Microscopy Image Analysis on CPU, GPU, and MIC Equipped Cluster Systems

2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing, 2014

High performance computing is experiencing a major paradigm shift with the introduction of accele... more High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available a tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver a very high peak performance, making full use of its resources in real-world applications is a complex problem. Most current applications deployed to these machines are still being executed in a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical data flow tasks which are allocated to nodes of a distributed memory machine in coarse-grain, but each of them may be composed of several finergrain tasks which can be allocated to different devices within the node. We propose and implement novel performance aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperforms other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time-HEFT, in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales.

High Level Programming Methodologies for Data Intensive Computations

Lecture Notes in Computer Science, 2000

Querying very large multi-dimensional datasets in ADR

Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM) - Supercomputing '99, 1999

Applications that make use of very large scientific datasets have become an increasingly importan... more Applications that make use of very large scientific datasets have become an increasingly important subset of scientific applications. In these applications, datasets are often multi-dimensional, i.e., data items are associated with points in a multi-dimensional attribute space, and access to data items is described by range queries. The basic processing involves mapping input data items to output data items, and some form of aggregation of all the input data items that project to the each output data item. We have developed an infrastructure, called the Active Data Repository (ADR), that integrates storage, retrieval and processing of multi-dimensional datasets on distributed-memory parallel architectures with multiple disks attached to each node. In this paper we address efficient execution of range queries on distributed memory parallel machines within ADR framework. We present three potential strategies, and evaluate them under different application scenarios and machine configurations. We present experimental results on the scalability and performance of the strategies on a 128-node IBM SP.