Papers by Leandro A. J. Marzulo

Finding a longest common subsequence (LCS) of two sequences is an important problem in several fields, such as biology, medicine and linguistics. The LCS algorithm is usually implemented with dynamic programming techniques, where a score matrix (C) is filled to determine the size of the LCS. Parallelization of this task usually follows the wavefront pattern. When using popular APIs, such as OpenMP, the wavefront is implemented by processing the elements of each diagonal of C in parallel. This approach restricts parallelism and forces threads to wait on a barrier at the end of the computation of each diagonal. In this paper we propose a dataflow parallel version of LCS. Multiple tasks are created, each one responsible for processing a block of C, and task execution is fired when all data dependencies are satisfied. Comparison with an OpenMP implementation showed performance gains of up to 23%, which suggests that dataflow execution can be an interesting alternative for wavefront applications.
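The score-matrix recurrence the abstract refers to can be sketched as follows (a minimal sequential sketch; `lcs_length` and the variable names are illustrative assumptions, and the paper's contribution, firing one task per block of C, is not reproduced here):

```python
def lcs_length(a, b):
    """Fill the score matrix C row by row; C[i][j] is the LCS length
    of the prefixes a[:i] and b[:j]."""
    n, m = len(a), len(b)
    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                # A match extends the LCS found on the diagonal.
                C[i][j] = C[i - 1][j - 1] + 1
            else:
                # Otherwise take the best of dropping one symbol.
                C[i][j] = max(C[i - 1][j], C[i][j - 1])
    return C[n][m]
```

Each cell depends only on its left, upper, and upper-left neighbours, which is why a blocked decomposition of C yields tasks whose dependencies match the dataflow firing rule described above.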

The Dataflow execution model has been shown to be a good way of exploiting Thread-Level Parallelism (TLP), making parallel programming easier. In this model, tasks must be mapped to processing elements (PEs) considering the tradeoff between communication and parallelism. Previous work on scheduling dependency graphs has mostly focused on directed acyclic graphs, which are not suitable for dataflow (loops in the code become cycles in the graph). Thus, we present SCC-Map: a novel static mapping algorithm that considers the importance of cycles during the mapping process. To validate our approach, we ran a set of benchmarks on our dataflow simulator, varying the communication latency, the number of PEs in the system and the placement algorithm. Our results show that the benchmark programs run significantly faster when mapped with SCC-Map. Moreover, we observed that SCC-Map is more effective than the other mapping algorithms when communication latency is higher.
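The core idea the abstract describes, treating a cycle (a loop in the code) as a unit during placement, can be illustrated by condensing the task graph into its strongly connected components before assigning PEs. This is only a hedged sketch: SCC-Map's actual cost model is not reproduced here, the function names are invented, and the round-robin placement merely stands in for the real algorithm.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm. graph: dict node -> list of successors."""
    visited, order = set(), []

    def dfs(u):                      # first pass: record finish order
        visited.add(u)
        for v in graph[u]:
            if v not in visited:
                dfs(v)
        order.append(u)

    for u in graph:
        if u not in visited:
            dfs(u)

    rev = {u: [] for u in graph}     # reverse every edge
    for u in graph:
        for v in graph[u]:
            rev[v].append(u)

    comps, assigned = [], set()

    def rdfs(u, comp):               # second pass on the reversed graph
        assigned.add(u)
        comp.add(u)
        for v in rev[u]:
            if v not in assigned:
                rdfs(v, comp)

    for u in reversed(order):
        if u not in assigned:
            comp = set()
            rdfs(u, comp)
            comps.append(comp)
    return comps

def scc_aware_map(graph, num_pes):
    """Place each whole SCC on one PE (round-robin), so no cycle
    is split across PEs and its internal traffic stays local."""
    placement = {}
    for i, comp in enumerate(strongly_connected_components(graph)):
        for node in comp:
            placement[node] = i % num_pes
    return placement
```

Keeping every task of a cycle on one PE removes the repeated cross-PE communication that a DAG-oriented mapper would pay on each loop iteration.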

Couillard: Parallel programming via coarse-grained Data-flow Compilation
Data-flow is a natural approach to parallelism. However, describing dependencies and control between fine-grained data-flow tasks can be complex and introduce unwanted overheads. TALM (TALM is an Architecture and Language for Multi-threading) introduces a user-defined coarse-grained parallel data-flow model, where programmers identify code blocks, called super-instructions, to be run in parallel and connect them in a data-flow graph. TALM has been implemented as a hybrid Von Neumann/data-flow execution system: the Trebuchet. We have observed that TALM's usefulness largely depends on how programmers specify and connect super-instructions. Thus, we present Couillard, a full compiler that creates, from an annotated C program, a data-flow graph and the C code corresponding to each super-instruction. We show that our toolchain allows one to benefit from data-flow execution and to explore sophisticated parallel programming techniques with little effort. To evaluate our system we executed a set of real applications on a large multi-core machine. Comparison with popular parallel programming methods shows competitive speedups, while providing an easier parallel programming approach. More specifically, for an application that follows the wavefront method, running with big inputs, Trebuchet achieved up to 4.7% speedup over Intel® TBB's novel flow-graph approach and up to 44% over OpenMP.

The WaveScalar architecture is the first dataflow architecture to provide a memory interface that preserves the access semantics required by imperative languages. A prototype of the architecture, under development, would allow moving from simulation-based experimentation to a more realistic scenario, with the processor implemented on an FPGA. However, this prototype is not (financially) accessible to every institution that might also want to build it. This work presents FlowPGA, a reduced version of the architecture targeted at FPGAs with a small number of logic cells. An FPGA with 1.5 million gates was used for the implementation. The correctness of the implementation was evaluated by executing a program that multiplies two positive numbers using successive additions. The results show that the FlowPGA architecture has performance equivalent to WaveScalar. Furthermore, to evaluate the versatility of the design, FlowPGA was modified to use an RNS number system, with an implementation effort of approximately 20 hours.
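The RNS (residue number system) mentioned at the end of the abstract represents an integer by its residues modulo pairwise coprime moduli, which makes addition digit-wise and carry-free, an attractive property for FPGA arithmetic. A minimal sketch follows; the moduli and function names are illustrative assumptions, unrelated to the actual FlowPGA design:

```python
def to_rns(x, moduli):
    """Encode x as its residues modulo each (pairwise coprime) modulus."""
    return [x % m for m in moduli]

def rns_add(u, v, moduli):
    """Digit-wise, carry-free addition: each residue channel is independent."""
    return [(a + b) % m for a, b, m in zip(u, v, moduli)]

def from_rns(residues, moduli):
    """Reconstruct x via the Chinese Remainder Theorem."""
    M = 1
    for m in moduli:
        M *= m
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        # pow(Mi, -1, m) is the modular inverse (Python >= 3.8).
        x += r * Mi * pow(Mi, -1, m)
    return x % M
```

Because the channels never exchange carries, each residue adder can be small and fast in hardware; the costly step is only the final CRT conversion back to binary.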

Arxiv preprint arXiv: …, Jan 1, 2011
Data-flow is a natural approach to parallelism. However, describing dependencies and control between fine-grained data-flow tasks can be complex and introduce unwanted overheads. TALM (TALM is an Architecture and Language for Multi-threading) introduces a user-defined coarse-grained parallel data-flow model, where programmers identify code blocks, called super-instructions, to be run in parallel and connect them in a data-flow graph. TALM has been implemented as a hybrid Von Neumann/data-flow execution system: the Trebuchet. We have observed that TALM's usefulness largely depends on how programmers specify and connect super-instructions. Thus, we present Couillard, a full compiler that creates, from an annotated C program, a data-flow graph and the C code corresponding to each super-instruction. We show that our toolchain allows one to benefit from data-flow execution and to explore sophisticated parallel programming techniques with little effort. To evaluate our system we executed a set of real applications on a large multi-core machine. Comparison with popular parallel programming methods shows competitive speedups, while providing an easier parallel programming approach.

… Architecture and High …, Jan 1, 2008
The WaveScalar is the first dataflow architecture that can efficiently provide the sequential memory semantics required by imperative languages. This work presents a speculative memory disambiguation mechanism for this architecture, the Transactional WaveCache. Our mechanism maintains the execution order of memory operations within blocks of code, called Waves, but adds the ability to speculatively execute, out of order, operations from different Waves. This mechanism is inspired by progress in supporting Transactional Memories. Waves are treated as atomic regions and executed as nested transactions. Waves that have finished the execution of all their memory operations are committed as soon as the previous Waves are also committed. If a hazard is detected in a speculative Wave, all the following Waves (its children) are aborted and re-executed. We evaluated the Transactional WaveCache on a set of benchmarks from SPEC 2000, Mediabench and Mibench (telecomm). Speedups ranging from 1.31 to 2.24 relative to the original WaveScalar were observed when the benchmark does not perform many emulated function calls or access memory very often. Low speedups of 1.1 down to slowdowns of 0.96 were observed when the opposite happens or when memory concurrency was high.

2009 World Congress …, Jan 1, 2009
Speculative Multithreading (SpMT) increases performance by executing multiple threads speculatively to exploit thread-level parallelism. By combining software and hardware approaches, we have improved the capabilities of the previous WaveScalar ISA on the basis of a Transactional Memory system for the WaveCache architecture. Threads are extracted during static compilation and speculatively executed as thread-level transactions, supported by extra hardware components such as the Thread-Context-Table (TCT) and the Thread-Memory-History (TMH). We have evaluated the SpMT WaveCache with 6 real benchmarks from SPEC, Mediabench and Mibench. On the whole, the SpMT WaveCache outperforms a superscalar architecture by 2X to 3X, and great performance gains are achieved over the original WaveCache and the Transactional WaveCache as well.

International Journal of …, Jan 1, 2011
Parallel programming has become mandatory to fully exploit the potential of multi-core CPUs. The dataflow model provides a natural way to exploit parallelism. However, specifying dependencies and control using fine-grained instructions in dataflow programs can be complex and present unwanted overheads. To address this issue, we have designed TALM: a coarse-grained dataflow execution model to be used on top of widespread architectures. We implemented TALM as the Trebuchet virtual machine for multi-cores. The programmer identifies code blocks that can run in parallel and connects them to form a dataflow graph, which allows one to have the benefits of parallel dataflow execution in a Von Neumann machine with small programming effort. We parallelised a set of seven applications using our approach and compared them with OpenMP implementations. Results show that Trebuchet can be competitive with state-of-the-art technology, while providing the benefits of dataflow execution.
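The firing rule that TALM applies to super-instructions, namely that a task runs as soon as all of its inputs are available, so no locks or barriers are needed, can be sketched with a minimal worklist scheduler. This is an illustrative sequential sketch only; Trebuchet itself dispatches ready tasks to parallel processing elements, and all names below are assumptions:

```python
def run_dataflow(tasks):
    """tasks: name -> (fn, [names of input tasks]).
    Repeatedly fires every task whose inputs are all available,
    passing the producers' results as arguments (the dataflow
    firing rule): synchronization comes from data availability."""
    results, pending = {}, dict(tasks)
    while pending:
        ready = [name for name, (fn, deps) in pending.items()
                 if all(d in results for d in deps)]
        if not ready:
            raise ValueError("deadlock: unsatisfiable dependency or cycle")
        for name in ready:
            fn, deps = pending.pop(name)
            results[name] = fn(*(results[d] for d in deps))
    return results

# Diamond-shaped graph: b and c both depend on a; d joins them.
graph = {
    "a": (lambda: 2, []),
    "b": (lambda x: x + 1, ["a"]),
    "c": (lambda x: x * 3, ["a"]),
    "d": (lambda x, y: x + y, ["b", "c"]),
}
# d = (2 + 1) + (2 * 3) = 9
```

In each round, `b` and `c` are both ready once `a` has produced its value, which is exactly the parallelism a real dataflow runtime would exploit by running them on different PEs.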

… Architecture and High …, Jan 1, 2010
Parallel programming has become mandatory to fully exploit the potential of modern CPUs. The data-flow model provides a natural way to exploit parallelism. However, traditional data-flow programming is not trivial: specifying dependencies and control using fine-grained tasks (such as instructions) can be complex and present unwanted overheads. To address this issue we have built a coarse-grained data-flow model with speculative execution support to be used on top of widespread architectures, implemented as a hybrid Von Neumann/data-flow execution system. We argue that speculative execution fits naturally with the data-flow model. Speculative execution frees the programmer to consider only the main dependencies, while still allowing correct data-flow execution of coarse-grained tasks. Moreover, our speculation mechanism does not demand centralised control, which is a key feature for upcoming manycore systems, where scalability has become an important concern. An initial study on an artificial bank server application suggests that there is a wide range of scenarios where speculation can be very effective.

lbd.dcc.ufmg.br
In the DataFlow model, instructions execute as soon as their input operands become available, naturally exposing instruction-level parallelism (ILP). On the other hand, exploiting thread-level parallelism (TLP) has also become a major factor in improving application performance on multicore machines. This work proposes a program execution model, based on DataFlow architectures, that turns ILP into TLP. The model is demonstrated through the implementation of a multi-threaded virtual machine, Trebuchet. The application is compiled to the DataFlow model and its independent instructions (according to the data flow) are executed on distinct Processing Elements (PEs) of Trebuchet. Each PE is mapped to a thread on the host machine. The model allows the definition of instruction blocks of different granularities, whose firing is guided by the data flow and whose execution happens directly on the host machine, reducing interpretation costs. Since synchronization is obtained from the DataFlow model, there is no need to introduce locks or barriers into the programs being parallelized. A set of three reduced benchmarks, compiled into eight threads and executed on an Intel® Core™ i7 920 quad-core processor, allowed us to evaluate: (i) the operation of the model; (ii) the versatility in defining instructions of different granularities (blocks); (iii) a comparison with OpenMP. Speedups of 4.81, 2.4 and 4.03 were achieved over the sequential version, while speedups of 1.11, 1.3 and 1.0 were obtained over OpenMP.