2005, Lecture Notes in Computer Science
The use of parallel computing and distributed information services is spreading quite rapidly, as today's difficult problems in science, engineering and industry far exceed the capabilities of the desktop PC and department file server. The availability of commodity parallel computers, ubiquitous networks, maturing Grid middleware, and portal frameworks is fostering the development and deployment of large scale simulation and data analysis solutions in many areas. This topic highlights recent progress in applications of high performance parallel and Grid computing, with an emphasis on successes, advances, and lessons learned in the development and implementation of novel scientific, engineering and industrial applications. Today's large computational solutions often operate in complex information and computation environments where efficient data access and management can be as important as computational methods and performance, so the technical approaches in this topic span high performance parallel computing, Grid computation and data access, and the associated problem-solving environments that compose and manage advanced solutions. This year the 23 papers submitted to this topic area showed a wide range of activity in high performance parallel and distributed computing, with the largest subset relating to genome sequence analysis. Nine papers were accepted as full papers for the conference, organized into three sessions. One session focuses on high performance genome sequence comparison. The second and third sessions present advanced approaches to scalable simulations, including some non-traditional arenas for high performance computing. Overall, they underscore the close relationship between advances in computer science, computational science, and applied mathematics in developing scalable applications for parallel and distributed systems.
2007
The beginning of the twenty-first century has been characterized by an explosion of biological information. The avalanche of data grows daily and arises as a consequence of advances in the fields o ...
Recent advances in computer science have improved research in several areas, including biology, the field of study of this work. The use of computers to analyse and process biological data is called bioinformatics. However, analysing DNA sequences from complex organisms, which contain thousands of nucleotides, requires greater processing power. Given the growing amount of data to be processed, high-performance computer architectures (HPCA) can be used. In this context, we study the use of such computing capabilities for high-performance processing of DNA sequences, specifically the parallelization of methods used by the bioinformatics research group at UCS (University of Caxias do Sul). For the development of this work we chose computational grids, since this type of platform provides high processing capacity at low cost. Keywords: Parallel Computing, Bioinformatics, Grid Computing, Neural Network, DNA
2006 Second IEEE International Conference on e-Science and Grid Computing (e-Science'06), 2006
Bioinformatics algorithms such as sequence alignment methods based on profile HMMs (Hidden Markov Models) are popular but CPU-intensive. If large amounts of data are processed, a single computer often runs for many hours or even days. High-performance infrastructures such as clusters or computational Grids provide the techniques to speed up the process by distributing the workload to remote nodes and running parts of it in parallel. Biologists often do not have access to such hardware systems. Therefore, we propose a new system using a modern Grid approach to optimise an embarrassingly parallel problem. We achieve speed-ups of at least two orders of magnitude, given that we can use a powerful, world-wide distributed Grid infrastructure. For large-scale problems our method can outperform algorithms designed for mid-size clusters, even considering the additional latencies imposed by Grid infrastructures.
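The chunk-and-distribute pattern behind such embarrassingly parallel speed-ups can be sketched as follows. This is an illustrative stand-in, not the paper's actual system: score_chunk is a hypothetical placeholder for a real profile-HMM scorer such as hmmsearch, and a process pool models the remote grid nodes.

```python
from multiprocessing import Pool

def chunk(seqs, n_chunks):
    """Split a list of sequences into roughly equal, fully independent work units."""
    size = (len(seqs) + n_chunks - 1) // n_chunks
    return [seqs[i:i + size] for i in range(0, len(seqs), size)]

def score_chunk(seqs):
    """Hypothetical stand-in for a profile-HMM scorer; scores each sequence independently."""
    # Placeholder score (sequence length); a real worker would run the HMM forward algorithm.
    return [(s, len(s)) for s in seqs]

def parallel_score(seqs, n_nodes=4):
    """Dispatch independent chunks to workers, then merge the partial results."""
    with Pool(n_nodes) as pool:  # each process models one remote grid node
        parts = pool.map(score_chunk, chunk(seqs, n_nodes))
    return [pair for part in parts for pair in part]

if __name__ == "__main__":
    print(parallel_score(["ACGT", "AC", "ACGTACGT", "GGGC", "TTA"]))
```

Because the chunks share no state, the same structure maps directly onto grid job submission: each chunk becomes one job, and merging is a trivial concatenation.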
European Journal of Human Genetics, 2007
In genetics, with increasing data sizes and more advanced algorithms for mining complex data, a point is reached where increased computational capacity or alternative solutions become unavoidable. Most contemporary methods for linkage analysis are based on the Lander-Green hidden Markov model (HMM), which scales exponentially with the number of pedigree members. In whole genome linkage analysis, genotype simulations become prohibitively time consuming to perform on single computers. We have developed 'Grid-Allegro', a Grid-aware implementation of the Allegro software, with which several thousand genotype simulations can be performed in parallel in a short time. With temporary installations of the Allegro executable and datasets on remote nodes at submission, the need for predefined Grid runtime environments is circumvented. We evaluated the performance, efficiency and scalability of this implementation in a genome scan on Swedish multiplex Alzheimer's disease families. We demonstrate that 'Grid-Allegro' allows for the full exploitation of the features available in Allegro for genome-wide linkage. The implementation of existing bioinformatics applications on Grids (distributed computing) represents a cost-effective alternative for addressing highly resource-demanding and data-intensive bioinformatics tasks, compared to acquiring and setting up clusters of computational hardware in house (parallel computing), a resource not available to most geneticists today.
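Distributing thousands of independent genotype simulations, as Grid-Allegro does, amounts to partitioning replicates across nodes and giving each node a distinct random seed so runs do not repeat each other. A minimal sketch of that partitioning follows; the job layout and seeding scheme are illustrative assumptions, not Grid-Allegro's actual format:

```python
def run_simulations(n_sims, n_nodes):
    """Partition simulation replicates across nodes, each job with a distinct RNG seed."""
    # Spread replicates as evenly as possible: the first (n_sims % n_nodes) nodes get one extra.
    per_node = [n_sims // n_nodes + (1 if i < n_sims % n_nodes else 0)
                for i in range(n_nodes)]
    # One job description per node; distinct seeds keep the replicates statistically independent.
    return [{"node": i, "replicates": c, "seed": 1000 + i}
            for i, c in enumerate(per_node)]

jobs = run_simulations(1000, 3)
```

Each job description would then be shipped to a remote node alongside the temporarily installed executable and dataset, and the per-node results pooled afterwards.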
Lecture Notes in Computer Science, 2007
The potential for Grid technologies in applied bioinformatics is largely unexplored. We have developed a model for solving computationally demanding bioinformatics tasks in distributed Grid environments, designed to ease usability for scientists unfamiliar with Grid computing. With a script-based implementation that uses a strategy of temporary installations of databases and existing executables on remote nodes at submission, we propose a generic solution that does not rely on predefined Grid runtime environments and that can easily be adapted to other bioinformatics tasks suitable for parallelization. This implementation has been successfully applied to whole proteome sequence similarity analyses and to genome-wide genotype simulations, where computation time was reduced from years to weeks. We conclude that computational Grid technology is a useful resource for solving compute-intensive tasks in genetics and proteomics using existing algorithms.
Future Generation Computer Systems, 2009
In past years, researchers from many domains have discovered Grid technology which opens up new possibilities in solving problems that are difficult to handle with traditional cluster computing. With the rapidly increasing number of partially or completely sequenced genomes, computational genome annotation is a particularly challenging task in computational biology. In this paper, we describe how we adapted the gene-finding tool AUGUSTUS to Grid computing in the context of the German MediGRID project. The gridification process starts with providing security requirements and running the application manually using Grid middleware. Afterwards, the application is described as a workflow of successive program executions, which are automatically distributed to appropriate Grid resources by a workflow engine. Finally, we show how a convenient graphical user interface for end users is created by means of a portal framework.
Concurrency and Computation: Practice and Experience, 2005
Improvements in the performance of processors and networks have made it feasible to treat collections of workstations, servers, clusters and supercomputers as integrated computing resources or Grids. However, the very heterogeneity that is the strength of computational and data Grids can also make application development for such an environment extremely difficult. Application development in a Grid computing environment faces significant challenges in the form of problem granularity, latency and bandwidth issues as well as job scheduling. Currently existing Grid technologies limit the development of Grid applications to certain classes, namely, embarrassingly parallel, hierarchical parallelism, work flow and database applications. Of all these classes, embarrassingly parallel applications are the easiest to develop in a Grid computing framework. The work presented here deals with creating a Grid-enabled, high-throughput, standalone version of a bioinformatics application, BLAST, using Globus as the Grid middleware. BLAST is a sequence alignment and search technique that is embarrassingly parallel in nature and thus amenable to adaptation to a Grid environment. A detailed methodology for creating the Grid-enabled application is presented, which can be used as a template for the development of similar applications. The application has been tested on a 'mini-Grid' testbed and the results presented here show that for large problem sizes, a distributed, Grid-enabled version can help in significantly reducing execution times.
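A common first step in such a template is splitting the BLAST query into independent sub-queries, one per grid job, since each sequence can be searched against the database in isolation. A minimal sketch of round-robin FASTA splitting (a generic illustration, not the paper's exact methodology):

```python
def split_fasta(text, n_parts):
    """Split a multi-record FASTA string into n_parts sub-queries for independent BLAST jobs."""
    # Re-attach the '>' delimiter that str.split consumes; drop any leading empty fragment.
    records = [">" + r for r in text.split(">") if r.strip()]
    parts = [[] for _ in range(n_parts)]
    for i, rec in enumerate(records):
        parts[i % n_parts].append(rec)  # round-robin keeps parts balanced by record count
    return ["".join(p) for p in parts if p]

sub_queries = split_fasta(">a\nACGT\n>b\nGG\n>c\nTT\n", 2)
```

Each sub-query file is then submitted as a separate job through the middleware (Globus in the paper's case), and the per-job result files are concatenated at the end, which is exactly the embarrassingly parallel structure the abstract highlights.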
Message from Workshop Chairs Welcome to the first international workshop on High Performance Computational Biology. With the explosion of biological data and the compute-intensive nature of many biological applications, the use of high performance computing will become increasingly important in furthering biological knowledge. The goal of this workshop is to provide a forum for discussion of latest research in developing high-performance computing solutions to problems arising from molecular biology. The technical program was put together with the help of a distinguished program committee consisting of 12 members. Each submission was thoroughly reviewed by three to five program committee members. Manuscripts submitted by the workshop organizers were subjected to a more stringent review. Based on the reviews, ten submissions have been selected for presentation at the workshop and inclusion in the workshop proceedings. We are grateful to the program committee members for submitting timely and thoughtful reviews. We wish to thank all the authors who submitted manuscripts to this workshop, without which this high-quality technical program would not have been possible. We plan to continue this workshop in the forthcoming years and look forward to your continuing support in this endeavor.
DNA multiple sequence alignment is a widespread bioinformatics application that determines the similarity between a new sequence and existing sequences. With the growth of heterogeneous biological data, many research groups have designed tools to analyse them, and integrating these biological data and tools has become one of the major topics in bioinformatics. Grid computing enables high-performance computing by taking advantage of many geographically distributed computers connected by a network. Task granularity can greatly affect the processing time of multiple sequence alignment on the grid. This paper presents a DNA multiple sequence alignment framework using the GDS Grid, and studies the effect of task granularity on processing time and on the computation-to-communication ratio.
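The granularity trade-off can be illustrated with a toy cost model: each dispatched task pays a fixed overhead, so coarser tasks raise the computation-to-communication ratio, at the price of fewer tasks available for load balancing. All constants below are illustrative assumptions, not measurements from the paper:

```python
def comp_to_comm_ratio(seqs_per_task, align_cost=2.0, dispatch_overhead=5.0, bytes_per_seq=1.0):
    """Toy model: computation grows with task size; each task also pays a fixed dispatch cost."""
    compute = seqs_per_task * align_cost
    communicate = dispatch_overhead + seqs_per_task * bytes_per_seq
    return compute / communicate

# Coarser tasks amortize the fixed dispatch overhead, so the ratio rises with granularity:
ratios = {g: round(comp_to_comm_ratio(g), 2) for g in (1, 10, 100)}
```

In this model the ratio approaches align_cost / bytes_per_seq as tasks grow, which is why very fine-grained decompositions waste most of their time on communication.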
Studies in health technology and informatics, 2010
In the life-science and health-care sectors in particular, IT requirements are enormous due to the large and complex systems to be analysed and simulated. Grid infrastructures play a rapidly increasing role in research, diagnostics, and treatment, since they provide the necessary large-scale resources efficiently. Whereas grids were first used for large-scale number crunching of trivially parallelizable problems, parallel high-performance computing is increasingly required. Here we show, for the prime example of molecular dynamics simulations, how the presence of large grid clusters with very fast network interconnects within grid infrastructures now allows efficient parallel high-performance grid computing, and thus combines the benefits of dedicated supercomputing centres and grid infrastructures. The demands of this service class are the highest, since the user group has very heterogeneous requirements: i) two to many thousands of CPUs, ii) different memory architect...
ACM/IEEE SC 2006 Conference (SC'06), 2006
The Basic Local Alignment Search Tool (BLAST) allows bioinformaticists to characterize an unknown sequence by comparing it against a database of known sequences. The similarity between sequences enables biologists to detect evolutionary relationships and infer biological properties of the unknown sequence. mpiBLAST, our parallel BLAST, decreases the search time of a 300 KB query on the current NT database from over two full days to under 10 minutes on a 128-processor cluster and allows larger query files to be compared. Consequently, we propose to compare the largest query available, the entire NT database, against the largest database available, the entire NT database. The result of this comparison will provide critical information to the biology community, including insightful evolutionary, structural, and functional relationships between every sequence and family in the NT database. Preliminary projections indicated that completing the above task in a reasonable length of time required more processors than were available to us at a single site. Hence, we assembled GreenGene, an ad-hoc grid constructed "on the fly" from donated computational, network, and storage resources during last year's SC|05. GreenGene consisted of 3048 processors from machines distributed across the United States. This paper presents a case study of mpiBLAST on GreenGene: specifically, a pre-run characterization of the computation, the hardware and software architectural design, experimental results, and future directions.
Recently, the amount of genome sequence data has been increasing rapidly due to advanced computational techniques and experimental tools in biology. Sequence comparisons are very useful operations for predicting the functions of genes or proteins. However, comparing long sequence data takes a great deal of time, and many research efforts target fast sequence comparison. In this paper, we propose a hybrid grid system, based on the LanLinux system, to improve the performance of sequence comparisons. Compared with conventional approaches, the hybrid grid is easy to construct, maintain, and manage because there is no need to install software on every node. As a real experiment, we constructed an orthologous database for 89 prokaryotes in just one week on the hybrid grid; note that this would require 33 weeks on a single computer.
Concurrency and Computation: Practice and Experience, 2009
Over the past few years, research and development in bioinformatics (e.g. genomic sequence alignment) has grown with each passing day fueling continuing demands for vast computing power to support better performance. This trend usually requires solutions involving parallel computing techniques because cluster computing technology reduces execution times and increases genomic sequence alignment efficiency. One example, mpiBLAST is a parallel version of NCBI BLAST that combines NCBI BLAST with message passing interface (MPI) standards. However, as most laboratories cannot build up powerful cluster computing environments, Grid computing framework concepts have been designed to meet the need. Grid computing environments coordinate the resources of distributed virtual organizations and satisfy the various computational demands of bioinformatics applications. In this paper, we report on designing and implementing a BioGrid framework, called G-BLAST, that performs genomic sequence alignments using Grid computing environments and accessible mpiBLAST applications. G-BLAST is also suitable for cluster computing environments with a server node and several client nodes. G-BLAST is able to select the most appropriate work nodes, dynamically fragment genomic databases, and self-adjust according to performance data. To enhance G-BLAST capability and usability, we also employ a WSRF Grid Service Portal and a Grid Service GUI desk application for general users to submit jobs and host administrators to maintain work nodes.
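Dynamic database fragmentation of the kind G-BLAST performs can be sketched as assigning each work node a fragment sized in proportion to its measured throughput, so faster nodes finish at roughly the same time as slower ones. This is a simplified illustration; G-BLAST's actual policy also self-adjusts from collected performance data:

```python
def fragment_database(db_size, node_speeds):
    """Assign each node a database fragment proportional to its measured throughput."""
    total = sum(node_speeds)
    sizes = [db_size * s // total for s in node_speeds]  # integer shares, may undershoot
    sizes[-1] += db_size - sum(sizes)  # hand the rounding remainder to the last node
    return sizes

# Two equal nodes and one twice as fast: the fast node gets half the database.
fragments = fragment_database(100, [1, 1, 2])
```

Re-running the sizing step between searches with updated speed estimates gives the self-adjusting behaviour the abstract describes, without any change to the search code itself.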
Briefings in Bioinformatics, 2001
This paper surveys the computational strategies followed to parallelise the most used software in the bioinformatics arena. The studied algorithms are computationally expensive and their computational patterns range from regular, such as database-searching applications, to very irregularly structured patterns (phylogenetic trees). Fine-and coarse-grained parallel strategies are discussed for these very diverse sets of applications. This overview outlines computational issues related to parallelism, physical machine models, parallel programming approaches and scheduling strategies for a broad range of computer architectures. In particular, it deals with shared, distributed and shared/distributed memory architectures.
Data-intensive computing, cloud computing, and multicore computing are converging as frontiers to address massive data problems with hybrid programming models and/or runtimes including MapReduce, MPI, and parallel threading on multicore platforms. A major challenge is to utilize these technologies and large-scale computing resources effectively to advance fundamental science discoveries such as those in the life sciences. Recently developed next-generation sequencers have enabled large-scale genome sequencing in areas such as environmental sample sequencing, leading to metagenomic studies of collections of genes. Metagenomic research is just one of the areas that present a significant computational challenge because of the amount and complexity of the data to be processed. Hadoop (Apache Hadoop, 2009), an open-source MapReduce runtime, has been applied to address problems in several areas, such as particle physics and biology. The latter often have the striking all-pairs (or doubly data-parallel) structure highlighted by Thain. We discuss here work on new algorithms in section 2, and new programming models in sections 3 and 4.
2011
The amount of information is growing exponentially as ever-new technologies emerge, and is believed to be always at the limit. In contrast, huge resources are obviously available but underused in the IT sector, similar to e.g. the renewable energy sector. Genome research is one of the booming areas that needs an extreme amount of IT resources to analyse the sequential organization of genomes, i.e. the relations between distant base pairs and regions within sequences, and its connection to the three-dimensional organization of genomes, which is still a largely unresolved problem. The underusage of resources such as those accessible by grid, with its fast turnover rates, is very astonishing considering the barriers to further development posed by the inability to satisfy the need for such resources. The phenomenon is a typical example of the Inverse Tragedy of the Commons, i.e. resources are underexploited in contrast to the unsustainable and destructive ov...
Springer eBooks, 2005
The deployment of biomedical applications in a grid environment started about three years ago in several European projects and national initiatives. These applications have demonstrated that the grid paradigm is relevant to the needs of the biomedical community. They have also highlighted that this community has very specific requirements on middleware and needs further structuring into large collaborations in order to participate in the deployment of grid infrastructures in the coming years. In this paper, we propose several areas where grid technology can today improve research and healthcare. A crucial issue is to maximize the cross-fertilization among projects, in the perspective of an environment where data of medical interest can be stored and made easily available to the different actors of healthcare: the physicians, the healthcare centres and administrations, and of course the citizens.
vecpar.fe.up.pt
Bioinformatics is an area which involves the execution of many computing-intensive applications. Due to the development of new complex techniques and the increasing size of databases, this area demands the execution of experiments which exceed the resources of most research groups. The use case presented in this work is one of the most representative applications in the bioinformatics field: computing the alignment of genomic and proteomic samples with respect to annotated databases through BLAST, developed at the NCBI in the USA. Performing the homology search of one sequence with BLAST, even against large databases such as GenBank, takes only a few minutes. However, processing millions of sequences with a sequential approach requires years of CPU computation. Nevertheless, the computing time of this massively parallel application can be drastically reduced by means of e-Science infrastructures, such as EGEE, EELA and some NGIs, by splitting the work into thousands of loosely coupled tasks. In order to improve the performance of these experiments, a key aspect has proven to be the development of sophisticated automatisms which predict the jobs' elapsed time. Thus, this article focuses on describing all the details regarding the characterization of these experiments to improve the performance results.
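A minimal version of such an elapsed-time predictor is an ordinary least-squares fit of runtime against the number of input sequences, calibrated on a few past jobs. The calibration numbers below are invented for illustration and stand in for whatever job history the paper's automatisms actually use:

```python
def fit_linear(xs, ys):
    """Least-squares fit of elapsed time vs. number of sequences (toy predictor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx  # (minutes per sequence, fixed start-up cost)

# Calibration runs: (number of sequences, elapsed minutes); illustrative numbers only.
runs = [(100, 12.0), (200, 22.0), (400, 42.0)]
slope, intercept = fit_linear(*map(list, zip(*runs)))
predicted = slope * 300 + intercept  # estimated minutes for a 300-sequence job
```

An accurate elapsed-time estimate lets the submission system size job chunks so that every task fits comfortably inside the queue limits of EGEE-style infrastructures, which is the performance lever the article characterizes.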