Murray Patterson

Università di Milano-Bicocca

Papers by Murray Patterson

gpps: An ILP-based approach for inferring cancer progression with mutation losses from single cell data

Motivation: In recent years, the well-known Infinite Sites Assumption (ISA) has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progression where mutations are accumulated through histories. However, some recent studies leveraging Single Cell Sequencing (SCS) techniques have shown evidence of mutation losses in several tumor samples [19], making the inference problem harder. Results: We present a new tool, gpps, that reconstructs a tumor phylogeny from single cell data, allowing each mutation to be lost at most a fixed number of times. Availability: The General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gppf.

Inferring Cancer Progression from Single-cell Sequencing while Allowing Mutation Losses

Motivation: In recent years, the well-known Infinite Sites Assumption (ISA) has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progressions seen as an accumulation of mutations. However, recent studies (Kuipers et al., 2017) leveraging Single-cell Sequencing (SCS) techniques have shown evidence of the widespread recurrence and, especially, loss of mutations in several tumor samples. Still, established methods that can infer phylogenies with mutation losses are lacking. Results: We present SASC (Simulated Annealing Single-Cell inference), a new and robust approach based on simulated annealing for the inference of cancer progression from SCS data. More precisely, we introduce a simple extension of the model of evolution where mutations are only accumulated, by also allowing a limited number of back mutations in the evolutionary history of the tumor: the Dollo-k model. We demonstrate that SA...
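
The Dollo-k idea described above can be made concrete with a small checker. This is only an illustrative sketch and not the SASC implementation: it assumes the common reading of Dollo-k, in which every mutation is gained exactly once and lost at most k times, and each loss happens below the gain in the tree. The tree encoding (`parent`, `events`) and the function name `is_dollo_k` are hypothetical choices made for this example.

```python
def is_dollo_k(parent, events, k):
    """Check a labelled tree against the (assumed) Dollo-k constraints.

    parent: dict mapping each node to its parent (the root maps to None).
    events: dict mapping a node to (kind, mutation), the event on the edge
            entering that node; kind is "+" for a gain and "-" for a loss.
    k:      maximum number of losses allowed per mutation.
    """
    gains = {}   # mutation -> node where it is gained
    losses = {}  # mutation -> list of nodes where it is lost
    for node, (kind, mutation) in events.items():
        if kind == "+":
            if mutation in gains:
                return False  # gained more than once
            gains[mutation] = node
        else:
            losses.setdefault(mutation, []).append(node)

    for mutation, nodes in losses.items():
        if mutation not in gains or len(nodes) > k:
            return False  # lost without a gain, or lost too often
        for node in nodes:
            # Walk towards the root until we meet the gain node.
            current = parent[node]
            while current is not None and current != gains[mutation]:
                current = parent[current]
            if current is None:
                return False  # the loss is not below the gain
    return True


# Toy tree: r -> a (+m1), a -> b (-m1), a -> c (+m2)
parent = {"r": None, "a": "r", "b": "a", "c": "a"}
events = {"a": ("+", "m1"), "b": ("-", "m1"), "c": ("+", "m2")}
print(is_dollo_k(parent, events, 1))
print(is_dollo_k(parent, events, 0))
```

With k = 0 the check reduces to the accumulation-only model, since no loss is permitted at all.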

HapCHAT: Adaptive haplotype assembly for efficiently leveraging high coverage in long reads

Motivation: Haplotype assembly is the process of reconstructing the haplotypes of an individual from sequencing reads. Computational methods for this problem have been shown to achieve high accuracy on long reads, which are becoming cheaper to produce and more widely available. Larger amounts of data, usually originating from increased coverage, are highly beneficial for improving the quality of the detection of the genetic variations that are intrinsic to the diploid nature of the human genome. However, the high accuracy of such methods comes at a cost of computational resources. The increased error rates that affect all current long-read technologies require even higher coverage, making the analysis of such data the key computational task to be solved in order to improve the accuracy of the predictions made by haplotype assembly methods. Results: We propose a new computational approach for assembling haplotypes that is specifically designed to cope with a different error rate at each v...

PWHATSHAP: efficient haplotyping for future generation sequencing

BMC Bioinformatics, 2016

Background: Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions typically have an exponential computational complexity. WHATSHAP is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest when considering sequencing technology's current trends that are producing longer fragments. Results: Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered PWHATSHAP, a parallel, high-performance version of WHATSHAP. PWHATSHAP is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WHATSHAP, PWHATSHAP exhibits the same complexity, exploring a number of possible solutions which is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WHATSHAP, which increases with coverage.
Conclusions: Due to its structure and management of the large datasets, the parallelisation of WHATSHAP posed demanding technical challenges, which have been addressed by exploiting a high-level parallel programming framework. The result, PWHATSHAP, is a freely available toolkit that improves the efficiency of the analysis of genomics information.

WhatsHap: fast and accurate read-based phasing

Read-based phasing allows the haplotype structure of a sample to be reconstructed purely from sequencing reads. While phasing is a required step for answering questions about population genetics and compound heterozygosity, and for aiding clinical decision making, there has been a lack of accurate, usable and standards-based software. WhatsHap is a production-ready tool for highly accurate read-based phasing. It was designed from the beginning to leverage third-generation sequencing technologies, whose long reads can span many variants and are therefore ideal for phasing. WhatsHap also works well with second-generation data, is easy to use and will phase not only SNVs, but also indels and other variants. It is unique in its ability to combine read-based with genetic phasing, which further improves accuracy if multiple related samples are provided.

The gapped consecutive-ones property

F1000posters, Nov 8, 2012

Hardness Results for the Gapped Consecutive-Ones Property

Eprint Arxiv 0912 0309, Dec 2, 2009

Motivated by problems of comparative genomics and paleogenomics, in [Chauve et al., 2009], the authors introduced the Gapped Consecutive-Ones Property Problem (k,δ)-C1P: given a binary matrix M and two integers k and δ, can the columns of M be permuted such that each row contains at most k blocks of ones and no two consecutive blocks of ones are separated by a gap of more than δ zeros? The classical C1P problem, which is known to be polynomial, is equivalent to the (1,0)-C1P problem. They showed that the (2,δ)-C1P Problem is NP-complete for all δ ≥ 2 and that the (3,1)-C1P problem is NP-complete. They also conjectured that the (k,δ)-C1P Problem is NP-complete for k ≥ 2, δ ≥ 1 and (k,δ) ≠ (2,1). Here, we prove that this conjecture is true. The only remaining case is the (2,1)-C1P Problem, which could be polynomial-time solvable.
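
To make the (k,δ)-C1P definition above concrete, here is a minimal brute-force sketch. It is not an algorithm from the paper: it simply tries every column permutation, so it is only usable on toy matrices. The function names `row_is_gapped` and `find_gapped_order` are hypothetical.

```python
from itertools import permutations


def row_is_gapped(row, k, delta):
    """True if `row` has at most k blocks of 1s and every gap between
    two consecutive blocks contains at most delta 0s."""
    ones = [i for i, value in enumerate(row) if value == 1]
    if not ones:
        return True
    blocks = 1
    for prev, cur in zip(ones, ones[1:]):
        gap = cur - prev - 1  # number of 0s between two neighbouring 1s
        if gap > 0:
            blocks += 1
            if gap > delta:
                return False
    return blocks <= k


def find_gapped_order(matrix, k, delta):
    """Return a column order under which every row satisfies the
    (k, delta) condition, or None if no such order exists."""
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if all(row_is_gapped([row[j] for j in order], k, delta) for row in matrix):
            return order
    return None


# Every pair of columns must be adjacent, which is impossible on a line,
# so this matrix lacks the classical C1P, i.e. the (1,0) case.
triangle = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
print(find_gapped_order(triangle, 1, 0))
print(find_gapped_order(triangle, 2, 1))
```

Setting k = 1 and δ = 0 recovers the classical C1P test, which matches the equivalence stated in the abstract.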

High-Performance Haplotype Assembly

Lecture Notes in Computer Science, 2015

The problem of Haplotype Assembly is an essential step in human genome analysis. It is typically formalised as the Minimum Error Correction (MEC) problem, which is NP-hard. MEC has been approached using heuristics, integer linear programming, and fixed-parameter tractability (FPT), including approaches whose runtime is exponential in the length of the DNA fragments obtained by the sequencing process. Technological improvements are currently increasing fragment length, which drastically elevates computational costs for such methods. We present pWhatsHap, a multi-core parallelisation of WhatsHap, a recent FPT optimal approach to MEC. WhatsHap moves complexity from fragment length to fragment overlap and is hence of particular interest when considering sequencing technology's current trends. pWhatsHap further improves the efficiency in solving the MEC problem, as shown by experiments performed on datasets with high coverage.
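
For readers unfamiliar with MEC, the following toy sketch shows what is being minimised. It is not the WhatsHap or pWhatsHap algorithm: it enumerates every bipartition of the fragments, so it is exponential in the number of fragments and only suitable for tiny inputs. The fragment encoding (strings over "0", "1" and "-" for uncovered positions) and the function names are assumptions made for this example.

```python
from itertools import product


def column_cost(fragments, col):
    """Fewest corrections needed so that all fragments covering `col` agree."""
    zeros = sum(1 for f in fragments if f[col] == "0")
    ones = sum(1 for f in fragments if f[col] == "1")
    return min(zeros, ones)


def mec_brute_force(fragments):
    """Return (minimum number of corrections, assignment of fragments to
    haplotype 0 or 1) by trying every bipartition."""
    n = len(fragments)
    width = len(fragments[0]) if fragments else 0
    best_cost, best_assignment = None, None
    for assignment in product((0, 1), repeat=n):
        if n and assignment[0] == 1:
            continue  # swapping the two haplotypes gives the same partition
        cost = 0
        for side in (0, 1):
            group = [f for f, a in zip(fragments, assignment) if a == side]
            cost += sum(column_cost(group, c) for c in range(width))
        if best_cost is None or cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return best_cost, best_assignment


# Four fragments over four SNP positions; "-" marks an uncovered position.
reads = ["0011", "0010", "1100", "110-"]
print(mec_brute_force(reads))
```

On this input the best partition groups the first two reads against the last two and needs a single correction, at the last position of the first group.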

Hypergraph Covering Problems Motivated by Genome Assembly Questions

Lecture Notes in Computer Science, 2013

The Consecutive-Ones Property (C1P) is a classical concept in discrete mathematics that has been used in several genomics applications, from physical mapping of contemporary genomes to the assembly of ancient genomes. A common issue in genome assembly concerns repeats, genomic sequences that appear in several locations of a genome. Handling repeats leads to a variant of the C1P, the C1P with multiplicity (mC1P), that can also be seen as the problem of covering edges of hypergraphs by linear and circular walks. In the present work, we describe variants of the mC1P that address specific issues of genome assembly, and polynomial time or fixed-parameter algorithms to solve them.

WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads

Journal of computational biology : a journal of computational molecular cell biology, Jan 6, 2015

The human genome is diploid, which requires assigning heterozygous single nucleotide polymorphisms (SNPs) to the two copies of the genome. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which are oblivious to direct read information, constitute the state of the art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing. While future-technology sequencing reads will contain a sufficient number of SNPs per read for phasing, they are also likely to suffer from higher sequencing error rates. Currently, no haplotype assembly approaches exist that allow for taking both increasing read length and sequencing error information into account. Here, we suggest WhatsHap, the first approach that yields provably optimal solutions to the wei...

Linearization of ancestral multichromosomal genomes

BMC bioinformatics, 2012

Recovering the structure of ancestral genomes can be formalized in terms of properties of binary matrices such as the Consecutive-Ones Property (C1P). The Linearization Problem asks to extract, from a given binary matrix, a maximum weight subset of rows that satisfies such a property. This problem is in general intractable, and in particular if the ancestral genome is expected to contain only linear chromosomes or a unique circular chromosome. In the present work, we consider a relaxation of this problem, which allows ancestral genomes that can contain several chromosomes, each either linear or circular. We show that, when restricted to binary matrices of degree two, which correspond to adjacencies, the genomic characters used in most ancestral genome reconstruction methods, this relaxed version of the Linearization Problem is polynomially solvable using a reduction to a matching problem. This result holds in the more general case where columns have bounded multiplicity, which model...

Towards a Characterisation of the Generalised Cladistic Character Compatibility Problem for Non-branching Character Trees

Lecture Notes in Computer Science, 2011

In [3,2], the authors introduced the Generalised Cladistic Character Compatibility (GCCC) Problem, which generalises a variant of the Perfect Phylogeny Problem in order to better model experiments in molecular biology showing that genes contain information for currently unexpressed traits, e.g., having teeth. In [3], the authors show that this problem is NP-complete and give some special cases which are polynomial. The authors also pose an open case of this problem where each character has only one generalised state, and each character tree is non-branching, a case that models these experiments particularly closely, which we call the Benham-Kannan-Warnow (BKW) Case. In [18], the authors study the complexity of a set of cases of the GCCC Problem for non-branching character trees when the phylogeny tree that is a solution to this compatibility problem is restricted to be either a tree, path or single-branch tree. In particular, they show that if the phylogeny tree must have only one branch, the BKW Case is polynomial-time solvable, by giving a novel algorithm based on PQ-trees used for the consecutive-ones property of binary matrices. In this work, we characterise the complexity of the remainder of the cases considered in [18] for the single-branch tree and the path. We show that some of the open cases are polynomial-time solvable, one by using an algorithm based on directed paths in the character trees similar to the algorithm in [2], and the second by showing that this case can be reduced to a polynomial-time solvable case of [18]. On the other hand, we show that other open cases are NP-complete using an interesting variation of the ordering problems we study here. In particular, we show that the BKW Case for the path is NP-complete.

On the Generalised Character Compatibility Problem for Non-branching Character Trees

Lecture Notes in Computer Science, 2009

In [3], the authors introduced the Generalised Character Compatibility Problem as a generalisation of the Perfect Phylogeny Problem for a set of species. This generalised problem takes into account the fact that while a species may not be expressing a certain trait, e.g., having teeth, its DNA may contain data for this trait in a non-functional region. The authors showed that the Generalised Character Compatibility Problem is NP-complete for an instance of the problem involving five states, where the characters' state transition trees are branching. They also presented a class of instances of the problem that is polynomial-time solvable. The authors posed an open problem about the complexity of this problem when no branching is allowed in the character trees. They addressed this question in [2], where they showed that the problem is NP-complete for instances in which each character tree is 0 → 1 → 2 (no branching) and only the states {1}, {0, 2}, {0, 1, 2} are allowed. This, however, does not provide an answer to the exact question posed in [3], which allows only one type of generalised state: {0, 2}, called here the Benham-Kannan-Warnow (BKW) Case. In this paper, we study the complexity of various versions of this problem with non-branching character trees, depending on the set of states allowed, and depending on the restriction on the phylogeny tree: any tree, path or single-branch tree. In particular, we show that if the phylogeny tree is required to have only one branch: (a) the problem still remains NP-complete (for instance with states {1}, {0, 2}, {0, 1, 2}), and (b) the problem is polynomial-time solvable in the BKW Case (with states {0}, {1}, {2}, {0, 2}). We show the second result by unveiling a surprising connection to the Consecutive-Ones Property (C1P) Problem, used, for instance, in DNA physical mapping, interval graph recognition and data retrieval.

Variants of the Consecutive-Ones Property motivated by the reconstruction of ancestral species

Tractability Results for the Consecutive-Ones Property with Multiplicity

Lecture Notes in Computer Science, 2011

A binary matrix has the Consecutive-Ones Property (C1P) if its columns can be ordered in such a way that all 1's in each row are consecutive. We consider here a variant of the C1P where columns can appear multiple times in the ordering. Although the general problem of deciding the C1P with multiplicity is NP-complete, we present here a case of interest in comparative genomics that is tractable.

WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads

Research in Computational Molecular Biology, 2014

Complexity of Finding Non-Planar Rectilinear Drawings of Graphs

Lecture Notes in Computer Science, 2011

We study the complexity of the problem of finding nonplanar rectilinear drawings of graphs. This problem is known to be NP-complete. We consider natural restrictions of this problem where constraints are placed on the possible orientations of edges. In particular, we show that if each edge has a prescribed direction "left", "right", "down" or "up", the problem of finding a rectilinear drawing is polynomial, while finding such a drawing with the minimum area is NP-complete. When assigned directions are "horizontal" or "vertical" or a cyclic order of the edges at each vertex is specified, the problem is NP-complete. We show that these two NP-complete cases are fixed-parameter tractable in the number of vertices of degree 3 or 4.

On the Gapped Consecutive-Ones Property

Electronic Notes in Discrete Mathematics, 2009

Motivated by problems of comparative genomics and paleogenomics, we introduce the Gapped Consecutive-Ones Property Problem (k,δ)-C1P: given a binary matrix M and two integers k and δ, can the columns of M be permuted such that each row contains at most k sequences of 1's and no two consecutive sequences of 1's are separated by a gap of more than δ 0's? The classical C1P problem, which is known to be polynomial, is equivalent to the (1,0)-C1P Problem. We show that the (2,δ)-C1P Problem is NP-complete for δ ≥ 2. We conjecture that the (k,δ)-C1P Problem is NP-complete for k ≥ 2, δ ≥ 1, (k,δ) ≠ (2,1). We also show that the (k,δ)-C1P problem can be reduced to a graph bandwidth problem parameterized by a function of k, δ and of the maximum number s of 1's in a row of M, and hence is polytime solvable if all three parameters are constant.

Hardness results on the gapped consecutive-ones property problem

Discrete Applied Mathematics, 2012

Motivated by problems of comparative genomics and paleogenomics, in [6] the authors introduced the Gapped Consecutive-Ones Property Problem (k, δ)-C1P: given a binary matrix M and two integers k and δ, can the columns of M be permuted such that each row contains at most k blocks of ones and no two consecutive blocks of ones are separated by a gap of more than δ zeros? The classical C1P problem, which is known to be polynomial, is equivalent to the (1, 0)-C1P problem. They showed that the (2, δ)-C1P Problem is NP-complete for all δ ≥ 2 and that the (3, 1)-C1P problem is NP-complete. They also conjectured that the (k, δ)-C1P Problem is NP-complete for k ≥ 2, δ ≥ 1 and (k, δ) ≠ (2, 1). Here, we prove that this conjecture is true. The only remaining case is the (2, 1)-C1P Problem, which could be polynomial-time solvable.

Mapping proteins in the presence of paralogs using units of coevolution

BMC Bioinformatics, 2013

Background: We study the problem of mapping proteins between two protein families in the presence of paralogs. This problem occurs as a difficult subproblem in coevolution-based computational approaches for protein-protein interaction prediction. Results: Similar to prior approaches, our method is based on the idea that coevolution implies equal rates of sequence evolution among the interacting proteins, and we provide a first attempt to quantify this notion in a formal statistical manner. We call the units that are central to this quantification scheme the units of coevolution. A unit consists of two mapped protein pairs and its score quantifies the coevolution of the pairs. This quantification allows us to provide a maximum likelihood formulation of the paralog mapping problem and to cast it into a binary quadratic programming formulation. Conclusion: CUPID, our software tool based on a Lagrangian relaxation of this formulation, makes it, for the first time, possible to compute state-of-the-art quality pairings in a few minutes of runtime. In summary, we suggest a novel alternative to the earlier available approaches, which is statistically sound and computationally feasible.

gpps: An ILP-based approach for inferring cancer progression with mutation losses from single cell data

MotivationIn recent years, the well-known Infinite Sites Assumption (ISA) has been a fundamental ... more MotivationIn recent years, the well-known Infinite Sites Assumption (ISA) has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progression where mutations are accumulated through histories. However, some recent studies leveraging Single Cell Sequencing (SCS) techniques have shown evidence of mutation losses in several tumor samples [19], making the inference problem harder.ResultsWe present a new tool, gpps, that reconstructs a tumor phylogeny from single cell data, allowing each mutation to be lost at most a fixed number of times.AvailabilityThe General Parsimony Phylogeny from Single cell (gpps) tool is open source and available at https://github.com/AlgoLab/gppf.

Inferring Cancer Progression from Single-cell Sequencing while Allowing Mutation Losses

Motivation: In recent years, the well-known Infinite Sites Assumption (ISA) has been a fundamenta... more Motivation: In recent years, the well-known Infinite Sites Assumption (ISA) has been a fundamental feature of computational methods devised for reconstructing tumor phylogenies and inferring cancer progressions seen as an accumulation of mutations. However, recent studies (Kuipers et al., 2017) leveraging Single-cell Sequencing (SCS) techniques have shown evidence of the widespread recurrence and, especially, loss of mutations in several tumor samples. Still, established methods that can infer phylogenies with mutation losses are however lacking. Results: We present the SASC (Simulated Annealing Single-Cell inference) tool which is a new and robust approach based on simulated annealing for the inference of cancer progression from SCS data. More precisely, we introduce a simple extension of the model of evolution where mutations are only accumulated, by allowing also a limited amount of back mutations in the evolutionary history of the tumor: the Dollo-k model. We demonstrate that SA...

HapCHAT: Adaptive haplotype assembly for efficiently leveraging high coverage in long reads

Motivation: Haplotype assembly is the process of reconstructing the haplotypes of an individual f... more Motivation: Haplotype assembly is the process of reconstructing the haplotypes of an individual from sequencing reads. Computational methods for this problem have shown to achieve high accuracy on long reads, which are becoming cheaper to produce and more widely available. Larger amounts of data, usually originating from increased coverage, are highly beneficial for improving the quality of the detection of the genetic variations that are intrinsic to the diploid nature of the human genome. However, the high accuracy of such methods comes at a cost of computational resources. The increased error rates that affect all current long-read technologies require even higher coverage: making the analysis of such data the key computational task to be solved in order to improve the accuracy of the predictions made by haplotype assembly methods. Results: We propose a new computational approach for assembling haplotypes that is specifically designed to cope with a different error rate at each v...

PWHATSHAP: efficient haplotyping for future generation sequencing

BMC Bioinformatics, 2016

Background: Haplotype phasing is an important problem in the analysis of genomics information. Gi... more Background: Haplotype phasing is an important problem in the analysis of genomics information. Given a set of DNA fragments of an individual, it consists of determining which one of the possible alleles (alternative forms of a gene) each fragment comes from. Haplotype information is relevant to gene regulation, epigenetics, genome-wide association studies, evolutionary and population studies, and the study of mutations. Haplotyping is currently addressed as an optimisation problem aiming at solutions that minimise, for instance, error correction costs, where costs are a measure of the confidence in the accuracy of the information acquired from DNA sequencing. Solutions have typically an exponential computational complexity. WHATSHAP is a recent optimal approach which moves computational complexity from DNA fragment length to fragment overlap, i.e., coverage, and is hence of particular interest when considering sequencing technology's current trends that are producing longer fragments. Results: Given the potential relevance of efficient haplotyping in several analysis pipelines, we have designed and engineered PWHATSHAP, a parallel, high-performance version of WHATSHAP. PWHATSHAP is embedded in a toolkit developed in Python and supports genomics datasets in standard file formats. Building on WHATSHAP, PWHATSHAP exhibits the same complexity exploring a number of possible solutions which is exponential in the coverage of the dataset. The parallel implementation on multi-core architectures allows for a relevant reduction of the execution time for haplotyping, while the provided results enjoy the same high accuracy as that provided by WHATSHAP, which increases with coverage. 
Conclusions: Due to its structure and management of the large datasets, the parallelisation of WHATSHAP posed demanding technical challenges, which have been addressed exploiting a high-level parallel programming framework. The result, PWHATSHAP, is a freely available toolkit that improves the efficiency of the analysis of genomics information.

WhatsHap: fast and accurate read-based phasing

Read-based phasing allows to reconstruct the haplotype structure of a sample purely from sequenci... more Read-based phasing allows to reconstruct the haplotype structure of a sample purely from sequencing reads. While phasing is a required step for answering questions about population genetics, compound heterozygosity, and to aid in clinical decision making, there has been a lack of an accurate, usable and standards-based software. WhatsHap is a production-ready tool for highly accurate read-based phasing. It was designed from the beginning to leverage third-generation sequencing technologies, whose long reads can span many variants and are therefore ideal for phasing. WhatsHap works also well with second-generation data, is easy to use and will phase not only SNVs, but also indels and other variants. It is unique in its ability to combine read-based with genetic phasing, allowing to further improve accuracy if multiple related samples are provided.

The gapped consecutive-ones property

F1000posters, Nov 8, 2012

Hardness Results for the Gapped Consecutive-Ones Property

Eprint Arxiv 0912 0309, Dec 2, 2009

Motivated by problems of comparative genomics and paleogenomics, in [Chauve et al., 2009], the au... more Motivated by problems of comparative genomics and paleogenomics, in [Chauve et al., 2009], the authors introduced the Gapped Consecutive-Ones Property Problem (k,delta)-C1P: given a binary matrix M and two integers k and delta, can the columns of M be permuted such that each row contains at most k blocks of ones and no two consecutive blocks of ones are separated by a gap of more than delta zeros. The classical C1P problem, which is known to be polynomial is equivalent to the (1,0)-C1P problem. They showed that the (2,delta)-C1P Problem is NP-complete for all delta >= 2 and that the (3,1)-C1P problem is NP-complete. They also conjectured that the (k,delta)-C1P Problem is NP-complete for k >= 2, delta >= 1 and (k,delta) =/= (2,1). Here, we prove that this conjecture is true. The only remaining case is the (2,1)-C1P Problem, which could be polynomial-time solvable.

High-Performance Haplotype Assembly

Lecture Notes in Computer Science, 2015

The problem of Haplotype Assembly is an essential step in human genome analysis. It is typically ... more The problem of Haplotype Assembly is an essential step in human genome analysis. It is typically formalised as the Minimum Error Correction (MEC) problem which is NP-hard. MEC has been approached using heuristics, integer linear programming, and fixedparameter tractability (FPT), including approaches whose runtime is exponential in the length of the DNA fragments obtained by the sequencing process. Technological improvements are currently increasing fragment length, which drastically elevates computational costs for such methods. We present pWhatsHap, a multi-core parallelisation of WhatsHap, a recent FPT optimal approach to MEC. WhatsHap moves complexity from fragment length to fragment overlap and is hence of particular interest when considering sequencing technology's current trends. pWhat-sHap further improves the efficiency in solving the MEC problem, as shown by experiments performed on datasets with high coverage.

Hypergraph Covering Problems Motivated by Genome Assembly Questions

Lecture Notes in Computer Science, 2013

The Consecutive-Ones Property (C1P) is a classical concept in discrete mathematics that has been ... more The Consecutive-Ones Property (C1P) is a classical concept in discrete mathematics that has been used in several genomics applications, from physical mapping of contemporary genomes to the assembly of ancient genomes. A common issue in genome assembly concerns repeats, genomic sequences that appear in several locations of a genome. Handling repeats leads to a variant of the C1P, the C1P with multiplicity (mC1P), that can also be seen as the problem of covering edges of hypergraphs by linear and circular walks. In the present work, we describe variants of the mC1P that address specific issues of genome assembly, and polynomial time or fixed-parameter algorithms to solve them.

WhatsHap: Weighted Haplotype Assembly for Future-Generation Sequencing Reads

Journal of computational biology : a journal of computational molecular cell biology, Jan 6, 2015

The human genome is diploid, which requires assigning heterozygous single nucleotide polymorphisms (SNPs) to the two copies of the genome. The resulting haplotypes, lists of SNPs belonging to each copy, are crucial for downstream analyses in population genetics. Currently, statistical approaches, which are oblivious to direct read information, constitute the state-of-the-art. Haplotype assembly, which addresses phasing directly from sequencing reads, suffers from the fact that sequencing reads of the current generation are too short to serve the purposes of genome-wide phasing. While future-technology sequencing reads will contain sufficient amounts of SNPs per read for phasing, they are also likely to suffer from higher sequencing error rates. Currently, no haplotype assembly approaches exist that allow for taking both increasing read length and sequencing error information into account. Here, we suggest WhatsHap, the first approach that yields provably optimal solutions to the wei...

Linearization of ancestral multichromosomal genomes

BMC bioinformatics, 2012

Recovering the structure of ancestral genomes can be formalized in terms of properties of binary matrices such as the Consecutive-Ones Property (C1P). The Linearization Problem asks to extract, from a given binary matrix, a maximum weight subset of rows that satisfies such a property. This problem is in general intractable, and in particular if the ancestral genome is expected to contain only linear chromosomes or a unique circular chromosome. In the present work, we consider a relaxation of this problem, which allows ancestral genomes that can contain several chromosomes, each either linear or circular. We show that, when restricted to binary matrices of degree two, which correspond to adjacencies, the genomic characters used in most ancestral genome reconstruction methods, this relaxed version of the Linearization Problem is polynomially solvable using a reduction to a matching problem. This result holds in the more general case where columns have bounded multiplicity, which model...

Towards a Characterisation of the Generalised Cladistic Character Compatibility Problem for Non-branching Character Trees

Lecture Notes in Computer Science, 2011

In [3,2], the authors introduced the Generalised Cladistic Character Compatibility (GCCC) Problem which generalises a variant of the Perfect Phylogeny Problem in order to better model experiments in molecular biology showing that genes contain information for currently unexpressed traits, e.g., having teeth. In [3], the authors show that this problem is NP-complete and give some special cases which are polynomial. The authors also pose an open case of this problem where each character has only one generalised state, and each character tree is non-branching, a case that models these experiments particularly closely, which we call the Benham-Kannan-Warnow (BKW) Case. In [18], the authors study the complexity of a set of cases of the GCCC Problem for non-branching character trees when the phylogeny tree that is a solution to this compatibility problem is restricted to be either a tree, path or single-branch tree. In particular, they show that if the phylogeny tree must have only one branch, the BKW Case is polynomial-time solvable, by giving a novel algorithm based on PQ-trees used for the consecutive-ones property of binary matrices. In this work, we characterise the complexity of the remainder of the cases considered in [18] for the single-branch tree and the path. We show that some of the open cases are polynomial-time solvable, one by using an algorithm based on directed paths in the character trees similar to the algorithm in [2], and the second by showing that this case can be reduced to a polynomial-time solvable case of [18]. On the other hand, we show that other open cases are NP-complete using an interesting variation of the ordering problems we study here. In particular, we show that the BKW Case for the path is NP-complete.

On the Generalised Character Compatibility Problem for Non-branching Character Trees

Lecture Notes in Computer Science, 2009

In [3], the authors introduced the Generalised Character Compatibility Problem as a generalisation of the Perfect Phylogeny Problem for a set of species. This generalised problem takes into account the fact that while a species may not be expressing a certain trait, e.g., having teeth, its DNA may contain data for this trait in a non-functional region. The authors showed that the Generalised Character Compatibility Problem is NP-complete for an instance of the problem involving five states, where the characters' state transition trees are branching. They also presented a class of instances of the problem that is polynomial-time solvable. The authors posed an open problem about the complexity of this problem when no branching is allowed in the character trees. They addressed this question in [2], where they showed that the problem is NP-complete for instances in which each character tree is 0 → 1 → 2 (no branching) and only the states {1}, {0, 2}, {0, 1, 2} are allowed. This, however, does not provide an answer to the exact question posed in [3], which allows only one type of generalised state: {0, 2}, called here the Benham-Kannan-Warnow (BKW) Case. In this paper, we study the complexity of various versions of this problem with non-branching character trees, depending on the set of states allowed, and depending on the restriction on the phylogeny tree: any tree, path or single-branch tree. In particular, we show that if the phylogeny tree is required to have only one branch: (a) the problem still remains NP-complete (for instance with states {1}, {0, 2}, {0, 1, 2}), and (b) the problem is polynomial-time solvable in the BKW Case (with states {0}, {1}, {2}, {0, 2}). We show the second result by unveiling a surprising connection to the Consecutive-Ones Property (C1P) Problem, used, for instance, in DNA physical mapping, interval graph recognition and data retrieval.

Variants of the Consecutive-Ones Property motivated by the reconstruction of ancestral species

Tractability Results for the Consecutive-Ones Property with Multiplicity

Lecture Notes in Computer Science, 2011

A binary matrix has the Consecutive-Ones Property (C1P) if its columns can be ordered in such a way that all 1's in each row are consecutive. We consider here a variant of the C1P where columns can appear multiple times in the ordering. Although the general problem of deciding the C1P with multiplicity is NP-complete, we present here a case of interest in comparative genomics that is tractable.
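As an illustration of the C1P definition in the abstract above, the following minimal Python sketch tests the classical property (without multiplicity) by trying every column ordering. It is not the method of the paper: the function names `row_is_consecutive` and `find_c1p_ordering` are hypothetical, and the exhaustive search is exponential in the number of columns, so it is only suitable for tiny examples.

```python
from itertools import permutations


def row_is_consecutive(row):
    """Return True if all 1s in the row form a single contiguous block."""
    ones = [i for i, v in enumerate(row) if v == 1]
    return not ones or ones[-1] - ones[0] + 1 == len(ones)


def find_c1p_ordering(matrix):
    """Brute-force search for a column order that makes the 1s of every row consecutive.

    Returns a tuple of column indices, or None if no such ordering exists.
    Illustrative only: tries all permutations of the columns.
    """
    if not matrix:
        return ()
    n_cols = len(matrix[0])
    for order in permutations(range(n_cols)):
        if all(row_is_consecutive([row[c] for c in order]) for row in matrix):
            return order
    return None
```

For example, the matrix with rows `[1, 0, 1]` and `[0, 1, 1]` has the C1P (swap the last two columns), whereas a matrix whose rows are the column sets {0, 1}, {1, 2} and {1, 3} does not, because column 1 would need three distinct neighbours in a linear order.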

WhatsHap: Haplotype Assembly for Future-Generation Sequencing Reads

Research in Computational Molecular Biology, 2014

Complexity of Finding Non-Planar Rectilinear Drawings of Graphs

Lecture Notes in Computer Science, 2011

We study the complexity of the problem of finding nonplanar rectilinear drawings of graphs. This problem is known to be NP-complete. We consider natural restrictions of this problem where constraints are placed on the possible orientations of edges. In particular, we show that if each edge has prescribed direction "left", "right", "down" or "up", the problem of finding a rectilinear drawing is polynomial, while finding such a drawing with the minimum area is NP-complete. When assigned directions are "horizontal" or "vertical" or a cyclic order of the edges at each vertex is specified, the problem is NP-complete. We show that these two NP-complete cases are fixed-parameter tractable in the number of vertices of degree 3 or 4.

On the Gapped Consecutive-Ones Property

Electronic Notes in Discrete Mathematics, 2009

Motivated by problems of comparative genomics and paleogenomics, we introduce the Gapped Consecutive-Ones Property Problem (k, δ)-C1P: given a binary matrix M and two integers k and δ, can the columns of M be permuted such that each row contains at most k sequences of 1's and no two consecutive sequences of 1's are separated by a gap of more than δ 0's? The classical C1P problem, which is known to be polynomial, is equivalent to the (1, 0)-C1P Problem. We show that the (2, δ)-C1P Problem is NP-complete for δ ≥ 2. We conjecture that the (k, δ)-C1P Problem is NP-complete for k ≥ 2, δ ≥ 1, (k, δ) ≠ (2, 1). We also show that the (k, δ)-C1P problem can be reduced to a graph bandwidth problem parameterized by a function of k, δ and of the maximum number s of 1's in a row of M, and hence is polytime solvable if all three parameters are constant.
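The (k, δ)-C1P condition stated above can be checked directly on small matrices. The sketch below is a hedged illustration rather than the algorithm of the paper: the names `row_satisfies_gapped` and `has_gapped_c1p` are hypothetical, and the decision procedure simply enumerates all column permutations, which is exponential in the number of columns.

```python
from itertools import permutations


def row_satisfies_gapped(row, k, delta):
    """Check one row: at most k blocks of 1s, and every gap of 0s
    between two consecutive blocks has length at most delta."""
    ones = [i for i, v in enumerate(row) if v == 1]
    blocks = 1 if ones else 0
    for a, b in zip(ones, ones[1:]):
        gap = b - a - 1  # number of 0s between two successive 1s
        if gap > 0:
            blocks += 1
            if gap > delta:
                return False
    return blocks <= k


def has_gapped_c1p(matrix, k, delta):
    """Brute-force decision of the (k, delta)-C1P: try every column permutation."""
    if not matrix:
        return True
    n_cols = len(matrix[0])
    return any(
        all(row_satisfies_gapped([row[c] for c in order], k, delta) for row in matrix)
        for order in permutations(range(n_cols))
    )
```

With k = 1 and δ = 0 the check reduces to the classical C1P, matching the equivalence noted in the abstract.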

Hardness results on the gapped consecutive-ones property problem

Discrete Applied Mathematics, 2012

Motivated by problems of comparative genomics and paleogenomics, in [6] the authors introduced the Gapped Consecutive-Ones Property Problem (k, δ)-C1P: given a binary matrix M and two integers k and δ, can the columns of M be permuted such that each row contains at most k blocks of ones and no two consecutive blocks of ones are separated by a gap of more than δ zeros? The classical C1P problem, which is known to be polynomial, is equivalent to the (1, 0)-C1P problem. They showed that the (2, δ)-C1P Problem is NP-complete for all δ ≥ 2 and that the (3, 1)-C1P problem is NP-complete. They also conjectured that the (k, δ)-C1P Problem is NP-complete for k ≥ 2, δ ≥ 1 and (k, δ) ≠ (2, 1). Here, we prove that this conjecture is true. The only remaining case is the (2, 1)-C1P Problem, which could be polynomial-time solvable.

Mapping proteins in the presence of paralogs using units of coevolution

BMC Bioinformatics, 2013

Background: We study the problem of mapping proteins between two protein families in the presence of paralogs. This problem occurs as a difficult subproblem in coevolution-based computational approaches for protein-protein interaction prediction. Results: Similar to prior approaches, our method is based on the idea that coevolution implies equal rates of sequence evolution among the interacting proteins, and we provide a first attempt to quantify this notion in a formal statistical manner. We call the units that are central to this quantification scheme the units of coevolution. A unit consists of two mapped protein pairs and its score quantifies the coevolution of the pairs. This quantification allows us to provide a maximum likelihood formulation of the paralog mapping problem and to cast it into a binary quadratic programming formulation. Conclusion: CUPID, our software tool based on a Lagrangian relaxation of this formulation, makes it, for the first time, possible to compute state-of-the-art quality pairings in a few minutes of runtime. In summary, we suggest a novel alternative to the earlier available approaches, which is statistically sound and computationally feasible.