Differences in transcriptional regulatory networks underlie much of the phenotypic variation observed across organisms. Changes to cis-regulatory elements are widely believed to be the predominant means by which regulatory networks evolve, yet examples of regulatory network divergence due to transcription factor (TF) variation have also been observed. To systematically ascertain the extent to which TFs contribute to regulatory divergence, we analyzed the evolution of the largest class of metazoan TFs, Cys2-His2 zinc finger (C2H2-ZF) TFs, across 12 Drosophila species spanning ~45 million years of evolution. Remarkably, we uncovered that a significant fraction of all C2H2-ZF 1-to-1 orthologs in flies exhibit variations that can affect their DNA-binding specificities. In addition to loss and recruitment of C2H2-ZF domains, we found diverging DNA-contacting residues in ~44% of domains shared between D. melanogaster and the other fly species. These diverging DNA-contacting residues, found in ~70% of the D. melanogaster C2H2-ZF genes in our analysis and corresponding to ~26% of all annotated D. melanogaster TFs, show evidence of functional constraint: they tend to be conserved across phylogenetic clades and evolve slower than other diverging residues. These same variations were rarely found as polymorphisms within a population of D. melanogaster flies, indicating their rapid fixation. The predicted specificities of these dynamic domains gradually change across phylogenetic distances, suggesting stepwise evolutionary trajectories for TF divergence. Further, whereas proteins with conserved C2H2-ZF domains are enriched in developmental functions, those with varying domains exhibit no functional enrichments. Our work suggests that a subset of highly dynamic and largely unstudied TFs are a likely source of regulatory variation in Drosophila and other metazoans.
Questions about the structure/microstructure versus properties relationships in strongly correlated electronic oxides will be illustrated by two examples. First, the Mn/Ni ordering in La2NiMnO6 compounds prepared in the form of thin films and, second, the La/Ba ordering in bulk 112-type cobalt oxides LaBaCo2O6 will be discussed in relation to their magnetic and electric properties. Using transmission electron microscopy as the main tool, several questions are addressed regarding the need, but also the difficulty, to establish the structure, microstructure and nanostructure at all relevant temperatures as a prerequisite for further analysis of the physical properties of these compounds.
Identifying a protein's functional sites is an important step towards characterizing its molecular function. Numerous structure- and sequence-based methods have been developed for this problem. Here we introduce ConCavity, a small molecule binding site prediction algorithm that integrates evolutionary sequence conservation estimates with structure-based methods for identifying protein surface cavities. In large-scale testing on a diverse set of single- and multi-chain protein structures, we show that ConCavity substantially outperforms existing methods for identifying both 3D ligand binding pockets and individual ligand binding residues. As part of our testing, we perform one of the first direct comparisons of conservation-based and structure-based methods. We find that the two approaches provide largely complementary information, which can be combined to improve upon either approach alone. We also demonstrate that ConCavity has state-of-the-art performance in predicting catalyt...
We report the structural and electrical properties of erbium oxide films grown on Si(100) in the temperature range 450-600 °C by low-pressure metalorganic chemical vapour deposition (MOCVD) using Er(acac)3.phen, the phenanthroline adduct of erbium acetylacetonate, as precursor. Film properties are correlated with growth and processing conditions. Structural characterization reveals that films grown at lower temperatures are smooth, but poorly crystalline,
High-throughput experimental and computational approaches to characterize proteins and their interactions have resulted in large-scale biological networks for many organisms, from bacteria to yeast to human. These complex networks are comprised of a number of distinct ...
Abstract. In the motif finding problem one seeks a set of mutually similar subsequences within a collection of biological sequences. This is an important and widely-studied problem, as such shared motifs in DNA often correspond to regulatory elements. We study a combinatorial ...
G-quadruplex DNA is a four-stranded DNA structure formed by non-Watson-Crick base pairing between stacked sets of four guanines. Many possible functions have been proposed for this structure, but its in vivo role in the cell is still largely unresolved. We carried out a genome-wide survey of the evolutionary conservation of regions with the potential to form G-quadruplex DNA structures (G4 DNA motifs) across seven yeast species. We found that G4 DNA motifs were significantly more conserved than expected by chance, and the nucleotide-level conservation patterns suggested that the motif conservation was the result of the formation of G4 DNA structures. We characterized the association of conserved and nonconserved G4 DNA motifs in Saccharomyces cerevisiae with more than 40 known genome features and gene classes. Our comprehensive, integrated evolutionary and functional analysis confirmed the previously observed associations of G4 DNA motifs with promoter regions and the rDNA, and it identified several previously unrecognized associations of G4 DNA motifs with genomic features, such as mitotic and meiotic double-strand break sites (DSBs). Conserved G4 DNA motifs maintained strong associations with promoters and the rDNA, but not with DSBs. We also performed the first analysis of G4 DNA motifs in the mitochondria, and surprisingly found a tenfold higher concentration of the motifs in the AT-rich yeast mitochondrial DNA than in nuclear DNA. The evolutionary conservation of the G4 DNA motif and its association with specific genome features support the hypothesis that G4 DNA has in vivo functions that are under evolutionary constraint. JAC and KP are co-first authors on this work.
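To make the notion of a "G4 DNA motif" concrete, the sketch below scans a sequence for four runs of guanines separated by short loops, using a regular expression. The abstract does not state the motif definition used in the survey, so the function name `find_g4_motifs`, the minimum run length of 3, and the maximum loop length of 7 are assumptions chosen for illustration; C-rich matches are reported as motifs on the opposite strand.

```python
import re

def find_g4_motifs(seq, min_run=3, max_loop=7):
    """Return (start, end, strand) for candidate G4 motifs in seq.

    Assumed pattern: four runs of >= min_run guanines separated by
    loops of 1..max_loop bases. These thresholds are illustrative.
    """
    seq = seq.upper()
    hits = []
    # A C-rich match on the given strand implies a G-rich motif on the other strand.
    for base, strand in (("G", "+"), ("C", "-")):
        run = f"{base}{{{min_run},}}"          # e.g. G{3,}
        loop = f"[ACGT]{{1,{max_loop}}}?"      # e.g. [ACGT]{1,7}? (lazy)
        pattern = re.compile(f"{run}(?:{loop}{run}){{3}}")
        for match in pattern.finditer(seq):    # non-overlapping matches only
            hits.append((match.start(), match.end(), strand))
    return hits

if __name__ == "__main__":
    for hit in find_g4_motifs("TTGGGAGGGTGGGAGGGTT"):
        print(hit)
```

Because `finditer` returns non-overlapping matches, overlapping candidates are merged into the first match found; a conservation analysis would then ask whether the aligned positions of each hit still satisfy the pattern in the other species.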
A study of the growth, structure, and properties of Eu2O3 thin films was carried out. Films were grown in the 500-600 °C temperature range on Si(100) and fused quartz from the complex Eu(acac)3·Phen by the low-pressure metalorganic chemical vapor deposition technique, which has rarely been used for Eu2O3 deposition. These films were polycrystalline. Depending on the growth conditions and substrates employed, the films also contained a parasitic phase. This phase can be removed by post-deposition annealing in an oxidizing ambient. The morphology of the films was characterized by well-packed spherical mounds. Optical measurements showed that the bandgap of the pure Eu2O3 phase was 4.4 eV. High-frequency (1 MHz) capacitance-voltage (C-V) measurements showed that the dielectric constant of the pure Eu2O3 film was about 12. Possible effects of cation and oxygen deficiency and of the parasitic phase on the optical and electrical properties of Eu2O3 films are briefly discussed.
In this paper, we report the structural and magnetic properties of electron-doped Ca1−xSmxMnO3 (CSM) nanoparticles. The samarium content "x" was varied from 0 to 0.2, with special attention to the range up to 0.05. Spherical 60-70 nm polycrystalline CSM nanoparticles were synthesised by a chemical coprecipitation technique. Doping of Sm3+ in antiferromagnetic CaMnO3 drastically altered its magnetic behavior due to the formation of ferromagnetic clusters. For example, the CSM powder with x = 0.04 displays a magnetic Curie temperature of about 115 K and a saturation magnetization of about 0.1 emu/mole. Physical properties of our nano-CSM powders are also compared with those of identical bulk samples. To understand the differences, we invoke intra-grain and inter-grain magnetic coupling processes that enhance their ferromagnetic behavior. Unlike in the bulk samples, such magnetic couplings in nanoparticles are favored by the presence of low-level crystal and interfacial defects.
The fluorescence behavior of molecular dyes at discrete distances from 1.5 nm diameter gold nanoparticles is investigated as a function of distance and energy. Photoluminescence and luminescence lifetime measurements both demonstrate quenching behavior consistent with a 1/d^4 dependence on the separation distance from the dye to the surface of the nanoparticle. In agreement with the model of Persson and Lang, all experimental data show that energy transfer to the metal surface is the dominant quenching mechanism, and the radiative rate is unchanged throughout the experiment.
We report the structural and optical properties of oriented polycrystalline thin films of rare earth oxides (REO), namely Er2O3, Gd2O3, Eu2O3, and Yb2O3 grown on fused quartz by low-pressure metalorganic chemical vapour deposition (MOCVD) in the temperature ...
Side chain positioning is an important subproblem of the general protein-structure-prediction problem, with applications in homology modeling and protein design. The side chain positioning problem takes a fixed backbone and a protein sequence and predicts the lowest energy conformation of the protein's side chains on this backbone. We study a widely used version of the problem where the side chain positioning procedure uses a rotamer library and an energy function that can be expressed as a sum of pairwise terms. The problem is NP-complete; we show that it cannot even be approximated. In practice, it is tackled by a variety of general search techniques and specialized heuristics. Here, we propose formulating the side chain positioning problem as an instance of semidefinite programming (SDP). We introduce two novel rounding schemes and provide theoretical justification for their effectiveness under various conditions. We apply our method on simulated data, as well as on the computational redesign of two naturally occurring protein cores, and show that our SDP approach generally finds good solutions. Beyond the context of side chain positioning, our very general rounding schemes should be applicable elsewhere.
Sound velocities, viscosity and density of aqueous solution of PEG of average molecular weight of 4000 g/mole have been measured as a function of temperature in the range 308-338 K at different frequencies. Isentropic compressibility, ultrasonic attenuation and acoustic ...
The chemical structures of biomolecules, whether naturally occurring or synthetic, are composed of functionally important building blocks. Given a set of small molecules (for example, those known to bind a particular protein), computationally decomposing them into chemically meaningful fragments can help elucidate their functional properties, and may be useful for designing novel compounds with similar properties. Here we introduce molBLOCKS, a suite of programs for breaking down sets of small molecules into fragments according to a predefined set of chemical rules, clustering the resulting fragments, and uncovering statistically enriched fragments. Among other applications, our software should be a great aid in large-scale chemical analysis of ligands binding specific targets of interest. Availability and implementation: molBLOCKS is available as GPL C++ source code at
Motivation: Within a homologous protein family, proteins may be grouped into subtypes that share specific functions that are not common to the entire family. Often, the amino acids present in a small number of sequence positions determine each protein's particular functional specificity. Knowledge of these specificity determining positions (SDPs) aids in protein function prediction, drug design, and experimental analysis. A number of sequence-based computational methods have been introduced for identifying SDPs; however, their further development and evaluation have been hindered by the limited number of known experimentally-determined SDPs. Results: We combine several bioinformatics resources to automate a process, typically undertaken manually, to build a data set of SDPs. The resulting large data set, which consists of SDPs in enzymes, enables us to characterize SDPs in terms of their physicochemical and evolutionary properties. It also facilitates the large-scale evaluation of sequence-based SDP prediction methods. We present a simple sequence-based SDP prediction method, GroupSim, and show that, surprisingly, it is competitive with a representative set of current methods. We also describe ConsWin, a heuristic that considers sequence conservation of neighboring amino acids, and demonstrate that it improves the performance of all methods tested on our large data set of enzyme SDPs. Availability: Data sets and GroupSim code are available online at
Motivation: An important step in unravelling the transcriptional regulatory network of an organism is to identify, for each transcription factor, all of its DNA binding sites. Several approaches are commonly used in searching for a transcription factor's binding sites, including consensus sequences and position-specific scoring matrices. In addition, methods that compute the average number of nucleotide matches between a putative site and all known sites can be employed. Such basic approaches can all be naturally extended by incorporating pairwise nucleotide dependencies and per-position information content. In this paper, we evaluate the effectiveness of these basic approaches and their extensions in finding binding sites for a transcription factor of interest without erroneously identifying other genomic sequences. Results: In cross-validation testing on a dataset of Escherichia coli transcription factors and their binding sites, we show that there are statistically significant differences in how well various methods identify transcription factor binding sites. The use of per-position information content improves the performance of all basic approaches. Furthermore, including local pairwise nucleotide dependencies within binding site models results in statistically significant performance improvements for approaches based on nucleotide matches. Based on our analysis, the best results when searching for DNA binding sites of a particular transcription factor are obtained by methods that incorporate both information content and local pairwise correlations. Availability: The software is available at
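As an illustration of how per-position information content can be folded into a position-specific scoring matrix, the sketch below builds a log-odds model from known sites and optionally weights each column's contribution by its information content. This is not the paper's software: the function names `build_model` and `score`, the pseudocount, the uniform background, and the multiplicative weighting are all assumptions made for the example.

```python
import math

BASES = "ACGT"

def build_model(sites, pseudocount=1.0):
    """Build per-position (log-odds, information content) pairs.

    Assumes equal-length sites over ACGT and a uniform background of 0.25.
    """
    length = len(sites[0])
    model = []
    for j in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        probs = {b: counts[b] / total for b in BASES}
        # Information content in bits relative to a uniform background.
        ic = 2.0 + sum(p * math.log2(p) for p in probs.values())
        log_odds = {b: math.log2(probs[b] / 0.25) for b in BASES}
        model.append((log_odds, ic))
    return model

def score(model, candidate, weight_by_ic=True):
    """Sum per-position log-odds, optionally weighted by information content."""
    total = 0.0
    for (log_odds, ic), base in zip(model, candidate):
        total += (ic if weight_by_ic else 1.0) * log_odds[base]
    return total

if __name__ == "__main__":
    known_sites = ["TTGACA", "TTGACA", "TTGACT", "TTTACA"]  # toy data
    model = build_model(known_sites)
    print(score(model, "TTGACA"), score(model, "AAAAAA"))
```

Weighting by information content reduces the influence of degenerate columns, so mismatches at highly conserved positions are penalized more than mismatches at variable ones.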
Side-chain positioning is a central component of homology modeling and protein design. In a common formulation of the problem, the backbone is fixed, side-chain conformations come from a rotamer library, and a pairwise energy function is optimized. It is NP-complete to find even a reasonable approximate solution to this problem. We seek to put this hardness result into practical context. We present an integer linear programming (ILP) formulation of side-chain positioning that allows us to tackle large problem sizes. We relax the integrality constraint to give a polynomial-time linear programming (LP) heuristic. We apply LP to position side chains on native and homologous backbones and to choose side chains for protein design. Surprisingly, when positioning side chains on native and homologous backbones, optimal solutions using a simple, biologically relevant energy function can usually be found using LP. On the other hand, the design problem often cannot be solved using LP directly; however, optimal solutions for large instances can still be found using the computationally more expensive ILP procedure. While different energy functions also affect the difficulty of the problem, the LP/ILP approach is able to find optimal solutions. Our analysis is the first large-scale demonstration that LP-based approaches are highly effective in finding optimal (and successive near-optimal) solutions for the side-chain positioning problem.
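One way such an ILP could be written is sketched below. The abstract does not give the formulation, so the notation is an assumption: $V_i$ is the set of candidate rotamers at position $i$, $E_u$ is the self-energy of rotamer $u$, $E_{uv}$ is the pairwise energy between rotamers $u$ and $v$ at different positions, $x_u$ indicates that rotamer $u$ is chosen, and $x_{uv} = x_{vu}$ indicates that both $u$ and $v$ are chosen.

```latex
\begin{align*}
\min_{x}\quad
  & \sum_{i}\sum_{u \in V_i} E_u\, x_u
    \;+\; \sum_{i<j}\sum_{u \in V_i}\sum_{v \in V_j} E_{uv}\, x_{uv} \\
\text{s.t.}\quad
  & \sum_{u \in V_i} x_u = 1
    && \text{for every position } i, \\
  & \sum_{u \in V_i} x_{uv} = x_v
    && \text{for every } j \neq i \text{ and every } v \in V_j, \\
  & x_u,\ x_{uv} \in \{0,1\}.
\end{align*}
```

The first constraint selects exactly one rotamer per position, and the second forces a pair variable to be active only when both of its rotamers are selected. The LP heuristic replaces the last line with $0 \le x_u, x_{uv} \le 1$; whenever the LP optimum happens to be integral, it is also optimal for the ILP.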
Motivation: Not all residues in a protein are equally important. Some are essential for the proper structure and function of the protein, whereas others can be readily replaced. Conservation analysis is one of the most widely used methods for predicting these functionally important residues in protein sequences. Results: We introduce an information-theoretic approach for estimating sequence conservation based on Jensen-Shannon divergence. We also develop a general heuristic that considers the estimated conservation of sequentially neighboring sites. In large-scale testing, we demonstrate that our combined approach outperforms previous conservation-based measures in identifying functionally important residues; in particular, it is significantly better than the commonly used Shannon entropy measure. We find that considering conservation at sequential neighbors improves the performance of all methods tested. Our analysis also reveals that many existing methods that attempt to incorporate the relationships between amino acids do not lead to better identification of functionally important sites. Finally, we find that while conservation is highly predictive in identifying catalytic sites and residues near bound ligands, it is much less effective in identifying residues in protein-protein interfaces. Availability: Data sets and code for all conservation measures evaluated are available at
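The two ideas in this abstract, scoring a column by Jensen-Shannon divergence and then taking neighboring columns into account, can be sketched as follows. This is a minimal illustration rather than the released code: the uniform background, the tiny pseudocount, the handling of gaps (ignored), and the window size and mixing weight `lam` in `window_smooth` are all assumptions.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_distribution(column, pseudocount=1e-6):
    """Amino acid frequencies for one alignment column; gaps are ignored."""
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return [(counts[aa] + pseudocount) / total for aa in AMINO_ACIDS]

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (between 0 and 1)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def js_conservation(column, background=None):
    """Score a column by its divergence from a background distribution."""
    if background is None:
        # Assumption: uniform background; a real analysis would use
        # empirical amino acid frequencies.
        background = [1.0 / len(AMINO_ACIDS)] * len(AMINO_ACIDS)
    return js_divergence(column_distribution(column), background)

def window_smooth(scores, window=3, lam=0.5):
    """Mix each score with the mean of its sequence neighbors (hypothetical form)."""
    smoothed = []
    for i, s in enumerate(scores):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        neighbors = [scores[j] for j in range(lo, hi) if j != i]
        avg = sum(neighbors) / len(neighbors) if neighbors else s
        smoothed.append(lam * s + (1 - lam) * avg)
    return smoothed

if __name__ == "__main__":
    columns = ["AAAAAAAA", "ACDEFGHI", "LLLLLIVL"]  # toy alignment columns
    raw = [js_conservation(c) for c in columns]
    print(raw, window_smooth(raw, window=1))
```

A fully conserved column diverges strongly from the background and scores high, while a column that resembles the background scores near zero; the smoothing step raises the score of a moderately conserved site that sits among conserved neighbors.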
Motivation: Determining protein function is one of the most important problems in the post-genomic era. For the typical proteome, there are no functional annotations for one-third or more of its proteins. Recent high-throughput experiments have determined proteome-scale protein physical interaction maps for several organisms. These physical interactions are complemented by an abundance of data about other types of functional relationships between proteins, including genetic interactions, knowledge about co-expression and shared evolutionary history. Taken together, these pairwise linkages can be used to build whole-proteome protein interaction maps. Results: We develop a network-flow based algorithm, FunctionalFlow, that exploits the underlying structure of protein interaction maps in order to predict protein function. In cross-validation testing on the yeast proteome, we show that FunctionalFlow has improved performance over previous methods in predicting the function of proteins with few (or no) annotated protein neighbors. By comparing several methods that use protein interaction maps to predict protein function, we demonstrate that FunctionalFlow performs well because it takes advantage of both network topology and some measure of locality. Finally, we show that performance can be improved substantially as we consider multiple data sources and use them to create weighted interaction networks.
Differences in transcriptional regulatory networks underlie much of the phenotypic variation obse... more Differences in transcriptional regulatory networks underlie much of the phenotypic variation observed across organisms. Changes to cis-regulatory elements are widely believed to be the predominant means by which regulatory networks evolve, yet examples of regulatory network divergence due to transcription factor (TF) variation have also been observed. To systematically ascertain the extent to which TFs contribute to regulatory divergence, we analyzed the evolution of the largest class of metazoan TFs, Cys2-His2 zinc finger (C2H2-ZF) TFs, across 12 Drosophila species spanning~45 million years of evolution. Remarkably, we uncovered that a significant fraction of all C2H2-ZF 1-to-1 orthologs in flies exhibit variations that can affect their DNA-binding specificities. In addition to loss and recruitment of C2H2-ZF domains, we found diverging DNA-contacting residues in~44% of domains shared between D. melanogaster and the other fly species. These diverging DNA-contacting residues, found in~70% of the D. melanogaster C2H2-ZF genes in our analysis and corresponding to~26% of all annotated D. melanogaster TFs, show evidence of functional constraint: they tend to be conserved across phylogenetic clades and evolve slower than other diverging residues. These same variations were rarely found as polymorphisms within a population of D. melanogaster flies, indicating their rapid fixation. The predicted specificities of these dynamic domains gradually change across phylogenetic distances, suggesting stepwise evolutionary trajectories for TF divergence. Further, whereas proteins with conserved C2H2-ZF domains are enriched in developmental functions, those with varying domains exhibit no functional enrichments. Our work suggests that a subset of highly dynamic and largely unstudied TFs are a likely source of regulatory variation in Drosophila and other metazoans.
Questions about the structure/microstructure versus properties relationships in strongly correlat... more Questions about the structure/microstructure versus properties relationships in strongly correlated electronic oxides will be illustrated by two examples. First the Mn/Ni ordering in the La 2 NiMnO 6 compounds prepared in the form of thin films and secondly the La/Ba ordering on bulk 112-type cobalt oxides LaBaCo 2 O 6 will be discussed in relation with their magnetic and electric properties. Using transmission electron microscopy as a main tool, several questions are addressed regarding the need but also the difficulty to establish the structure, microstructure and nanostructure at all relevant temperature as a prerequisite for further analysis of the physical properties in these compounds.
Identifying a protein's functional sites is an important step towards characterizing its mole... more Identifying a protein's functional sites is an important step towards characterizing its molecular function. Numerous structure- and sequence-based methods have been developed for this problem. Here we introduce ConCavity, a small molecule binding site prediction algorithm that integrates evolutionary sequence conservation estimates with structure-based methods for identifying protein surface cavities. In large-scale testing on a diverse set of single- and multi-chain protein structures, we show that ConCavity substantially outperforms existing methods for identifying both 3D ligand binding pockets and individual ligand binding residues. As part of our testing, we perform one of the first direct comparisons of conservation-based and structure-based methods. We find that the two approaches provide largely complementary information, which can be combined to improve upon either approach alone. We also demonstrate that ConCavity has state-of-the-art performance in predicting catalyt...
We report the structural and electrical properties of erbium oxide films grown on Si(100) in the ... more We report the structural and electrical properties of erbium oxide films grown on Si(100) in the temperature range 450-600 °C by low-pressure metalorganic chemical vapour deposition (MOCVD) using Er(acac)3.phen, the phenanthroline adduct of erbium acetylacetonate, as precursor. Film properties are correlated with growth and processing conditions. Structural characterization reveals that films grown at lower temperatures are smooth, but poorly crystalline,
High-throughput experimental and computational approaches to characterize proteins and their inte... more High-throughput experimental and computational approaches to characterize proteins and their interactions have resulted in large-scale biological networks for many organisms, from bacteria to yeast to human. These complex networks are comprised of a number of distinct ...
Abstract. In the motif finding problem one seeks a set of mutually similar subsequences within a ... more Abstract. In the motif finding problem one seeks a set of mutually similar subsequences within a collection of biological sequences. This is an important and widely-studied problem, as such shared motifs in DNA often correspond to regulatory elements. We study a combinatorial ...
G-quadruplex DNA is a four-stranded DNA structure formed by non-Watson-Crick base pairing between... more G-quadruplex DNA is a four-stranded DNA structure formed by non-Watson-Crick base pairing between stacked sets of four guanines. Many possible functions have been proposed for this structure, but its in vivo role in the cell is still largely unresolved. We carried out a genome-wide survey of the evolutionary conservation of regions with the potential to form Gquadruplex DNA structures (G4 DNA motifs) across seven yeast species. We found that G4 DNA motifs were significantly more conserved than expected by chance, and the nucleotide-level conservation patterns suggested that the motif conservation was the result of the formation of G4 DNA structures. We characterized the association of conserved and nonconserved G4 DNA motifs in Saccharomyces cerevisiae with more than 40 known genome features and gene classes. Our comprehensive, integrated evolutionary and functional analysis confirmed the previously observed associations of G4 DNA motifs with promoter regions and the rDNA, and it identified several previously unrecognized associations of G4 DNA motifs with genomic features, such as mitotic and meiotic double-strand break sites (DSBs). Conserved G4 DNA motifs maintained strong associations with promoters and the rDNA, but not with DSBs. We also performed the first analysis of G4 DNA motifs in the mitochondria, and surprisingly found a tenfold higher concentration of the motifs in the AT-rich yeast mitochondrial DNA than in nuclear DNA. The evolutionary conservation of the G4 DNA motif and its association with specific genome features supports the hypothesis that G4 DNA has in vivo functions that are under evolutionary constraint. (VAZ) " JAC and KP are co-first authors on this work.
A study of growth, structure, and properties of Eu 2 O 3 thin films were carried out. Films were ... more A study of growth, structure, and properties of Eu 2 O 3 thin films were carried out. Films were grown at 500-600 • C temperature range on Si(1 0 0) and fused quartz from the complex of Eu(acac) 3 ·Phen by low pressure metalorganic chemical vapor deposition technique which has been rarely used for Eu 2 O 3 deposition. These films were polycrystalline. Depending on growth conditions and substrates employed, these films had also possessed a parasitic phase. This phase can be removed by post-deposition annealing in oxidizing ambient. Morphology of the films was characterized by well-packed spherical mounds. Optical measurements exhibited that the bandgap of pure Eu 2 O 3 phase was 4.4 eV. High frequency 1 MHz capacitance-voltage (C-V) measurements showed that the dielectric constant of pure Eu 2 O 3 film was about 12. Possible effects of cation and oxygen deficiency and parasitic phase on the optical and electrical properties of Eu 2 O 3 films have been briefly discussed.
In this paper, we report the structural and magnetic properties of electron-doped Ca 1−x Sm x MnO... more In this paper, we report the structural and magnetic properties of electron-doped Ca 1−x Sm x MnO 3 (CSM) nanoparticles. The samarium's composition "x" was varied from 0 to 0.2 with the special attention up to 0.05. Spherical 60-70 nm polycrystalline CSM nanoparticles were synthesised by chemical coprecipitation technique. Doping of Sm 3+ in antiferromagnetic CaMnO 3 has drastically altered its magnetic behavior due to the formation of ferromagnetic clusters. For example, the CSM powder with x = 0.04 displays about 115 K magnetic Curie temperature and about 0.1 emu/mole saturation magnetization. Physical properties of our nano-CSM powders are also compared with identical bulk-samples. To understand the differences, we invoked the intra-grain and inter-grain magnetic coupling process that facilitates to enhance their ferromagnetic behaviors. Unlike the bulk samples, such magnetic couplings in nanoparticles are favored by the presence of low-level crystal and interfacial defects.
The fluorescence behavior of molecular dyes at discrete distances from 1.5 nm diameter gold nanop... more The fluorescence behavior of molecular dyes at discrete distances from 1.5 nm diameter gold nanoparticles as a function of distance and energy is investigated. Photoluminescence and luminescence lifetime measurements both demonstrate quenching behavior consistent with 1/d 4 separation distance from dye to the surface of the nanoparticle. In agreement with the model of Persson and Lang, all experimental data show that energy transfer to the metal surface is the dominant quenching mechanism, and the radiative rate is unchanged throughout the experiment.
We report the structural and optical properties of oriented polycrystalline thin films of rare ea... more We report the structural and optical properties of oriented polycrystalline thin films of rare earth oxides (REO), namely Er2O3, Gd2O3, Eu2O3, and Yb2O3 grown on fused quartz by low-pressure metalorganic chemical vapour deposition (MOCVD) in the temperature ...
S ide chain positioning is an important subproblem of the general protein-structure-prediction pr... more S ide chain positioning is an important subproblem of the general protein-structure-prediction problem, with applications in homology modeling and protein design. The side chain positioning problem takes a fixed backbone and a protein sequence and predicts the lowest energy conformation of the protein's side chains on this backbone. We study a widely used version of the problem where the side chain positioning procedure uses a rotamer library and an energy function that can be expressed as a sum of pairwise terms. The problem is NP-complete; we show that it cannot even be approximated. In practice, it is tackled by a variety of general search techniques and specialized heuristics. Here, we propose formulating the side chain positioning problem as an instance of semidefinite programming (SDP). We introduce two novel rounding schemes and provide theoretical justification for their effectiveness under various conditions. We apply our method on simulated data, as well as on the computational redesign of two naturally occurring protein cores, and show that our SDP approach generally finds good solutions. Beyond the context of side chain positioning, our very general rounding schemes should be applicable elsewhere.
Sound velocities, viscosity and density of aqueous solutions of PEG of average molecular weight 4000 g/mol have been measured as a function of temperature in the range 308–338 K at different frequencies. Isentropic compressibility, ultrasonic attenuation and acoustic ...
The chemical structures of biomolecules, whether naturally occurring or synthetic, are composed of functionally important building blocks. Given a set of small molecules (for example, those known to bind a particular protein), computationally decomposing them into chemically meaningful fragments can help elucidate their functional properties, and may be useful for designing novel compounds with similar properties. Here we introduce molBLOCKS, a suite of programs for breaking down sets of small molecules into fragments according to a predefined set of chemical rules, clustering the resulting fragments, and uncovering statistically enriched fragments. Among other applications, our software should be a great aid in large-scale chemical analysis of ligands binding specific targets of interest. Availability and implementation: molBLOCKS is available as GPL C++ source code at
Motivation: Within a homologous protein family, proteins may be grouped into subtypes that share specific functions that are not common to the entire family. Often, the amino acids present in a small number of sequence positions determine each protein's particular functional specificity. Knowledge of these specificity determining positions (SDPs) aids in protein function prediction, drug design, and experimental analysis. A number of sequence-based computational methods have been introduced for identifying SDPs; however, their further development and evaluation have been hindered by the limited number of known experimentally determined SDPs. Results: We combine several bioinformatics resources to automate a process, typically undertaken manually, to build a data set of SDPs. The resulting large data set, which consists of SDPs in enzymes, enables us to characterize SDPs in terms of their physicochemical and evolutionary properties. It also facilitates the large-scale evaluation of sequence-based SDP prediction methods. We present a simple sequence-based SDP prediction method, GroupSim, and show that, surprisingly, it is competitive with a representative set of current methods. We also describe ConsWin, a heuristic that considers sequence conservation of neighboring amino acids, and demonstrate that it improves the performance of all methods tested on our large data set of enzyme SDPs. Availability: Data sets and GroupSim code are available online at
Motivation: An important step in unravelling the transcriptional regulatory network of an organism is to identify, for each transcription factor, all of its DNA binding sites. Several approaches are commonly used in searching for a transcription factor's binding sites, including consensus sequences and position-specific scoring matrices. In addition, methods that compute the average number of nucleotide matches between a putative site and all known sites can be employed. Such basic approaches can all be naturally extended by incorporating pairwise nucleotide dependencies and per-position information content. In this paper, we evaluate the effectiveness of these basic approaches and their extensions in finding binding sites for a transcription factor of interest without erroneously identifying other genomic sequences. Results: In cross-validation testing on a dataset of Escherichia coli transcription factors and their binding sites, we show that there are statistically significant differences in how well various methods identify transcription factor binding sites. The use of per-position information content improves the performance of all basic approaches. Furthermore, including local pairwise nucleotide dependencies within binding site models results in statistically significant performance improvements for approaches based on nucleotide matches. Based on our analysis, the best results when searching for DNA binding sites of a particular transcription factor are obtained by methods that incorporate both information content and local pairwise correlations. Availability: The software is available at
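The basic approaches named in this abstract (a position-specific scoring matrix, and the average number of nucleotide matches weighted by per-position information content) can be illustrated with a minimal sketch. This is not the authors' software: the function names, the pseudocount of 1, and the uniform background of 0.25 per base are all assumptions made for illustration, and pairwise nucleotide dependencies are not modeled.

```python
import math

BASES = "ACGT"


def position_frequencies(sites, pseudocount=1.0):
    """Per-position base frequencies from equal-length known sites.

    The pseudocount is an illustrative assumption, not a value from the paper.
    """
    length = len(sites[0])
    freqs = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        freqs.append({b: counts[b] / total for b in BASES})
    return freqs


def information_content(freq):
    """Information content in bits: 2 minus Shannon entropy (uniform background assumed)."""
    return 2.0 + sum(p * math.log2(p) for p in freq.values() if p > 0)


def pssm_score(candidate, freqs, background=0.25):
    """Log-odds score of a candidate site under the position-specific model."""
    return sum(math.log2(f[b] / background) for f, b in zip(freqs, candidate))


def ic_weighted_matches(candidate, sites, freqs):
    """Average number of matches to the known sites, each position weighted by its information content."""
    weights = [information_content(f) for f in freqs]
    total = 0.0
    for site in sites:
        total += sum(w for w, a, b in zip(weights, candidate, site) if a == b)
    return total / len(sites)


if __name__ == "__main__":
    known = ["TTGACA", "TTGACA", "TTGACT", "TTGATA"]  # toy sites, not real data
    model = position_frequencies(known)
    for candidate in ("TTGACA", "AAAAAA"):
        print(candidate, pssm_score(candidate, model), ic_weighted_matches(candidate, known, model))
```

Weighting by information content means that a mismatch at a strongly conserved position costs more than a mismatch at a variable one, which is the intuition behind the improvement the abstract reports.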
Side-chain positioning is a central component of homology modeling and protein design. In a common formulation of the problem, the backbone is fixed, side-chain conformations come from a rotamer library, and a pairwise energy function is optimized. It is NP-complete to find even a reasonable approximate solution to this problem. We seek to put this hardness result into practical context. We present an integer linear programming (ILP) formulation of side-chain positioning that allows us to tackle large problem sizes. We relax the integrality constraint to give a polynomial-time linear programming (LP) heuristic. We apply LP to position side chains on native and homologous backbones and to choose side chains for protein design. Surprisingly, when positioning side chains on native and homologous backbones, optimal solutions using a simple, biologically relevant energy function can usually be found using LP. On the other hand, the design problem often cannot be solved using LP directly; however, optimal solutions for large instances can still be found using the computationally more expensive ILP procedure. While different energy functions also affect the difficulty of the problem, the LP/ILP approach is able to find optimal solutions. Our analysis is the first large-scale demonstration that LP-based approaches are highly effective in finding optimal (and successive near-optimal) solutions for the side-chain positioning problem.
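For orientation, a standard way to write side-chain positioning with pairwise energies as an integer linear program is sketched below. The abstract does not give the formulation, so the notation is an assumption: $R_i$ is the set of rotamers available at position $i$, $E_{iu}$ is the self-energy of rotamer $u$ at position $i$, $E_{iu,jv}$ is the pairwise energy between rotamer $u$ at $i$ and rotamer $v$ at $j$, and $x_{iu,jv}$ and $x_{jv,iu}$ denote the same variable.

```latex
\begin{align}
\min \quad & \sum_{i} \sum_{u \in R_i} E_{iu}\, x_{iu}
           + \sum_{i < j} \sum_{u \in R_i} \sum_{v \in R_j} E_{iu,jv}\, x_{iu,jv} \\
\text{s.t.} \quad & \sum_{u \in R_i} x_{iu} = 1
           && \forall i \\
& \sum_{v \in R_j} x_{iu,jv} = x_{iu}
           && \forall i,\ \forall j \neq i,\ \forall u \in R_i \\
& x_{iu},\ x_{iu,jv} \in \{0, 1\}
\end{align}
```

The first constraint selects exactly one rotamer per position, and the second makes a pair variable active only when both of its rotamers are selected. Replacing the last line with $0 \le x_{iu},\, x_{iu,jv} \le 1$ gives the LP relaxation; when the LP optimum happens to be integral, it is also optimal for the ILP.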
Motivation: Not all residues in a protein are equally important. Some are essential for the proper structure and function of the protein, whereas others can be readily replaced. Conservation analysis is one of the most widely used methods for predicting these functionally important residues in protein sequences. Results: We introduce an information-theoretic approach for estimating sequence conservation based on Jensen-Shannon divergence. We also develop a general heuristic that considers the estimated conservation of sequentially neighboring sites. In large-scale testing, we demonstrate that our combined approach outperforms previous conservation-based measures in identifying functionally important residues; in particular, it is significantly better than the commonly used Shannon entropy measure. We find that considering conservation at sequential neighbors improves the performance of all methods tested. Our analysis also reveals that many existing methods that attempt to incorporate the relationships between amino acids do not lead to better identification of functionally important sites. Finally, we find that while conservation is highly predictive in identifying catalytic sites and residues near bound ligands, it is much less effective in identifying residues in protein-protein interfaces. Availability: Data sets and code for all conservation measures evaluated are available at
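A minimal sketch of the two ideas in this abstract follows: scoring an alignment column by the Jensen-Shannon divergence between its amino acid distribution and a background distribution, and then mixing each score with the mean score of its sequential neighbors. It is not the authors' code. The uniform background, the pseudocount, the window size, and the mixing weight `lam` are all assumptions chosen for illustration, and gaps and sequence weighting are ignored.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_distribution(column, pseudocount=1e-6):
    """Amino acid distribution of one alignment column (non-residue characters are skipped)."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {a: (counts[a] + pseudocount) / total for a in AMINO_ACIDS}


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; q must be positive wherever p is."""
    return sum(p[a] * math.log2(p[a] / q[a]) for a in AMINO_ACIDS if p[a] > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence with equal weights; lies in [0, 1] when measured in bits."""
    m = {a: 0.5 * (p[a] + q[a]) for a in AMINO_ACIDS}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_conservation(column, background=None):
    """Conservation score of a column; a uniform background is assumed if none is given."""
    if background is None:
        background = {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}
    return js_divergence(column_distribution(column), background)


def window_smooth(scores, window=3, lam=0.5):
    """Mix each score with the mean of its neighbors within `window` positions on each side."""
    smoothed = []
    for i, s in enumerate(scores):
        lo, hi = max(0, i - window), min(len(scores), i + window + 1)
        neighbors = [scores[j] for j in range(lo, hi) if j != i]
        if neighbors:
            smoothed.append(lam * s + (1 - lam) * sum(neighbors) / len(neighbors))
        else:
            smoothed.append(s)
    return smoothed


if __name__ == "__main__":
    columns = ["AAAAAAAA", "ACDEFGHI", "LLLLLIVL"]  # toy columns, not real data
    raw = [js_conservation(c) for c in columns]
    print(raw)
    print(window_smooth(raw, window=1))
```

A fully conserved column scores higher than a column of mixed residues because its distribution is farther from the background; the smoothing step then raises the score of a site that sits among conserved neighbors.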
Motivation: Determining protein function is one of the most important problems in the post-genomic era. For the typical proteome, there are no functional annotations for one-third or more of its proteins. Recent high-throughput experiments have determined proteome-scale protein physical interaction maps for several organisms. These physical interactions are complemented by an abundance of data about other types of functional relationships between proteins, including genetic interactions, knowledge about co-expression and shared evolutionary history. Taken together, these pairwise linkages can be used to build whole-proteome protein interaction maps. Results: We develop a network-flow based algorithm, FunctionalFlow, that exploits the underlying structure of protein interaction maps in order to predict protein function. In cross-validation testing on the yeast proteome, we show that FunctionalFlow has improved performance over previous methods in predicting the function of proteins with few (or no) annotated protein neighbors. By comparing several methods that use protein interaction maps to predict protein function, we demonstrate that FunctionalFlow performs well because it takes advantage of both network topology and some measure of locality. Finally, we show that performance can be improved substantially as we consider multiple data sources and use them to create weighted interaction networks.
Papers by Mona Singh