Papers by W. Brad Barbazuk

BioEssays, 2005
Plant genomes, and particularly cereal genomes, are challenging to sequence due to their large size and high repetitive DNA content. Gene-enrichment strategies are alternative or complementary approaches to complete genome sequencing that rapidly and inexpensively yield useful sequence data from large and complex genomes. The maize genome is large (2.7 Gbp) and contains large amounts of conserved repetitive elements. Furthermore, the high allelic diversity found between maize inbred lines may necessitate sequencing several inbred lines in order to recover the maize "gene pool". Two gene-enrichment approaches, methylation filtration (MF) and high-Cot (HC) sequencing, have been tested in maize and their ability to sample the gene space has been examined. Combined with other genomic sequencing strategies, gene-enriched genomic sequencing is a practical way to examine the maize gene pool, to order and orient the genic sequences on the genome, and to enable investigation of the gene content of other complex plant genomes.
Sequencing Genes and Gene Islands by Gene Enrichment
Springer eBooks, Jan 15, 2009
Access to the sequence of any gene in a genome greatly accelerates genetics research. Whole genome sequencing is a way to retrieve such information although, for large genomes such as that of maize, it represents a huge effort. Fortunately, a maize genome sequencing project is currently underway but, before this project started, the maize research community benefited from the development

SNP Discovery by Transcriptome Pyrosequencing
Methods in molecular biology, 2011
Single nucleotide polymorphisms (SNPs) are single base differences between haplotypes. SNPs are abundant in many species and valuable as markers for genetic map construction, modern molecular breeding programs, and quantitative genetic studies. SNPs are readily mined from genomic DNA or cDNA sequence obtained from individuals having two or more distinct genotypes. While automated Sanger sequencing has become less expensive over time, it is still costly to acquire deep Sanger sequence from several genotypes. "Next-generation" DNA sequencing technologies that utilize new chemistries and massively parallel approaches have enabled DNA sequences to be acquired at extremely high depths of coverage faster and for less cost than traditional sequencing. One such method is represented by the Roche/454 Life Sciences GS-FLX Titanium Series, which currently uses pyrosequencing to produce up to 400-600 million bases of DNA sequence/run (>1 million reads, ~400 bp/read). This chapter discusses the use of high-throughput pyrosequencing for SNP discovery by focusing on 454 sequencing of maize cDNA, the development of a computational pipeline for polymorphism detection, and the subsequent identification of over 7,000 putative SNPs between Mo17 and B73 maize. In addition, alternative alignment and polymorphism detection strategies that implement Illumina short reads, data processing and visualization tools, and reduced representation techniques that reduce the sequencing of repeat DNA, thus enabling efficient analysis of genome sequence, are discussed.
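As an illustration only, the core idea of mining SNPs from reads of two genotypes can be sketched in a few lines of Python. This is not the chapter's pipeline: the function name `call_snps`, the assumption that reads are already aligned into equal-length strings (with `-` marking uncovered positions), and the `min_depth` and `min_fraction` thresholds are all hypothetical choices made for this toy example.

```python
from collections import Counter


def call_snps(reads_a, reads_b, min_depth=3, min_fraction=0.9):
    """Report positions where two genotypes have different consensus bases.

    reads_a, reads_b: lists of equal-length, pre-aligned read strings,
    one list per genotype; '-' marks a position a read does not cover.
    Returns a list of (position, base_in_a, base_in_b) tuples.
    """
    length = len(reads_a[0])
    snps = []
    for pos in range(length):
        consensus = []
        for reads in (reads_a, reads_b):
            # Count only real base calls at this column.
            bases = Counter(r[pos] for r in reads if r[pos] in "ACGT")
            depth = sum(bases.values())
            if depth < min_depth:
                break  # too few reads to trust this genotype here
            base, count = bases.most_common(1)[0]
            if count / depth < min_fraction:
                break  # reads within the genotype disagree too much
            consensus.append(base)
        # A putative SNP needs a confident call in both genotypes.
        if len(consensus) == 2 and consensus[0] != consensus[1]:
            snps.append((pos, consensus[0], consensus[1]))
    return snps


# Toy input: the two read sets differ at position 2 (0-based).
b73_reads = ["ACGTA", "ACGTA", "ACGTA"]
mo17_reads = ["ACTTA", "ACTTA", "ACTTA", "AC-TA"]
putative_snps = call_snps(b73_reads, mo17_reads)
```

Requiring a minimum depth and a minimum within-genotype agreement is one simple way to separate sequencing errors from true polymorphisms; real pipelines typically use alignment quality and statistical models instead of fixed thresholds.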

Plant Journal, May 7, 2013
The large genome size of many species hinders the development and application of genomic tools to study them. For instance, loblolly pine (Pinus taeda L.), an ecologically and economically important conifer, has a large and as yet uncharacterized genome of 21.7 Gbp. To characterize the pine genome, we performed exome capture and sequencing of 14,729 genes derived from an assembly of expressed sequence tags. Efficiency of sequence capture was evaluated and shown to be similar across samples with increasing levels of complexity, including haploid cDNA, haploid genomic DNA and diploid genomic DNA. However, this efficiency was severely reduced for probes that overlapped multiple exons, presumably because intron sequences hindered probe:exon hybridizations. Such regions could not be entirely avoided during probe design, because of the lack of a reference sequence. To improve the throughput and reduce the cost of sequence capture, a method to multiplex the analysis of up to eight samples was developed. Sequence data showed that multiplexed capture was reproducible among 24 haploid samples, and can be applied for high-throughput analysis of targeted genes in large populations. Captured sequences were de novo assembled, resulting in 11,396 expanded and annotated gene models, significantly improving the knowledge about the pine gene space. Interspecific capture was also evaluated: over 98% of the probes designed from P. taeda that were efficient in sequence capture were also suitable for analysis of the related species Pinus elliottii Engelm.

The Plant Genome, Jul 1, 2013
Associations between arbuscular mycorrhizal (AM) fungi and plants are an ancient and widespread plant-microbe symbiosis. Most land plants can associate with this specialized group of soil fungi (in the Glomeromycota), which enhance plant nutrient uptake in return for C derived from plant photosynthesis. Elucidating the mechanisms involved in the symbiosis between obligate symbionts such as AM fungi and plant roots is challenging because AM fungal transcripts in roots are in low abundance and reference genomes for the fungi have not been available. A deep sequencing metatranscriptomics approach was applied to a wild-type tomato and a tomato mutant (Solanum lycopersicum L. cultivar RioGrande 76R) incapable of supporting a functional AM symbiosis, revealing novel AM fungal and microbial transcripts expressed in colonized roots. We confirm transcripts known to be mycorrhiza associated and report the discovery of more than 500 AM fungal and novel plant transcripts associated with mycorrhizal tomato roots including putative Zn, Fe, aquaporin, and carbohydrate transporters as well as mycorrhizal-associated alternative gene splicing. This analysis provides a fundamental step toward identifying the molecular mechanisms of mineral and carbohydrate exchange during the symbiosis. The utility of this metatranscriptomic approach to explore an obligate biotrophic interaction is illustrated, especially as it relates to agriculturally relevant biological processes.
Arbuscular mycorrhizal (AM) fungi are important root symbionts that associate with the majority of land plants including most agricultural species (Smith and Read, 2008). They are obligate mutualistic biotrophs that provide an additional (fungal) pathway of mineral nutrient (mainly inorganic P, N, S, and Zn) uptake from the soil (Allen and Shachar-Hill, 2009; Govindarajulu et al., 2005; Javot et al., 2007), enhance drought tolerance (Aroca et al., 2008), and increase pathogen protection (Liu et al., 2007). In return for soil-derived nutrients, the plant supplies C to the fungus in the form of photosynthesis-derived sugars (Pfeffer et al., 1999). Establishment

Gene, Apr 1, 1999
In the nematode Caenorhabditis elegans, the maternal effect lethal gene mel-32 encodes a serine hydroxymethyltransferase isoform. Since interspecies DNA comparison is a valuable tool for identifying sequences that have been conserved because of their functional importance or role in regulating gene activity, mel-32(SHMT) genomic DNA from C. elegans was used to screen a genomic library from the closely related nematode Caenorhabditis briggsae. The C. briggsae genomic clone identified fully rescues the Mel-32 phenotype in C. elegans, indicating functional and regulatory conservation. Computer analysis reveals that CbMEL-32(SHMT) is 92% identical (97% similar) to CeMEL-32(SHMT) at the amino acid level over the entire length of the protein (484 amino acids), whereas the coding DNA is 82.5% identical (over 1455 nucleotides). Several highly conserved noncoding regions upstream and downstream of the mel-32(SHMT) gene reveal potential regulatory sites that may bind trans-acting protein factors.

Applications in Plant Sciences, Jun 1, 2013
Primers were developed to amplify 12 intron-less, low-copy nuclear genes in the Hawaiian genus Clermontia (Campanulaceae), a suspected tetraploid. • Methods and Results: Data from a pooled 454 Titanium run of the partial transcriptomes of seven Clermontia species were used to identify the loci of interest. Most loci were amplified and sequenced directly with success in a representative selection of lobeliads even though several of these loci turned out to be duplicated. Levels of variation were comparable to those observed in commonly used plastid and ribosomal markers. • Conclusions: We found evidence of a genome duplication that likely predates the diversification of the Hawaiian lobeliads. Some genes nevertheless appear to be single-copy and should be useful for phylogenetic studies of Clermontia or the entire Lobelioideae subfamily.

Gene, Feb 1, 2019
Computational analyses play crucial roles in characterizing splicing isoforms in plant genomes. In this review, we provide a survey of computational tools used in recently published, genome-scale splicing analyses in plants. We summarize the commonly used software and pipelines for read mapping, isoform reconstruction, isoform quantification, and differential expression analysis. We also discuss methods for analyzing long reads and the strategies to combine long and short reads in identifying splicing isoforms. We review several tools for characterizing local splicing events, splicing graphs, and coding potential, and for visualizing splicing isoforms. We further discuss the procedures for identifying conserved splicing isoforms across plant species. Finally, we discuss the outlook for integrating other genomic data with splicing analyses to identify regulatory mechanisms of AS on a genome-wide scale.
Frontiers in Plant Science, May 10, 2017

G3: Genes, Genomes, Genetics, 2014
Loblolly pine (Pinus taeda L.) is an economically and ecologically important conifer for which a suite of genomic resources is being generated. Despite recent attempts to sequence the large genome of conifers, their assembly and the positioning of genes remain largely incomplete. The interspecific synteny in pines suggests that a gene-based map would be useful to support genome assemblies and analysis of conifers. To establish a reference gene-based genetic map, we performed exome sequencing of 14729 genes on a mapping population of 72 haploid samples, generating a resource of 7434 sequence variants segregating for 3787 genes. Most markers are single-nucleotide polymorphisms, although short insertions/deletions and multiple nucleotide polymorphisms also were used. Marker segregation in the population was used to generate a high-density, gene-based genetic map. A total of 2841 genes were mapped to pine's 12 linkage groups with an average of one marker every 0.58 cM. Capture data were used to detect gene presence/absence variations and position 65 genes on the map. We compared the marker order of genes previously mapped in loblolly pine and found high agreement. We estimated that 4123 genes had enough sequencing depth for reliable detection of markers, suggesting a high marker conversion rate of 92% (3787/4123). This is possible because a significant portion of the gene is captured and sequenced, increasing the chances of identifying a polymorphic site for characterization and mapping. This sub-centiMorgan genetic map provides a valuable resource for gene positioning on chromosomes and a guide for the assembly of a reference pine genome.
KEYWORDS: loblolly pine; exome sequencing; high-throughput genotyping; high-density genetic map; copy number variation
Loblolly pine (Pinus taeda L.) covers 11.7 million hectares of natural and planted forests in North America and provides 58% of timber in the United States and 16% of the world's timber (Wear and Greis 2002). Loblolly pine is also an important species for comparative studies between gymnosperms and angiosperms, and genomic resources are becoming increasingly available to enable these studies (Mackay et al. 2012). For instance, single-nucleotide polymorphisms (SNPs) and microsatellites have been identified and applied to generate genetic maps (Elsik and Williams 2001; Eckert et al. 2009; Echt et al. 2011), identify population genetic parameters and associations to phenotype (Gonzalez-Martinez et al. 2007; Eckert et al. 2010; Stewart et al. 2012), and develop genomic selection prediction models (Resende et al. 2012). However, the number of available genetic markers remains small, particularly considering the large size of the loblolly pine genome. Advances in high-throughput DNA sequencing and other genomic tools are making it possible to genotype large numbers of individuals by sequencing reduced representations of the genome (Davey et al. 2011). For example, targeted resequencing after in-solution sequence capture (Gnirke et al. 2009) has proven useful in variant detection. Using this method, probes complementary to the target regions of the genome are designed and hybridized to genomic DNA for sequence capture and subsequent sequencing. Sequence capture is being optimized for an increasing number of plant species (Saintenac et al. 2011; Bundock et al. 2012; Zhou and Holliday 2012) including conifers (Neves et al. 2013). To evaluate the potential of targeted resequencing

PLOS Genetics, Nov 20, 2009
Following the domestication of maize over the past ~10,000 years, breeders have exploited the extensive genetic diversity of this species to mold its phenotype to meet human needs. The extent of structural variation, including copy number variation (CNV) and presence/absence variation (PAV), which are thought to contribute to the extraordinary phenotypic diversity and plasticity of this important crop, has not been elucidated. Whole-genome, array-based, comparative genomic hybridization (CGH) revealed a level of structural diversity between the inbred lines B73 and Mo17 that is unprecedented among higher eukaryotes. A detailed analysis of altered segments of DNA conservatively estimates that there are several hundred CNV sequences among the two genotypes, as well as several thousand PAV sequences that are present in B73 but not Mo17. Haplotype-specific PAVs contain hundreds of single-copy, expressed genes that may contribute to heterosis and to the extraordinary phenotypic diversity of this important crop.
SNP Mining from Maize 454 EST Sequences
CSH Protocols, Jul 1, 2007
INTRODUCTION: In this protocol, 454 expressed sequence tags (ESTs) are generated by sequencing shoot apical meristem (SAM) cDNA from maize inbred lines on the 454 Life Sciences GS-20 sequencing system. The computational tool PolyBayes (Marth et al. 1999) is then used to identify single-nucleotide polymorphisms (SNPs). PolyBayes has been used successfully to identify SNPs in many different systems, including maize, and is particularly recommended for identifying SNPs in 454 sequences.
Genome Research, Aug 1, 2020

BMC Genomics, Aug 22, 2017
Background: The vast diversification of proteins in eukaryotic cells has been related to multiple transcript isoforms from a single gene that result from alternative splicing (AS) of primary transcripts. Analysis of RNA sequencing data from expressed sequence tags and next generation RNA sequencing has been crucial for AS identification and genome-wide AS studies. For the identification of AS events from the related legume species Phaseolus vulgaris and Glycine max, 157 and 88 publicly available RNA-seq libraries, respectively, were analyzed. Results: We identified 85,570 AS events from P. vulgaris in 72% of expressed genes and 134,316 AS events in 70% of expressed genes from G. max. These were categorized into seven AS event types, with intron retention being the most abundant, followed by alternative acceptor and alternative donor, together representing ~75% of all AS events in both plants. Conservation of AS events in homologous genes between the two species was analyzed, and an overrepresentation of AS affecting 5'UTR regions was observed for certain types of AS events. The conservation of AS events was experimentally validated for 8 selected genes through RT-PCR analysis. The different types of AS events also varied by relative position in the genes. The results were consistent in both species. Conclusions: The identification and analysis of AS events are first steps to understand their biological relevance. The results presented here from two related legume species reveal high conservation, over ~15-20 MY of divergence, and may point to the biological relevance of AS.

Plant Physiology, Oct 1, 2004
Maize (Zea mays) possesses a large, highly repetitive genome, and consequently a number of reduced-representation sequencing approaches have been used to try to enrich for gene space while eluding difficulties associated with repetitive DNA. This article documents the ability of publicly available maize expressed sequence tag and Genome Survey Sequences (GSSs; many of which were isolated through the use of reduced representation techniques) to recognize and provide coverage of 78 maize full-length cDNAs (FLCs). All 78 FLCs in the dataset were identified by at least three GSSs, indicating that the majority of maize genes have been identified by at least one currently available GSS. Both methyl-filtration and high-Cot enrichment methods provided a 7- to 8-fold increase in gene discovery rates as compared to random sequencing. The available maize GSSs aligned to 75% of the FLC nucleotides used to perform searches, while the expressed sequence tag sequences aligned to 73% of the nucleotides. Our data suggest that at least approximately 95% of maize genes have been tagged by at least one GSS. While the GSSs are very effective for gene identification, relatively few (18%) of the FLCs are completely represented by GSSs. Analysis of the overlap of coverage and bias due to position within a gene suggests that RescueMu, methyl-filtration, and high-Cot methods are at least partially nonredundant.

Plant Biotechnology Journal, Apr 7, 2010
Rice transcription factor RF2a binds to the BoxII cis element of the promoter of rice tungro bacilliform virus and activates promoter expression. The acidic acid-rich domain of RF2a is a transcription activator and has been partially characterized (Dai et al., 2003). The RF2a acidic domain (A; amino acids 49-116) was fused with the synthetic zinc finger ZF-TF 2C7 and was co-introduced with a reporter gene into transgenic Arabidopsis plants. Expression of the reporter gene was increased up to seven times by the effector. In transient assays in tobacco BY-2 protoplasts, we identified a subdomain comprising amino acids 56-84 (A5) that was as effective an activator as the entire acidic domain. A chemically inducible system was used to show that the A and A5 domains are as effective in transcription activation as the well-characterized VP16 activation domain. Bioinformatics analyses revealed that the A5 domain is present only in b-ZIP transcription factors. In dicots, the A domain contains an insertion of four amino acids that is not present in monocot proteins. The A5 domain, and similar domains in other b-ZIP transcription factors, is predicted to form an anti-parallel beta sheet structure.

BMC Bioinformatics, Feb 27, 2021
Background: microRNAs (miRNAs) have been shown to play essential roles in a wide range of biological processes. Many computational methods have been developed to identify targets of miRNAs. However, the majority of these methods depend on predefined features that require considerable effort and resources to compute and often prove suboptimal at predicting miRNA targets. Results: We developed a novel hybrid deep learning-based (DL-based) approach that is capable of predicting miRNA targets at a higher accuracy. This approach integrates convolutional neural networks (CNNs) that excel in learning spatial features and recurrent neural networks (RNNs) that discern sequential features. Therefore, our approach has the advantage of learning both the intrinsic spatial and sequential features of miRNA:target. The inputs for our approach are raw sequences of miRNAs and genes that can be obtained effortlessly. We applied our approach to two human datasets from recent miRNA target prediction studies and trained two models. We demonstrated that the two models consistently outperform the previous methods according to evaluation metrics on test datasets. Comparing our approach with currently available alternatives on independent datasets shows that our approach delivers substantial improvements in performance. We also show with multiple lines of evidence that our approach is more robust than other methods on small datasets. Our study is the first to perform comparisons across multiple existing DL-based approaches on miRNA target prediction. Furthermore, we examined the contribution of a Max pooling layer in between the CNN and RNN and demonstrated that it improves the performance of all our models. Finally, a unified model was developed that is robust in fitting different input datasets.
Conclusions: We present a new DL-based approach for predicting miRNA targets and demonstrate that our approach outperforms the current alternatives. We supplied an easy-to-use tool, miTAR, at https://github.com/tjgu/miTAR. Furthermore, our analysis results support that Max Pooling generally benefits the hybrid models and potentially prevents overfitting for hybrid models.

BMC Bioinformatics, Oct 10, 2005
Background: The degree to which conventional DNA sequencing techniques will be successful for highly repetitive genomes is unclear. Investigators are therefore considering various filtering methods to select against high-copy sequence in DNA clone libraries. The standard model for random sequencing, Lander-Waterman theory, does not account for two important issues in such libraries, discontinuities and position-based sampling biases (the so-called "edge effect"). We report an extension of the theory for analyzing such configurations. Results: The edge effect cannot be neglected in most cases. Specifically, rates of coverage and gap reduction are appreciably lower than those for conventional libraries, as predicted by standard theory. Performance decreases as read length increases relative to island size. Although opposite of what happens in a conventional library, this apparent paradox is readily explained in terms of the edge effect. The model agrees well with prototype gene-tagging experiments for Zea mays and Sorghum bicolor. Moreover, the associated density function suggests well-defined probabilistic milestones for the number of reads necessary to capture a given fraction of the gene space. An exception for applying standard theory arises if sequence redundancy is less than about 1-fold. Here, evolution of the random quantities is independent of library gaps and edge effects. This observation effectively validates the practice of using standard theory to estimate the genic enrichment of a library based on light shotgun sequencing. Conclusion: Coverage performance using a filtered library is significantly lower than that for an equivalent-sized conventional library, suggesting that directed methods may be more critical for the former. The proposed model should be useful for analyzing future projects.
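For orientation, the standard Lander-Waterman expectations that this abstract takes as its baseline can be computed as in the Python sketch below. It covers only the conventional-library case (uniform random reads on one contiguous target) and does not implement the filtered-library extension the paper reports. The function and parameter names are illustrative assumptions, not taken from the paper.

```python
import math


def redundancy(num_reads, read_length, target_length):
    """Sequence redundancy c = N * L / G (average depth of coverage)."""
    return num_reads * read_length / target_length


def expected_coverage(num_reads, read_length, target_length):
    """Expected fraction of the target covered by at least one read.

    Standard result for uniformly random reads: 1 - exp(-c).
    """
    c = redundancy(num_reads, read_length, target_length)
    return 1.0 - math.exp(-c)


def expected_islands(num_reads, read_length, target_length, min_overlap=0.0):
    """Expected number of islands (contigs), N * exp(-c * (1 - theta)).

    min_overlap is theta, the fraction of a read that must overlap
    another read for the overlap to be detected (0 <= theta < 1).
    """
    c = redundancy(num_reads, read_length, target_length)
    return num_reads * math.exp(-c * (1.0 - min_overlap))


# Example: 1,000 reads of 500 bp on a 500 kbp target gives c = 1.
coverage_at_1x = expected_coverage(1000, 500, 500_000)
islands_at_1x = expected_islands(1000, 500, 500_000)
```

According to the abstract, these standard formulas remain a reasonable approximation for filtered libraries only while redundancy stays below about 1-fold; beyond that, gaps and edge effects lower the achieved coverage.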

Nature Communications, Sep 13, 2022
Historically, xenia effects were hypothesized to be unique genetic contributions of pollen to seed phenotype, but most examples represent standard complementation of Mendelian traits. We identified the imprinted dosage-effect defective1 (ded1) locus in maize (Zea mays) as a paternal regulator of seed size and development. Hypomorphic alleles show a 5-10% seed weight reduction when ded1 is transmitted through the male, while homozygous mutants are defective with a 70-90% seed weight reduction. Ded1 encodes an R2R3-MYB transcription factor expressed specifically during early endosperm development with paternal allele bias. DED1 directly activates early endosperm genes and endosperm adjacent to scutellum cell layer genes, while directly repressing late grain-fill genes. These results demonstrate xenia as originally defined: Imprinting of Ded1 causes the paternal allele to set the pace of endosperm development thereby influencing grain set and size.