Molecular Plant Breeding

In Memoriam
Norman Ernest Borlaug
(25 March 1914 – 12 September 2009)

Norman Borlaug was one of the greatest men of our times, a steadfast champion and
spokesman against hunger and poverty. He dedicated his 95 richly lived years to filling the
bellies of others, and is credited by the United Nations World Food Program with saving
more lives than any other man in history.
An American plant pathologist who spent most of his years in Mexico, it was Dr
Borlaug's high-yielding dwarf wheat varieties that prevented widespread famine in South
Asia, specifically India and Pakistan, and also in Turkey. Known as the Green Revolution,
this feat earned him the Nobel Peace Prize in 1970. He was instrumental in establishing
the International Maize and Wheat Improvement Center, known by its Spanish acronym
CIMMYT, and later the Consultative Group on International Agricultural Research (CGIAR),
a network of 15 agricultural research centres.
Dr Borlaug spent time as a microbiologist with DuPont before moving to Mexico in
1944 as a geneticist and plant pathologist to develop stem rust resistant wheat cultivars. In
1966 he became the director of CIMMYT's Wheat Program, seconded from the Rockefeller
Foundation. His full-time employment with the Center ended in 1979, although he remained
a part-time consultant until his death. In 1984 he began a new career as a university
professor and went on to establish the World Food Prize, which honours the achievements of
individuals who have advanced human development by improving the quality, quantity or
availability of food in the world. In 1986, he joined forces with former US President Jimmy
Carter and the Nippon Foundation of Japan, under the chairmanship of Ryoichi Sasakawa,
to establish the Sasakawa Africa Association (SAA) to address Africa's food problems. Since
then, more than 1 million small-scale African farmers in 15 countries have been trained by
SAA in improved farming techniques.
Dr Borlaug influenced the thinking of thousands of agricultural scientists. He was a
pathbreaking wheat breeder and, equally important, his stature enabled him to influence
politicians and leaders around the world. His legacy and his work ethic, to get things done and
not mind getting your hands dirty, influenced us all and remain CIMMYT guiding principles.
We will honor Dr Borlaug's memory by carrying forward his mission and spirit of
innovation: applying agricultural science to help smallholder farmers produce more and
better-quality food using fewer resources. At stake is no less than the future of humanity, for, as
Borlaug said: 'The destiny of world civilization depends upon providing a decent standard
of living for all.' His presence will never really leave CIMMYT; it is embedded in our soul.

Thomas A. Lumpkin
Director General, CIMMYT
Marianne Bänziger
Deputy Director General for Research and Partnerships, CIMMYT
Hans-Joachim Braun
Director for Global Wheat Program, CIMMYT
Molecular Plant Breeding

Yunbi Xu

International Maize and Wheat Improvement Center (CIMMYT)

Apdo Postal 6-641
06600 Mexico, DF
Mexico
CABI is a trading name of CAB International

CABI Head Office
Nosworthy Way
Wallingford
Oxfordshire OX10 8DE
UK
Tel: +44 (0)1491 832111
Fax: +44 (0)1491 833508
E-mail: [email protected]

CABI North American Office
875 Massachusetts Avenue
7th Floor
Cambridge, MA 02139
USA
Tel: +1 617 395 4056
Fax: +1 617 354 6875
E-mail: [email protected]

Web site: www.cabi.org
© CAB International 2010. All rights reserved. No part of this publication
may be reproduced in any form or by any means, electronically,
mechanically, by photocopying, recording or otherwise, without the
prior permission of the copyright owners.
A catalogue record for this book is available from the British Library,
London, UK.
Library of Congress Cataloging-in-Publication Data
Xu, Yunbi.
Molecular plant breeding / Yunbi Xu.
p. cm.
ISBN 978-1-84593-392-0 (alk. paper)
1. Crop improvement. 2. Plant breeding. 3. Crops--Molecular genetics.
4. Crops--Genetics. I. Title.
SB106.147X8 2010
631.5233--dc22
20009033246

ISBN: 978 1 84593 392 0

Typeset by SPi, Pondicherry, India.

Printed and bound in the UK by MPG Books Group.
Contents

Preface
Foreword by Dr Norman E. Borlaug
Foreword by Dr Ronald L. Phillips

1 Introduction
1.1 Domestication of Crop Plants
1.2 Early Efforts at Plant Breeding
1.3 Major Developments in the History of Plant Breeding
1.4 Genetic Variation
1.5 Quantitative Traits: Variance, Heritability and Selection Index
1.6 The Green Revolution and the Challenges Ahead
1.7 Objectives of Plant Breeding
1.8 Molecular Breeding

2 Molecular Breeding Tools: Markers and Maps
2.1 Genetic Markers
2.2 Molecular Maps

3 Molecular Breeding Tools: Omics and Arrays
3.1 Molecular Techniques in Omics
3.2 Structural Genomics
3.3 Functional Genomics
3.4 Phenomics
3.5 Comparative Genomics
3.6 Array Technologies in Omics

4 Populations in Genetics and Breeding
4.1 Properties and Classification of Populations
4.2 Doubled Haploids (DHs)
4.3 Recombinant Inbred Lines (RILs)
4.4 Near-isogenic Lines (NILs)
4.5 Cross-population Comparison: Recombination Frequency and Selection

5 Plant Genetic Resources: Management, Evaluation and Enhancement
5.1 Genetic Erosion and Potential Vulnerability
5.2 The Concept of Germplasm
5.3 Collection/Acquisition
5.4 Maintenance, Rejuvenation and Multiplication
5.5 Evaluation
5.6 Germplasm Enhancement
5.7 Information Management
5.8 Future Prospects

6 Molecular Dissection of Complex Traits: Theory
6.1 Single Marker-based Approaches
6.2 Interval Mapping
6.3 Composite Interval Mapping
6.4 Multiple Interval Mapping
6.5 Multiple Populations/Crosses
6.6 Multiple QTL
6.7 Bayesian Mapping
6.8 Linkage Disequilibrium Mapping
6.9 Meta-analysis
6.10 In Silico Mapping
6.11 Sample Size, Power and Thresholds
6.12 Summary and Prospects

7 Molecular Dissection of Complex Traits: Practice
7.1 QTL Separating
7.2 QTL for Complicated Traits
7.3 QTL Mapping across Species
7.4 QTL across Genetic Backgrounds
7.5 QTL across Growth and Developmental Stages
7.6 Multiple Traits and Gene Expression
7.7 Selective Genotyping and Pooled DNA Analysis

8 Marker-assisted Selection: Theory
8.1 Components of Marker-assisted Selection
8.2 Marker-assisted Gene Introgression
8.3 Marker-assisted Gene Pyramiding
8.4 Selection for Quantitative Traits
8.5 Long-term Selection

9 Marker-assisted Selection: Practice
9.1 Selection Schemes for Marker-assisted Selection
9.2 Bottlenecks in Application of Marker-assisted Selection
9.3 Reducing Costs and Increasing Scale and Efficiency
9.4 Traits Most Suitable for MAS
9.5 Marker-assisted Gene Introgression
9.6 Marker-assisted Gene Pyramiding
9.7 Marker-assisted Hybrid Prediction
9.8 Opportunities and Challenges

10 Genotype-by-environment Interaction
10.1 Multi-environment Trials
10.2 Environmental Characterization
10.3 Stability of Genotype Performance
10.4 Molecular Dissection of GEI
10.5 Breeding for GEI
10.6 Future Perspectives

11 Isolation and Functional Analysis of Genes
11.1 In Silico Prediction
11.2 Comparative Approaches for Gene Isolation
11.3 Cloning Based on cDNA Sequencing
11.4 Positional Cloning
11.5 Identification of Genes by Mutagenesis
11.6 Other Approaches for Gene Isolation

12 Gene Transfer and Genetically Modified Plants
12.1 Plant Tissue Culture and Genetic Transformation
12.2 Transformation Approaches
12.3 Expression Vectors
12.4 Selectable Marker Genes
12.5 Transgene Integration, Expression and Localization
12.6 Transgene Stacking
12.7 Transgenic Crop Commercialization
12.8 Perspectives

13 Intellectual Property Rights and Plant Variety Protection
13.1 Intellectual Property and Plant Breeders' Rights
13.2 Plant Variety Protection: Needs and Impacts
13.3 International Agreements Affecting Plant Breeding
13.4 Plant Variety Protection Strategies
13.5 Intellectual Property Rights Affecting Molecular Breeding
13.6 Use of Molecular Techniques in Plant Variety Protection
13.7 Plant Variety Protection Practice
13.8 Future Perspectives

14 Breeding Informatics
14.1 Information-driven Plant Breeding
14.2 Information Collection
14.3 Information Integration
14.4 Information Retrieval and Mining
14.5 Information Management Systems
14.6 Plant Databases
14.7 Future Prospects for Breeding Informatics

15 Decision Support Tools
15.1 Germplasm and Breeding Population Management and Evaluation
15.2 Genetic Mapping and Marker–Trait Association Analysis
15.3 Marker-assisted Selection
15.4 Simulation and Modelling
15.5 Breeding by Design
15.6 Future Perspectives

References
Index
The colour plate section can be found following p. 270
Preface

The genomics revolution of the past decade has greatly enhanced our understanding of
the genetic composition of living organisms including many plant species of economic
importance. Complete genomic sequences of Arabidopsis and several major crops, together
with high-throughput technologies for analyses of transcripts, proteins and mutants,
provide the basis for understanding the relationship between genes, proteins and phenotypes.
Sequences and genes have been used to develop functional and biallelic markers, such as
single nucleotide polymorphism (SNP), that are powerful tools for genetic mapping,
germplasm evaluation and marker-assisted selection.
The road from basic genomics research to impacts on routine breeding programmes has
been long, winding and bumpy, not to mention scattered with wrong turns and unexpected
blockades. As a result, genomics can be applied to plant breeding only when an integrated
package becomes available that combines multiple components such as high-throughput
techniques, cost-effective protocols, global integration of genetic and environmental
factors and precise knowledge of quantitative trait inheritance. More recently, the end of the
tunnel has come in sight, and the multinational corporations have ramped up their
investments in and expectations from these technologies. The challenge now is to translate and
integrate the new knowledge from genomics and molecular biology into appropriate tools
and methodologies for public-sector plant breeding programmes, particularly those in
low-income countries. It is expected that harnessing the outputs of genomics research will be
an important component in successfully addressing the challenge of doubling world food
production by 2050.

What does Molecular Plant Breeding include?

The term 'molecular plant breeding' has been much used and abused in the literature, and
thus loved or maligned in equal measure by the readership. In the context of this book, the
term is used to provide a simple umbrella for the multidisciplinary field of modern plant
breeding that combines molecular tools and methodologies with conventional approaches
for improvement of crop plants. This book is intended to provide comprehensive coverage
of the components that should be integrated within plant breeding programmes to develop
crop products in a more efficient and targeted way.

The first chapter introduces some basic concepts that are required for understanding
fundamentally important issues described in subsequent chapters. The concepts include
crop domestication, critical events in the history of plant breeding, basics of quantitative
genetics (variance, heritability and selection index), plant breeding objectives and
molecular breeding goals. Chapters 2 and 3 introduce the key genomics tools that are used in
molecular breeding programmes, including molecular markers, maps, omics technologies
and arrays. Different types of molecular markers are compared and construction of
molecular maps is discussed. Chapter 4 describes common types of populations that have
been used in genetics and plant breeding, with a focus on recombinant inbred lines,
doubled haploids and near-isogenic lines. Chapter 5 provides an overview of marker-assisted
germplasm evaluation, management and enhancement. Chapters 6 and 7 discuss the theory
and practice, respectively, of using molecular markers to dissect complex traits and locate
quantitative trait loci (QTL). Chapters 8 and 9 cover the theory and practice, respectively,
of marker-assisted selection. Genotype-by-environment interaction (GEI) is discussed in
Chapter 10, including multi-environment trials, stability of genotype performance,
molecular dissection of GEI and breeding for optimum GEI. Chapter 11 provides a summary of
gene isolation and functional analysis approaches, including in silico prediction of genes,
comparative approaches for gene isolation, gene cloning based on cDNA sequencing,
positional cloning and identification of genes by mutagenesis. Chapter 12 describes the use of
isolated and characterized genes for gene transfer and the generation of genetically
modified plants, focusing on the vital elements of expression vectors, selectable marker genes,
transgene integration, expression and localization, transgene stacking and transgenic crop
commercialization. Chapter 13 is devoted to intellectual property rights and plant
variety protection, including plant breeders' rights, international agreements affecting plant
breeding, plant variety protection strategies, intellectual property rights affecting
molecular breeding and the use of molecular techniques in plant variety protection. The last
two chapters (14 and 15) discuss supporting tools that are required in molecular breeding
for information management and decision making, including data collection, integration,
retrieval and mining and information management systems. Decision support tools are
described for germplasm and breeding population management and evaluation, genetic
mapping and marker-trait association analysis, marker-assisted selection, simulation and
modelling, and breeding by design.

Intended audience and guidance for reading and using this book

This book is intended to provide a handbook for biologists, geneticists and breeders, as well
as a textbook for final year undergraduates and graduate students specializing in agronomy,
genetics, genomics and plant breeding. Although the book has attempted to cover all
relevant areas of molecular breeding in plants, many examples have been drawn from the
genomics research and molecular breeding of major cereal crops. It is hoped that the book
can also serve as a resource for training courses as described below. As each chapter covers
a complete story on a special topic, readers can choose to read chapters in any order.
Advanced Course on Quantitative Genetics: Chapters 1, 2, 4, 6, 7, 10 and 14, which
cover all molecular marker-based QTL mapping, including markers, maps, populations,
statistics and genotype-by-environment interaction.
Comprehensive Course on Marker-assisted Plant Breeding: Chapters 1, 2, 3, 4, 5, 8, 9,
10, 13, 14 and 15, which cover basic theories, tools, methodologies about markers, maps,
omics, arrays, informatics and support tools for marker-assisted selection.
Short Course on Genetic Transformation: Chapters 1, 11, 12 and 13, which provide
a brief introduction to gene isolation, transformation techniques, genetic-transformation-
related intellectual property and genetically modified organism (GMO) issues.
Introductory Course on Breeding Informatics: Chapters 1, 2, 3, 4, 5, 10, 14 and 15,
which cover bioinformatics, focusing on plant breeding-related applications, including
basic concepts in plant breeding, markers, maps, omics, arrays, population and germplasm
management, environment and geographic information system (GIS) information, data
collection, integration and mining, and bioinformatics tools required to support molecular
breeding. Additional introductory information can be found in other chapters.

History of writing this book

This book has been almost a decade in preparation. In fact, the initial idea for the book
was stimulated by the impact from my previous book Molecular Quantitative Genetics
published by China Agriculture Press (Xu and Zhu, 1994), which was well received by
colleagues and students in China and used as a textbook in many universities. Preliminary
ideas related to the book were developed in a review article on QTL separation, pyramiding
and cloning in Plant Breeding Reviews (Xu, 1997). Much of the hopeful thinking described
in this paper has fortunately come true during the following 10 years, and the
manipulation of QTL has been revolutionized and become mainstream. Complete sequences for
several plant genomes have become available, with more anticipated. Numerous genes
and QTL have been separated and cloned individually, and some of them
have been pyramided for plant breeding through genetic transformation or marker-assisted
selection.
I started making tangible progress on this book while working as a molecular breeder
for hybrid rice at RiceTec, Inc., Texas (1998–2003). This experience shaped my thinking
about how an applied breeding programme could be integrated with molecular approaches.
With numerous QTL accumulating for a model crop, taking all the QTL into consideration
becomes necessary. Initial thoughts on this were described in 'Global view of QTL . . .',
published in the proceedings on quantitative genetics and plant breeding, which considered
various genetic background effects and genotype-by-environment interaction (Xu, Y., 2002).
Hybrid rice breeding, which involves a three-line system, requires a large number of
testcrosses in order to identify traits that perform well in seed and grain production. My
experience in development of marker-assisted selection strategies for breeding hybrid rice was
then summarized in a review article in Plant Breeding Reviews (Xu, Y., 2003), which also
covered general strategies for other crops using hybrids.
Moving on to research at Cornell University with Dr Susan McCouch helped me to better
understand how molecular techniques could facilitate breeding of complex traits such
as water-use efficiency, which is a difficult trait to measure and requires strong
collaboration among researchers across many disciplines. In addition, this experience with rice as a
model crop raised the issue of how we can use rice as a reference genome for improvement
of other crops, which was discussed in an article published in a special rice issue of Plant
Molecular Biology (Xu et al., 2005).
With over 20 years' experience in rice, I decided to shift to another major crop by working
for the International Maize and Wheat Improvement Center (CIMMYT) as the principal
maize molecular breeder. CIMMYT has given me exposure to an interface connecting basic
research with applied breeding for developing countries and the resource-poor. Comparing
public- and private-sector breeding programmes has given me an intense understanding
of the importance of making the type of breeding systems that have been working well
for the private sector a practical reality for the public sector, particularly in developing
countries. This has been addressed in a recent review paper published in Crop Science
(Xu and Crouch, 2008), which discussed the critical issues for achieving this translation.
My most recent research has focused on the development of various molecular breeding
platforms that can be used to facilitate breeding procedures through seed DNA-based
genotyping, selective and pooled DNA analysis, and chip-based large-scale germplasm
evaluation, marker–trait association and marker-assisted selection (see Xu et al., 2009b for further
details). Thus, my career has evolved alongside the transition from molecular biology
research to routine molecular plant breeding applications and I strongly believe that now
is the right time for a mainstream publication providing comprehensive coverage of all
fields relevant for a new generation of molecular breeders.

Acknowledgements

Assistance and professional support

The dream of writing this book could not have become reality without the wonderful support
of Dr Susan McCouch at Cornell University and Dr Jinhua Xiao, now at Monsanto, who
have both fully supported my proposal since 2002. Their support and consistent encouragement
have greatly motivated me throughout the process. While I was working with Susan,
she allowed me great flexibility in my research projects and working hours so that I
could continue to make progress on the writing of this book. At the same time the Cornell
libraries were an indispensable source of the major references cited throughout the book.
Susan's encouragement provided the impetus to keep working on the book through a very
difficult time in my life. I also extend my appreciation to Dr Jonathan Crouch, the Director
of the Germplasm Resources Program at CIMMYT, whose full understanding
and support allowed me to complete the second half of the book. Jonathan's guidance and
contribution to my research projects and publications while at CIMMYT have significantly
influenced the preparation of the book.
I would also like to thank the chief editors of the three journals for which I have served
on the editorial boards during the preparation of this book: Dr Paul Christou for Molecular
Breeding, Dr Albrecht Melchinger for Theoretical and Applied Genetics, and Dr Hongbin
Zhang for International Journal of Plant Genomics. I thank them for their patience, support
and flexibility with my editorial responsibilities during the preparation of the book. In
addition, Drs Christou and Melchinger also reviewed several chapters in their respective fields.
My appreciation also goes to Yanli Lu (a graduate student from Sichuan Agricultural
University of China) and Dr Zhuanfang Hao (a visiting scientist from the Chinese Academy
of Agricultural Sciences) who helped prepare some figures and tables during their work
in my lab at CIMMYT, Mexico. I would like to give special thanks to Dr Rodomiro Ortiz
at CIMMYT for his consistent information sharing and stimulating discussions during our
years together at CIMMYT. Finally, I would like to thank my colleagues at CIMMYT,
particularly Drs Kevin Pixley, Manilal William, Jose Crossa and Guy Davenport, who provided
useful discussions on various molecular breeding-related issues.

Forewords

I am greatly indebted to Dr Norman E. Borlaug, visionary plant breeder and Nobel
laureate for his role in the Green Revolution, and Dr Ronald L. Phillips, Regents Professor and
McKnight Presidential Chair in Genomics, University of Minnesota, who each contributed
a foreword for the book. Their contributions emphasized the importance of molecular
breeding in crop improvement and the role that this book will play in molecular breeding
education and practice.

Reviewers

Each chapter of the book has undergone comprehensive peer review and revision before
finalization. The constructive comments and critical advice of these reviewers have greatly
improved this book. The reviewers were selected for their active expertise in the field of the
respective chapter. Reviewers come from almost all continents and work in various fields
including plant breeding, quantitative genetics, genetic transformation, intellectual property
protection, bioinformatics and molecular biology; many of them are CIMMYT scientists
and managers. Considering that each chapter is relatively large in content, reviewers
had to contribute a lot of time and effort to complete their reviews. Although these inputs
were indispensable, any remaining errors are my sole responsibility. The names and
affiliations of the reviewers (alphabetically) are:

Raman Babu (Chapters 7 and 9), CIMMYT, Mexico
Paul Christou (Chapter 12), Lleida, Spain
Jose Crossa (Chapter 10), CIMMYT, Mexico
Jonathan H. Crouch (Chapters 13 and 15), CIMMYT, Mexico
Jedidah Danson (Chapters 7 and 9), African Center for Crop Improvement, South Africa
Guy Davenport (Chapter 14), CIMMYT, Mexico
Yuqing He (Chapter 8), Huazhong Agricultural University, Wuhan, China
Gurdev S. Khush (Chapter 1), IRRI, Philippines
Alan F. Krivanek (Chapter 4), Monsanto, Illinois, USA
Huihui Li (Chapter 6), Chinese Academy of Agricultural Sciences, China
George H. Liang (Chapter 12), San Diego, California, USA
Christopher Graham McLaren (Chapter 14), GCP/CIMMYT, Mexico
Kenneth L. McNally (Chapter 5), IRRI, Philippines
Albrecht E. Melchinger (Chapter 8), University of Hohenheim, Germany
Rodomiro Ortiz (Chapters 12, 13 and 15), CIMMYT, Mexico
Edie Paul (Chapter 14), GeneFlow, Inc., Virginia, USA
Kevin V. Pixley (Chapters 1, 4 and 5), CIMMYT, Mexico
Trushar Shah (Chapter 14), CIMMYT, Mexico
Daniel Z. Skinner (Chapter 12), Washington State University, USA
Debra Skinner (Chapter 11), University of Illinois, USA
Michael J. Thomson (Chapters 2 and 3), IRRI, Philippines
Bruce Walsh (Chapters 1, 6 and 8), University of Arizona, USA
Marilyn L. Warburton (Chapter 5), USDA/Mississippi State University, USA
Huixia Wu (Chapter 12), CIMMYT, Mexico
Rongling Wu (Chapter 1), University of Florida, Gainesville, USA
Weikai Yan (Chapter 10), Agriculture and Agri-Food Canada, Ottawa, Canada
Qifa Zhang (Chapters 8 and 12), Huazhong Agricultural University, Wuhan, China
Wanggen Zhang (Chapter 12), Syngenta, Beijing, China
Yuhua Zhang (Chapter 12), Rothamsted Research, UK

Publishers and development editors

Several editors at CABI have been working with me over the years: Tim Hardwick
(2002–2006), Sarah Hulbert (2006–2007), Stefanie Gehrig (2007–2008), Claire Parfitt (2008–2009),
Meredith Caroll (2009) and Tracy Head (2009). These editors and their associates have
done a superb job of converting a series of manuscripts into a useable and coherent book.
I thank them for their effort, consideration and cooperation.

Research grants

During the preparation of the book, my research on genomic analysis of plant water-use
efficiency at Cornell University was supported by the National Science Foundation (Plant
Genome Research Project Grant DBI-0110069). My molecular breeding research at CIMMYT
has been supported by the Rockefeller Foundation, the Generation Challenge Programme
(GCP), Bill and Melinda Gates Foundation and the European Community, and through
other attributed or unrestricted funds provided by the members of the Consultative Group
on International Agricultural Research (CGIAR) and national governments of the USA,
Japan and the UK.

Family

It is difficult to imagine writing a book without the full support and understanding of one's
family. My greatest thanks go to my wife, Yu Wang, who has given me her wholehearted and
unwavering support, and to my sons, Sheng, Benjamin and Lawrence, who have retained
great patience during this long adventure. And finally to my parents, for their love,
encouragement and vision that unveiled in me from my earliest years the desire to thrive on the
challenge of always striving to reach the highest mountain in everything I do.
Foreword
DR NORMAN E. BORLAUG

The past 50 years have been the most productive period in world agricultural history.
Innovations in agricultural science and technology enabled the Green Revolution, which
is reputed to have spared one billion people the pains of hunger and even starvation.
Although we have seen the greatest reductions in hunger in history, it has not been enough.
There are still one billion people who suffer chronic hunger, with more than half being
small-scale farmers who cultivate environmentally sensitive marginal lands in developing
countries.
Within the next 50 years, the world population is likely to increase by 60–80%,
requiring global food production to nearly double. We will have to achieve this feat on a shrinking
agricultural land base, and most of the increased production must occur in those countries
that will consume it. Unless global grain supplies are expanded at an accelerated rate, food
prices will remain high, or be driven up even further.
Spectacular economic growth in many newly industrializing developing countries,
especially in Asia, has spurred rapid growth in global cereal demand, as more people eat
better, especially through more protein-heavy diets. More recently, the subsidized
conversion of grains into biofuels in the USA and Europe has accelerated demand even faster.
On the supply side, a slowing in research investment in the developing world and more
frequent climatic shocks (droughts, floods) have led to greater volatility in production.
Higher food prices affect everyone, but especially the poor, who spend most of their
disposable income on food. Increasing supply, primarily through the generation and
diffusion of productivity-enhancing new technologies, is the best way to bring food prices down
and secure minimum nutritional standards for the poor.
Today's agricultural development challenges are centred on marginal lands and in
regions that have been bypassed during the Green Revolution, such as Africa and
resource-poor parts of Asia, and are experiencing the ripple effects of food insecurity through
hunger, malnutrition and poverty.
Despite these serious and daunting challenges, there is cause for hope. New science
and technology, including biotechnology, have the potential to help the world's poor and
food insecure. Biotechnologies have developed invaluable new scientific methodologies
and products for more productive agriculture and added-value food. This journey deeper
into the genome to the molecular level is the consequence of our progressive understanding
of the workings of nature. Genomics-based methods have given breeders greater precision
in selecting and transferring genes, which has not only reduced the time needed to
eliminate undesirable genes, but has also allowed breeders to access useful genes from
distant species.
Bringing the power of science and technology to bear on the challenges of these riskier
environments is one of the great challenges of the 21st century. With the new tools of
biotechnology, we are poised for another explosion in agricultural innovation. New science
has the power to increase yields, address agroclimatic extremes and mitigate a range of
environmental and biological challenges.
Molecular Plant Breeding, authored by my CIMMYT colleague Yunbi Xu, is an
outstanding review and synthesis of the theory and practice of genetics and genomics that
can drive progress in modern plant breeding. Dr Xu has done a masterful job in integrating
information about traditional and molecular plant breeding approaches. This
encyclopedic handbook is poised to become a standard reference for experienced breeders and
students alike. I commend him for this prodigious new contribution to the body of scientific
literature.
Foreword
DR RONALD L. PHILLIPS

The New Plant Breeding Roadmap

The road is long from basic research findings to final destinations reflecting important
applications but it is a road that can ultimately save time and money. There may be
obstacles along the way that delay building that road but they are generally overcome by careful
thought and timely considerations. A new road may involve the former road but with some
widening and the filling in of certain potholes. We seldom look back and think that the
improvements were not useful.
The road to improved varieties by traditional plant breeding has and continues to serve
society well. That approach has been based on careful observation, evaluation of multi-
ple genotypes (parents and progenies), selection at various generational levels, extensive
testing and the sophisticated utilization of statistical analyses and quantitative genetics.
About 50% of the increased productivity of new varieties is generally attributed to genetic
improvements, with the remaining 50% due to many other factors such as time of planting,
irrigation, fertilizer, pesticide applications and planting densities.
The statistical genetics associated with traditional plant breeding can now be supple-
mented by extensive genomic information, gene sequences, regulatory factors and linked
genetic markers. We can now draw on a broader genetic base, the identification of major
loci controlling various traits and expression analyses across the entire genome under vari-
ous biotic and abiotic conditions. One can anticipate a future when the networking of
genes, genotype-by-environment (G × E) interactions, and even hybrid vigour will be better
understood and lead to new breeding approaches. The importance of de novo variation may
modify much of our current interpretation of breeding behaviour; de novo variation such as
mutation, intragenic recombination, methylation, transposable elements, unequal crossing
over, generation of genomic changes due to recombination among dispersed repeated ele-
ments, gene amplification and other mechanisms will need to be incorporated into plant
breeding theory.
This book calls for an integration of approaches, traditional and molecular, and
represents a theoretical/practical handbook reflecting modern plant breeding at its finest.
I believe the reader will be surprised to find that this single-authored book is so
full of information that is useful in plant genetics and plant breeding. Students as well
as established researchers wanting to learn more about molecular plant breeding will be
well served by reading this book. The information is up to date, with many current references.
Even many of the tables are packed with information and references. A good representation
of international and domestic breeding is reflected through many examples.
The importance of G × E interactions is clearly demonstrated. Various statistical models
are provided as appropriate. The importance of defining mega-environments for varietal
development is made clear. The role of core germplasm collections, appropriate population
sizes, major databases and data management issues are all integrated with various plant
breeding approaches. Marker-assisted selection receives considerable attention, includ-
ing its requirements and advantages, along with the multitude of quantitative trait locus
(QTL) analysis methods. Transformation technologies leading to the extensive use of trans-
genic crops are reviewed along with the increased use of trait stacking. The procurement of
intellectual property that, in part, is driving the application of molecular genetics in plant
breeding provides the reader with an understanding of why private industry is now more
involved and why some common crops represent new business opportunities.
Molecular Plant Breeding is not like other plant breeding books. The interconnecting
road that it depicts is one where you can look at the beautiful new scenery and appreciate
the current view, yet see the horizon down the road.
1 Introduction

Several definitions of plant breeding have been put forward, such as 'the art and science of improving the heredity of plants for the benefit of humankind' (J.M. Poehlman), or 'evolution directed by the will of man' (N.I. Vavilov). Bernardo (2002), however, offers the most universal description: 'Plant breeding is the science, art, and business of improving plants for human benefit.'

Plants are employed in the manufacture of a multitude of products for domestic (cosmetics, medicines and clothing), industrial (manufacture of rubber, cork and engine fuel) and recreational uses (paper, art supplies, sports equipment and musical instruments), and plant breeders have therefore been driven by the challenges of meeting the ever-increasing demands of the manufacturers of these products. Lewington has described the diverse uses of plants in his book Plants for People (2003).

Plant breeding began by the domestication of crop plants and has become ever more sophisticated. New developments in molecular biology have now led to an increasing number of methods which can be used to enhance breeding effectiveness and efficiency. This chapter includes a brief history of plant breeding together with breeding objectives and some background information relevant to the theories and technologies discussed in the following chapters of this book.

1.1 Domestication of Crop Plants

The earliest records indicate that agriculture developed some 11,000 years ago in the so-called Fertile Crescent, a hilly region in south-western Asia. Agriculture developed later in other regions. Archaeologists suggest that plant domestication began because of the increasing size of populations and changes in the exploitation of local resources (see http://www.ngdc.noaa.gov/paleo/ctl/10k.html for further details). Domestication is a selection process carried out by man to adapt plants and animals to their own needs, whether as farmers or consumers. Successive selection of desirable plants changed the genetic composition of early crops. Primitive farmers, knowing little or nothing about genetics or plant breeding, accomplished much in a short time. They did so by unconsciously altering the natural process of evolution. Indeed, domestication is nothing more than directed evolution; as a result, the process of evolution is accelerated. The key to domestication is the selective advantage of rare mutant alleles, which are desirable for successful cultivation,

© Yunbi Xu 2010. Molecular Plant Breeding (Yunbi Xu)


but unnecessary for survival in the wild. The process of selection continues until the desired mutant phenotype dominates the population. There are three important steps in the domestication process. Man not only planted seeds, but also: (i) moved seeds from their native habitat and planted them in areas to which they were perhaps not as well adapted; (ii) removed certain natural selection pressures by growing the plants in a cultivated field; and (iii) applied artificial selection pressures by choosing characteristics that would not necessarily have been beneficial for the plants under natural conditions. Cultivation also creates selection pressure, resulting in changes in allele frequency, gradations within and between species, fixation of major genes, and improvement of quantitative traits. By the end of the 18th century, the informal processes of selection practised by farmers everywhere had led to the worldwide creation of thousands of different cultivars or landraces for each major crop species.

More than 1000 species of plants have been domesticated at one time or another, of which about 100–200 are now major components of the human diet. The 15 most important examples can be divided into the following four groups:

1. Cereals: rice, wheat, maize, sorghum, barley.
2. Roots and stems: sugarbeet, sugarcane, potato, yam, cassava.
3. Legumes: bean, soybean, groundnut.
4. Fruits: coconut, banana.

Certain characteristics may have been selected deliberately or unwittingly. When farmers set aside a portion of their harvest for planting in the next season, they were selecting seeds with specific characteristics. This selection has resulted in profound differences between crop plants and their progenitors. For example, many wild plants have a seed dispersal mechanism that ensures that seeds will be separated from the plants and distributed over as large an area as possible, while modern crops have been modified by selection against seed dispersal. The absence of seed dormancy mechanisms in some domesticated plants is another example. For further information see http://oregonstate.edu/instruct/css/330/index.htm and Swaminathan (2006).

It is generally believed that domestication of crop plants was undertaken in several regions of the world independently. The Russian geneticist and plant geographer N.I. Vavilov collected plants from all over the world and identified regions where crop species and their wild relatives showed great genetic diversity. In 1926 he published Studies on the origin of cultivated plants, in which he described his theories regarding the origins of crops. Vavilov concluded that each crop had a characteristic primary centre of diversity which was also its centre of origin. He identified eight areas and hypothesized that these were the centres from which all our modern major crops originated. Later, he modified his theory to include secondary centres of diversity for some crops. These centres of origin included China, India, Central Asia, the Near East, the Mediterranean, East Africa, Mesoamerica, and South America. From these foci, agriculture was progressively disseminated to other regions such as Europe and North America. Subsequently, others, including the American geographer Jack Harlan, challenged Vavilov's hypothesis because many cultivated plants did not fit Vavilov's pattern, and appeared to have been domesticated over a broad geographical area for a long period of time.

In recent years, variation in DNA fractions and other approaches have been used to study the diversity of crop species. In general, these studies have not confirmed Vavilov's theory that the centres of origin are the areas of greatest diversity, because while centres of diversity have been identified, these are often not the centres of origin. For some crops there is little connection between the source of their wild ancestors, areas of domestication, and the areas of evolutionary diversification. Species may have originated in one geographic area but been domesticated in a different region, and some crops do not appear to have centres of diversity; thus a continuum of evolutionary activity is perceived rather than discrete centres.

In 1971, Jack Harlan described his own views on the origins of agriculture. He proposed three independent systems, each with a centre and a noncentre (larger, diffuse areas where domestication is thought to have occurred): Near East + Africa, China + South-east Asia, and Mesoamerica + South America.

Evidence gathered since that time suggests that these centres are also more diffuse than he had envisioned. After the initial phases of evolution, species spread out over large, ill-defined areas. This is probably due to the dispersal and evolution of crops associated with itinerant populations. Regional and/or multiple areas of origin may prove to be more accurate than the hypothesis of a unique, localized origin for many crops. However, the probable geographic origin of many crops is listed in Table 1.1.

1.2 Early Efforts at Plant Breeding

For thousands of years selective breeding has been employed to re-engineer plants to produce traits or qualities that were considered to be desirable to consumers. Selective breeding began with the early farmers, ranchers and vintners who selected the best plants to provide seed for their next crop. When they found particular plants that fared well even in bad weather, were especially prolific, or resisted disease that had destroyed neighbouring crops, they naturally tried to capture these desirable traits by crossbreeding them into other plants. In this way, they selected and bred plants to improve their crop for commercial purposes. Although unbeknown to them, farmers have been utilizing genetics for centuries to modify the food we eat by selecting and growing seeds which produce a healthier crop that has a better flavour, richer colour and stronger resistance to certain plant diseases.

Modern plant breeding started with sedentary agriculture and the domestication of the first agricultural plants, cereals. This led to the rapid elimination of undesirable characters such as seed-shattering and dormancy, and we can only speculate on how much foresight or what kind of planning based on experience was used by the first selectors of non-shattering wheat and rice, compact-headed sorghum, or soft-shelled gourds. For 10,000 years man has consciously been moulding the phenotype (and so the genotype) of hundreds of plant species as one of the many routine activities in the normal course of making a living (Harlan, 1992). Over long periods of time there was a transition from the collection of

Table 1.1. Probable geographic origins for crops.

Region                          Crops
Near East (Fertile Crescent)    Wheat and barley, flax, lentils, chickpea, figs, dates, grapes, olives, lettuce, onions, cabbage, carrots, cucumbers, melons; fruits and nuts
Africa                          Pearl millet, Guinea millet, African rice, sorghum, cowpea, groundnut, yam, oil palm, watermelon, okra
China                           Japanese millet, rice, buckwheat, soybean
South-east Asia                 Wet- and dryland rice, pigeon pea, mung bean, citrus fruits, coconut, taro, yams, banana, breadfruit, sugarcane
Mesoamerica and North America   Maize, squash, common bean, lima bean, peppers, amaranth, sweet potato, sunflower
South America                   Lowlands: cassava; Mid-altitudes and uplands (Peru): potato, groundnut, cotton, maize

See http://agronomy.ucdavis.edu/gepts/pb143/lec10/pb143l10.htm for a thorough presentation on the geographic origins of crops.

wild plants for food to the selection of those to be cultivated, which began to guide the evolutionary process. Now plant breeders accelerate the evolution of major crop species through skilful manipulation of breeding procedures. High-input agriculture emerged as a result of voyages of discovery and modern science.

Many traits important to early agriculturists were heritable and, therefore, could be reliably selected. However, this phase of breeding was empirical and generally not considered scientific in the modern sense because changes in these plant and animal populations were not analysed in an attempt to explain biological phenomena. At this stage of agriculture, the focus was on the practical goal of producing food rather than finding rational explanations for nature (Harlan, 1992). Ideas about heredity during the period when many early crops were domesticated ranged from mythological interpretations to near-scientific notions of trait transmission. In his Presidential Address to the American Society for Horticultural Science in 1987, Janick (1988) stated:

    The origin of new information in horticulture derives from two traditions: empirical and experimental. The roots of empiricism stem from efforts of prehistoric farmers, Hellenic root diggers, medieval peasants, and gardeners everywhere to obtain practical solutions to problems of plant growing. The accumulated successes and improvements passed orally from parent to child, from artisan to apprentice, have become embodied in human consciousness via legend, craft secrets, and folk wisdom. This information is now stored in tales, almanacs, herbals, and histories and has become part of our common culture. More than practices and skills were involved as improved germplasm was selected and preserved via seed and graft from harvest to harvest and generation to generation. The sum total of these technologies makes up the traditional lore of horticulture. It represents a monumental achievement of our forbears, unknown and unsung.

Large-scale breeding activities began very early in Europe, often under the auspices of commercial seed production enterprises. Besides selecting plants with useful characteristics, breeders also 'arrange marriages' between plants with different traits in the hope of producing fertile offspring carrying both traits. The use of artificial crosses in pre-Mendelian breeding is exemplified by the case of Fragaria ananassa, developed in the botanical garden of Paris by Duchesne in the 18th century by crossing Fragaria chiloensis with Fragaria virginiana. In England, at about the same time, new cultivars of fruits, wheat and peas were being obtained by artificial hybridization (Sánchez-Monge, 1993).

Hybridization combined with selection was adopted by Patrick Shirreff in 1819 in wheat and rice, where the new selections were grown along with cultivars for comparative purposes. He considered introduction and hybridization to be the important sources of new cultivars and stressed crossing of carefully selected parents to meet the aims of new cultivars. Although the essential elements of plant breeding were known by this time, there was still a lack of knowledge regarding the scientific basis of variation among plants. For example, the first generation of crossed materials was mistakenly expected to inevitably produce new cultivars but instead took several generations to stabilize. Many historical examples of successful plant breeding can be found in the literature, although there were still many important discoveries to be made before it could be called a technology (Chahal and Gosal, 2002).

1.3 Major Developments in the History of Plant Breeding

Plant breeders of today use various methods to accelerate the evolutionary process in order to increase the usefulness of plants by exploiting genetic differences within a species. This has been made possible by the determination of the genetic basis for developing crop breeding procedures and this in turn has a long history.

1.3.1 Breeding and hybridization

The role of reproduction in plants was first reported in 1694 by Camerarius, who noticed the difference between male and female reproductive organs in maize and produced the first artificial hybrid plant. He established that seed could not be produced without the participation of pollen produced in male reproductive organs of plants. The first hybridization experiment was carried out on wheat by Fairchild in 1719 and the current technique of hybridization is largely based on the work of Kölreuter (1733–1806), a German researcher who carried out his experiments in the 1760s. Hybridization freed the breeder from the severe constraints of working within a limited population, enabled him to bring together useful traits from two or more sources, and allowed specific genes to be introduced.

By understanding the reproductive capacities of plants, plant breeders can manipulate these crosses to produce fertile offspring which carry traits from both parents. Crossing has been very valuable to plant breeders, because it allows some measure of control over the phenotype of a plant. Nearly all modern plant breeding involves some use of hybridization.

1.3.2 Mendelian genetics

It was Gregor Johann Mendel, a Moravian monk, who in 1865 discovered the basic rules that govern heredity as a result of a series of experiments in which he crossed two cultivars of pea plants. By studying the inheritance of all-or-none variation in peas, Mendel discovered that inherited traits are determined by units of material that are transferred from one generation to another. Mendel was probably ahead of his time, as other biologists of that era took 35 years to appreciate his work, and plant breeding remained deprived of the deliberate application of the laws of genetics until 1900, when Hugo de Vries, Carl Correns and Erich von Tschermak-Seysenegg rediscovered Mendel's work.

1.3.3 Selection

In 1859 Darwin proposed in The Origin of Species that natural selection is the mechanism of evolution. Darwin's thesis was that the adaptation of populations to their environments resulted from natural selection and that if this process continued for long enough, it would ultimately lead to the origin of new species. Darwin's Theory of Evolution through Natural Selection hypothesized that plants change gradually by natural selection operating on variable populations and was the outstanding discovery of the 19th century with direct relevance to plant breeding.

1.3.4 Breeding types and polyploidy

Other historical developments in plant breeding include pedigree breeding, backcross breeding (Harlan and Pope, 1922) and mutation breeding (Stadler, 1928). Natural and artificial polyploids also offered new possibilities for plant breeding. Blakeslee and Avery (1937) demonstrated the usefulness of colchicine in the induction of chromosome doubling and polyploidy, enabling plant breeders to combine entire chromosome sets of two or more species to evolve new crop plants.

1.3.5 Genetic diversity and germplasm conservation

The importance of genetic diversity in plant breeding was recognized by the 1960s and Sir Otto Frankel coined the term 'genetic resources' in 1967 to highlight the relevance and need to consider germplasm as a natural resource for the long-term improvement of crop plants. The potentially harmful effects of genetic uniformity became apparent with the epidemic of southern corn leaf blight in the USA in 1970, which destroyed about 15% of US maize in just 1 year. The National Academy of Sciences, USA, released the results of its study Genetic Vulnerability

in Major Crops that brought into focus the causes and levels of genetic uniformity and its consequences. It was a turning point in the history of germplasm resources and the International Board for Plant Genetic Resources (IBPGR) was established in 1974, and was later renamed the International Plant Genetic Resources Institute (IPGRI) and now Bioversity International, to collect, evaluate and conserve plant germplasm for future use.

1.3.6 Quantitative genetics and genotype-by-environment interaction

Quantitative genetics is the study of the genetic control of those traits which show continuous variation. It is concerned with the level of inheritance of these differences between individuals rather than the type of differences, that is quantitative rather than qualitative (Falconer, 1989). Several important books have been published which document the major developments in quantitative genetics and these include Animal Breeding Plans (Lush, 1937), Population Genetics and Animal Improvement (Lerner, 1950), Biometrical Genetics (Mather, 1949), Population Genetics (Li, 1955), An Introduction to Genetic Statistics (Kempthorne, 1957) and Introduction to Quantitative Genetics (Falconer, 1960).

Many of the misconceptions regarding the inheritance of quantitative traits, which include most of the economically important characters, were corrected by the classical work of Fisher (1918), who successfully applied Mendelian principles to explain the genetic control of continuous variation. He divided the phenotypic variance observed into three variance components: additive, dominance and epistatic effects. This approach has been substantially refined and applied to the improvement of the efficiency of plant breeding. Fisher also laid the foundations for scientific crop experimentation by developing the theory of experimental designs that is an essential part of any plant breeding programme. Quantitative genetics has, however, evolved considerably in the past two decades because of the development of plant genomics, particularly molecular markers, and other molecular tools that can be used to dissect complex traits into single Mendelian factors (Xu and Zhu, 1994; Buckler et al., 2009; Chapters 6 and 7).

Genotype-by-environment interaction (GEI) and its importance to plant breeding were first recognized by Mooers (1921) and Yates and Cochran (1938). Since then, various statistical methods have been developed for the evaluation of GEI using joint linear regression, heterogeneity of variance and lack of correlation, ordination, clustering, and pattern analysis. As an important field in quantitative genetics, GEI has been receiving more attention in recent years and is covered in Chapter 10 along with molecular methods for GEI analysis.

1.3.7 Heterosis and hybrid breeding

Although early botanists had observed increased growth when unrelated plants of the same species were crossed, it was Charles Darwin who carried out the first seminal experiments. In 1877, he showed that crosses of related strains did not exhibit the vigour of hybrids. He observed heterosis, i.e. the tendency of cross-bred individuals to show qualities superior to those of both parents, in crops like maize and concluded that cross-fertilization was generally beneficial and self-fertilization injurious. In 1879, William Beal demonstrated hybrid vigour in maize by using two unrelated cultivars. The best combinations yielded 50% more than the mean of the parents. Reports by Sanborn in 1890 and McClure in 1892 confirmed Beal's earlier reports and extended the generality of the superiority of hybrids over the average of the parental forms.

1.3.8 Refinement of populations

Several different population breeding methods can be used: (i) bulk; (ii) mass selection; and (iii) recurrent selection. One

of the methods used for managing large populations of segregates was the bulk method proposed by Harlan et al. (1940) for multi-parent crosses. This concept changed the breeding methodologies for self-pollinated species. Mass selection is a system of breeding in which seeds from individuals selected on the basis of phenotype are bulked and used to grow the next generation. Mass selection is the oldest breeding method for plant improvement and was employed by early farmers for the development of cultivated species from their ancestral forms.

The enhancement of open-pollinated populations of crops such as rye, maize and sugarbeet, herbage grasses, legumes, and tropical trees such as cacao, coconut, oil palm, and some rubber, depends essentially on changing the gene frequencies so that the favourable alleles are fixed, while maintaining a high (but far from maximal) degree of heterozygosity. Recurrent selection is a method of plant breeding associated with quantitatively inherited traits by which the frequencies of favourable genes are increased in populations of plants. The methodology is cyclical with each cycle encompassing two phases: (i) selection of genotypes that possess the favourable or required genes; and (ii) crossing among the selected genotypes. This leads to a gradual increase in the frequencies of the desired alleles. While recurrent selection is often successful, it also has potential limitations in closed populations and this has led to numerous modifications and alternative schemes (see Hallauer and Miranda, 1988). Recurrent selection breeding methods have been applied to a wide range of plant species, including self-pollinated crops.

1.3.9 Cell totipotency, tissue culture and somaclonal variation

The discovery of auxins, by Went and colleagues, and cytokinins, by Skoog and colleagues, preceded the first success of in vitro culture of plant tissues (White, 1934; Nobécourt, 1939).

All the genes necessary to make an entire organism can be induced to function in the correct sequence from a living cell isolated from a mature tissue (called totipotency). Regeneration of whole plants from single cells is an important new source of genetic variability for refining the properties of plants because when somatic embryos derived from single cells are grown into plants, the plants' characteristics vary somewhat. Larkin and Scowcroft (1981) coined the term 'somaclonal variation' to describe this observed phenotypic variation among plants derived from micro-propagation experiments. When it was recognized as a genuine phenomenon, somaclonal variation was considered to be a potential tool for the introduction of new variants of perennial crops that can be asexually propagated (e.g. banana). Somaclonal variation has also been exploited by plant breeders as a new source of genetic variation for annual crops.

1.3.10 Genetic engineering and gene transfer

The discovery of the structure of DNA by Watson and Crick has enhanced traditional breeding techniques by allowing breeders to pinpoint the particular gene responsible for a particular trait and to follow its transmission to subsequent generations. Enzymes that cut and rejoin DNA molecules allow scientists to manipulate genes in the laboratory. In 1973 Stanley Cohen and Herbert Boyer spliced the gene from one organism into the DNA of another to produce recombinant DNA which was then expressed normally, and this formed the basis of genetic engineering. The goal of plant genetic engineers is to isolate one or more specific genes and introduce these into plants. Improvement in a crop plant can often be achieved by introducing a single gene, and genes can now be transferred to plants using the natural gene transfer system of a promiscuous pathogenic soil bacterium, Agrobacterium tumefaciens. DNA can also be introduced into cells by bombardment with DNA-coated particles or by electroporation. Transgenic breeding

has the potential to decrease or increase the environmental impact of agricultural practices.

The initial successes in plant genetic engineering marked a significant turning point in crop research. In the 1990s in particular, there was an upsurge of private sector investment in agricultural biotechnology. Some of the first products were plant strains capable of synthesizing an insecticidal protein encoded by a gene isolated from the bacterium Bacillus thuringiensis (Bt). Bt cotton, maize, and other crops are now grown commercially. There are also crop cultivars which are tolerant to or capable of degrading herbicides. Proponents stress the value of these crops in conserving soil through reduced tillage, reducing the use of harmful chemicals and reducing the labour and costs involved in crop production.

1.3.11 DNA markers and genomics

During the 1980s and 1990s, various types of molecular markers such as restriction fragment length polymorphism (RFLP) (Botstein et al., 1980), randomly amplified polymorphic DNA (RAPD) (Williams et al., 1990; Welsh and McClelland, 1990), microsatellites and single nucleotide polymorphism (SNP) were developed. Because of their abundance and importance in the plant genome, molecular markers have been widely used in the fields of germplasm evaluation, genetic mapping, map-based gene discovery and marker-assisted plant breeding. Molecular marker technology has become a powerful tool in the genetic manipulation of agronomic traits.

Initiated by the complete sequencing of the Arabidopsis genome in 2000 (The Arabidopsis Genome Initiative, 2000) and the rice genome in 2002 (Goff et al., 2002; Yu et al., 2002), the genomes of an increasing number of plants have been or are being sequenced. Technological developments in bioinformatics, genomics and various '-omics' fields are creating substantial data on which future revolutions in plant breeding can be based.

1.3.12 Breeding efforts in the public and private sectors

Agricultural research has mainly been the responsibility of a national and/or state government department. To accelerate progress in food production, especially in developing countries, international agricultural research centres were established with major emphasis on the development of high-yielding cultivars. Two centres, the International Rice Research Institute (IRRI), Philippines, and the Centro Internacional de Mejoramiento de Maíz y Trigo (CIMMYT), Mexico, established in the 1960s, made phenomenal contributions to food production by developing shorter and higher-yielding rice, wheat and maize cultivars. Encouraged by the astonishing success of these centres and two others which were established later, the Consultative Group on International Agricultural Research (CGIAR) was established in 1971. The CGIAR now has 15 international agricultural research centres, of which eight concentrate on specific crop plants and one on genetic resources, with a mission to contribute towards sustainable agriculture for food security, especially in developing countries. The breeding materials developed at these centres are distributed to public and private sector research programmes for utilization in the development of locally adapted cultivars. Through the National Agricultural Research Systems (NARS), these centres work in close coordination with public and private breeding programmes in each country and share their breeding technologies and stocks of germplasm.

In the USA, crop breeding, with the exception of cotton, began largely as a tax-supported endeavour with breeding programmes taking place in most State Agricultural Experimental Stations and in the United States Department of Agriculture (USDA). This pattern changed with the advent of hybrid maize, when inbred lines were initially developed by public institutions and utilized to produce hybrids by private companies. With the implementation of a Plant Variety Protection Act in the USA in 1974, private breeding was expanded to

include forages, cereals, soybean, and other crops. The activities of private companies contributed to the total crop breeding effort and offered a large number of cultivar options for farmers and consumers. In the USA and other industrialized countries today, the new life-science companies, notably the big multinationals such as Dow, DuPont and Monsanto, dominate the application of biotechnology to agriculture, and have developed many proprietary products.

1.4 Genetic Variation

The creation of new alleles and the mixing of alleles through recombination give rise to genetic variation, which is one of the forces behind evolution. Natural selection favours one phenotype over another and these phenotypes are conditioned by one or more alleles. Genetic variation is fundamental for selection, by which progress in plant breeding can be made. There are various sources of genetic variation and those described in this section are largely based on the information provided at the following web sites: http://www.ndsu.nodak.edu/instruct/mcclean/plsc431/mutation and http://evolution.berkeley.edu/evosite/evo101/IIICGeneticvariation.shtml.

1.4.1 Crossover, genetic drift and gene flow

Chromosomal crossover takes place during meiosis and results in a chromosome with a completely different chemical composition from the two parent chromosomes. During the process, two chromosomes intertwine and exchange one end of the chromosome with the other. The mechanism of crossing over is the cytogenetic base for recombination.
Gene flow refers to the passage of traits or genes between populations to prevent the occurrence of large numbers of mutations and genetic drift. In genetic drift, random variation occurs in small populations, leading to the proliferation of specific traits within that population. The degree of gene flow varies widely and is dependent on the type of organism and population structure. For example, genes in a mobile population are likely to be more widely distributed than those in a sedentary population, resulting in high and low rates of gene flow, respectively.

1.4.2 Mutation

A mutation is any change in the sequence of the DNA encoding a gene which leads to a change in the hereditary material when an organism undergoes DNA replication. During the process of replication, the nucleotides of a chromosome are altered, so rather than creating an identical copy of DNA strands, there are chemical variations in the replicated strands. The alteration in the chemical composition of DNA triggers a chain reaction in the genetic information of an individual. The effect of a mutation depends on its size, location (intron or exon etc.), and the type of cell in which the mutation occurs. Large changes involve the loss, addition, duplication or rearrangement of whole chromosomes or chromosome segments. Most DNA polymerases have the ability to proofread their work to ensure that the unaltered genetic material is transferred to the next generation. There are many types of mutation and the most common are listed below.
1. Point mutations represent the smallest changes where only a single base is altered. For example, a single nucleotide change may result in the change of an amino acid (aa) codon into a stop codon and thus produce a change in the phenotype. Point mutations do not usually benefit the organism as most occur in recessive genes and are not usually expressed unless two mutations occur at the same locus.
2. In synonymous or silent substitutions the aa sequence of the protein is not changed because several codons can code for the same aa, and in non-synonymous

substitutions changes in the aa sequence may not affect the function of the protein. However, there have been many cases where a change in a single nucleotide can create a serious problem, e.g. in sickle cell anaemia.
3. Wild-type alleles typically encode a product necessary for a specific biological function and if a mutation occurs in that allele, the function for which it encodes is also lost. The general term for these mutations is loss-of-function mutations and they are typically recessive. The degree to which the function is lost can vary. If the function is entirely lost, the mutation is called a null mutation. It is also possible that some function may remain, but not at the level of the wild-type allele; these are known as leaky mutations.
4. A small number of mutations are actually beneficial to an organism, providing new or improved gene activity. In these cases, the mutation creates a new allele that is associated with a new function. Any heterozygote containing the new allele along with the original wild-type allele will express the new allele. Genetically this will define the mutation as dominant. This class of mutation is known as gain-of-function mutations.
5. A substitution is a mutation in which one base is exchanged for another. Such a substitution could change: (i) a codon to one that encodes a different aa, thus causing a small change in the protein produced; (ii) a codon to one that encodes the same aa, resulting in no change in the protein produced; or (iii) an aa-coding codon to a stop codon, resulting in an incomplete protein (this can have serious effects since the incomplete protein will probably not be functional).
6. Insertions/deletions (indels) produce changes by deleting or inserting sections of DNA into the parental DNA sequence. Because it is usually impossible to say whether a sequence has been deleted from one plant or inserted into another, these differences are called indels. Obviously the deletion of part of a gene can seriously affect the phenotype of organisms. Insertions can be disruptive if they insert themselves into the middle of genes or regulatory regions.
7. A mutation in which a nucleotide is inserted or deleted, causing all the codons to its right to be altered, is known as a frame-shift mutation. Since protein-coding DNA is divided into codons three bases long, insertions and deletions of a single base can alter a gene so that its message is no longer correctly parsed. As a result, a single base change can have a dramatic effect on a polypeptide sequence.
8. Mutations which occur in germ line cells, including both the gametes and the cells from which they are formed, are known as germinal mutations. A single germ line mutation can have a range of effects: (i) no phenotypic change; mutations in junk DNA are passed on to the offspring but have no obvious effect on the phenotype; (ii) small (or quantitative) phenotypic changes; and (iii) significant phenotypic change.
9. Mutations in somatic cells, which give rise to all non-germ line tissues, only affect the original individual and cannot be passed on to the progeny. To maintain this somatic mutation, the individual containing the mutation must be cloned.

In general, the appearance of a new mutation is a rare event. Most mutations that were originally studied occurred spontaneously. Such spontaneous mutations represent only a small number of all possible mutations. To genetically dissect a biological system further, induced mutations can be created by treating an organism with a mutagenic agent.

1.5 Quantitative Traits: Variance, Heritability and Selection Index

Recent advances in high-throughput technologies for the quantification of biological molecules have shifted the focus in quantitative genetics from single traits to comprehensive large-scale analyses. So-called omic technologies have now enabled geneticists to determine how genetic information is translated into biological function (Keurentjes et al., 2008; Mackay et al., 2009). The ultimate goal of quantitative genetics in the era of omics is to link genetic variation

to phenotypic variation and to identify the molecular pathway from gene to function. The recent progress made in humans by combining linkage disequilibrium mapping (Chapter 6) and transcriptomics (Chapter 3) holds great promise for high-resolution association mapping and identification of regulatory genetic factors (Dixon et al., 2007). Information from omics research will be integrated with our current knowledge at the phenotypic level to increase the effectiveness and efficiency of plant breeding.

1.5.1 Qualitative and quantitative traits

In general, qualitative traits are genetically controlled by one or a few major genes, each of which has a relatively large effect on the phenotype but is relatively insensitive to environmental influences. Trait distribution in a typical segregating population such as an F2 shows multi-peak distribution, although individuals within a category show continuous variation. Each individual in the population can be classified unambiguously into distinct categories that correspond to different genotypes so that they can be studied using Mendelian methods.
Quantitative traits are genetically controlled by many genes, each of which has a relatively small effect on the phenotype, but is largely influenced by environmental factors (Buckler et al., 2009). Trait distribution in an F2 population usually shows a normal or bell-shaped distribution and, as a result, individuals cannot be classified into phenotypic categories that correspond to different genotypes, thus making the effects of individual genes indistinguishable. Quantitative genetics is traditionally described as the study of all these genes as a whole, and the total variation observed in a population results from the combined effects of genetic (polygenes as a whole) and environmental factors. However, quantitative variation is not due solely to minor allelic variation in structural genes as regulatory genes no doubt also contribute to this variation. We expect polygenes to show all the typical properties of chromosomal genes both in terms of action and in transmission through meiosis.

1.5.2 The concept of allelic and genotypic frequencies

A biological population is defined genetically as a group of individuals that exist together in time and space and that can mate or be crossed to each other to produce fertile progeny. Statistically, this group is called a population. Breeding populations are created by breeders to serve as a source of cultivars that meet specific breeding objectives.
At the population level, genetics can be characterized by allelic and genotypic frequencies. The allele frequencies refer to the proportion of each allele in the population, while the genotypic frequencies refer to the proportion of individuals (plants) in the population that have a particular genotype. A gene may have many allelic states. Some of the alleles of a given gene may have such marked effects as to be clearly recognized as a classical major mutant. Other alleles, though potentially separable at the DNA level, may well cause only minor differences at the level of the external phenotype. For example, one allele at a locus involved with growth hormone production could be inactive and result in a dwarf plant, while others may simply reduce or increase height by a few percent.
Allele and genotypic frequencies can be calculated by simple counting in the population. For a gene with n alleles, there are n(n + 1)/2 possible genotypes. The relationship between allele frequency and genotypic frequency for a single gene at the population level can be used to infer the genetic status of the gene in that population, relative to the expected equilibrium under some assumed mating system. Allele frequencies are generally not an issue in breeding populations created from non-inbred parents or from three or more inbred parents. But breeding populations in both self-pollinated and cross-pollinated crops are often created by crossing two inbred individuals.
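The counting described in Section 1.5.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample of 100 diploid plants, the allele labels A1 and A2, and all function names are hypothetical and do not come from the text. The last step compares the observed genotypic frequencies with those expected under random mating (p2, 2pq, q2), the equilibrium discussed in Section 1.5.3.

```python
from collections import Counter
from itertools import combinations_with_replacement


def genotype_frequencies(genotypes):
    """Proportion of individuals carrying each (unordered) genotype."""
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def allele_frequencies(genotypes):
    """Proportion of each allele among all gene copies (two per diploid plant)."""
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


def possible_genotypes(alleles):
    """All unordered diploid genotypes; n alleles give n(n + 1)/2 of them."""
    return list(combinations_with_replacement(sorted(alleles), 2))


# Hypothetical sample: 100 diploid plants scored at a single locus.
sample = [("A1", "A1")] * 36 + [("A1", "A2")] * 48 + [("A2", "A2")] * 16

geno_freq = genotype_frequencies(sample)
allele_freq = allele_frequencies(sample)

# Genotypic frequencies expected under random mating for these allele frequencies.
p, q = allele_freq["A1"], allele_freq["A2"]
expected = {("A1", "A1"): p * p, ("A1", "A2"): 2 * p * q, ("A2", "A2"): q * q}

print("genotypic frequencies:", geno_freq)
print("allele frequencies:", allele_freq)
print("expected under random mating:", expected)
print("genotypes possible with 3 alleles:", len(possible_genotypes(["A1", "A2", "A3"])))
```

Each heterozygote contributes one copy of each allele, which is why allele frequencies are computed over gene copies rather than over plants.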

1.5.3 Hardy-Weinberg equilibrium (HWE)

A population is in equilibrium if the allele and genotypic frequencies are constant from generation to generation. A collection of pure selfers is also at equilibrium if all are completely selfed, with $P_{A_1A_1} = p$ and $P_{A_2A_2} = q$. Under random mating, equilibrium implies that the allele frequencies ($p$ for $A_1$ and $q$ for $A_2$, with $p + q = 1$) and the genotypic frequencies share a simple relationship:

$$P_{A_1A_1} = p^2$$
$$P_{A_1A_2} = 2pq$$
$$P_{A_2A_2} = q^2$$

or

$$(p + q)^2 = p^2 + 2pq + q^2$$

With one generation of random mating, i.e. mating in which each individual in the population is equally likely to mate with any other individual, the above simple relationship will be obeyed. However, HWE represents idealized populations and breeders routinely use procedures that cause deviations from HWE. These procedures include the lack of random mating, the use of small population sizes, assortative mating, selection, and inbreeding during the development of progenies. Some of these procedures, such as inbreeding and the use of small population sizes, affect all loci in the population while others affect only certain loci. Suppose that two traits are controlled by different sets of loci, and a change in one trait does not affect the other. If selection occurs only for the first trait, the loci affecting that trait may deviate from HWE, but the loci for the other trait will remain in equilibrium. In large natural populations, migration, mutation, and selection are the forces that can change allelic frequencies from generation to generation.

1.5.4 Population means and variances

Theoretically, a population can be described by its parameters such as the mean and variance, which depend on the probability distribution of the population. The arithmetic mean, $\mu$, also known as the first moment about the origin, is a parameter used to measure the central location of a frequency distribution. The population variance, $\sigma^2$, also known as the second moment about the mean, provides a measure of the dispersion of the distribution. If the yield trait for a cultivar that is genetically homogeneous is taken as an example, the genetic effect for this cultivar population is a constant. The yield of every individual would also be a constant, equal to the population mean, provided that environmental factors did not affect the yield. However, the yield for each individual is affected not only by its genotype but also by environmental factors such as temperature, sunlight, water, and various nutrients. As a result, individuals may have different phenotypic values, in this case yield, resulting in continuous variation among individuals. Therefore, the individual yield measures vary either positively or negatively around the population mean, so that they are either higher or lower than the population mean by an amount whose spread is described by the variance.

1.5.5 Heritability

The response of traits to selection depends on the relative importance of the genetic and non-genetic factors which contribute to phenotypic variation among genotypes in a population, a concept referred to as heritability. The heritability of a trait has a major impact on the methods chosen for population improvement, inbreeding, and selection. Selection for single plants is more efficient when the heritability is high. The extent to which replicated testing is required for selection depends on the heritability of the trait.
The question of whether a trait variation is a result of genetic or environmental variation is meaningless in practice. Genes cannot cause a trait to develop unless the organism is growing in an appropriate environment, and, conversely, no amount of manipulation will cause a phenotype to

develop unless the necessary gene or genes are present. Nevertheless, the variability observed in some traits might result primarily from differences in the numbers and the magnitude of the effect of different genes, whereas the variability in other traits might stem primarily from the differences in the environments to which various individuals have been exposed. It is therefore essential to identify reliable measures to determine the relative importance of not only the numbers and magnitude of the effects of the genes involved, but also of the effects of different environments on the expression of phenotypic traits (Allard, 1999).
Heritability is defined as the ratio of genetic variance to phenotypic variance:

$$h^2 = \frac{\sigma_G^2}{\sigma_P^2} = \frac{\sigma_G^2}{\sigma_G^2 + \sigma_E^2}$$

where $\sigma_P^2$ is phenotypic variance, which has two components, genetic variance $\sigma_G^2$ and environmental variance $\sigma_E^2$. $\sigma_E^2$ can be estimated by the phenotypic variances of non-segregating populations such as inbred lines and F1s because individuals in such a population have the same genotypes and thus, phenotypic variation in these populations can be attributed to environmental factors. $\sigma_G^2$ can be estimated using segregating populations such as F2 and backcrosses where variance components can be obtained theoretically.

1.5.6 Response to selection

Genetic variation forms the basis for selection in plant breeding. Selection results in the differential reproduction of genotypes in a population so that gene frequencies change and, with them, genotypic and phenotypic values (mean and variance) of the trait being selected. Response to selection, or advance in one generation of selection, is measured by the difference between the mean of the offspring of the selected individuals and the mean of the parental population, and is denoted as $R$. Response to selection has been referred to by several different names, including genetic progress, genetic advance, genetic gain, and predicted progress or gain, and has been denoted as $R$, $G_S$, $G$ and $\Delta G$.
Starting with a parental population of mean, $\mu$, a subset of individuals is selected. The selected individuals have a mean $\bar{x}$, while the offspring of the selected population has a mean $\bar{y}$. The difference between the selected population and the original population is defined as the selection differential, and denoted by $S$, i.e.

$$S = \bar{x} - \mu$$

The response to selection, $R$, can be written as

$$R = \bar{y} - \mu$$

The relationship between $S$ and $R$ is determined by heritability,

$$R = h^2 S$$

How much of the selection differential is realized in the offspring population depends on the heritability of the trait. The heritability, $h^2$, in the formula can be either $h_N^2$ or $h_B^2$ (depending on whether the offspring are produced by sexual or asexual reproduction, respectively). From the above formula, $\bar{y} = \mu + h^2 S$.
The population mean of the offspring derived from the selected individuals is equal to the parental population mean plus the response to selection (Fig. 1.1). When $h^2 = 1$, the selection differential will be fully realized in the offspring population so that its mean will deviate from the parental population by $S$. When $h^2 = 0$, the selection differential cannot be realized so the offspring population mean will regress to its parental population. When $0 < h^2 < 1$, the selection differential is partially realized so that the mean of the offspring population will deviate from the parental population by $h^2 S$. It is very useful to predict the response before selection is undertaken and details of the mathematical derivation of these predictions together with the various complications encountered can be found in Empig et al. (1972), Hallauer and Miranda (1988) and Nyquist (1991).
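The heritability ratio and the prediction of the offspring mean from the selection differential can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration: the yield values are invented, the function names are hypothetical, and the genetic variance is approximated as the phenotypic variance of a segregating (F2) population minus the environmental variance estimated from a non-segregating (F1) population, which is one simple way of applying the definitions above rather than a prescribed procedure.

```python
from statistics import mean, variance


def heritability(segregating, non_segregating):
    """Estimate h2 as genetic variance / phenotypic variance.

    Assumption: phenotypic variance comes from the segregating population,
    environmental variance from the non-segregating one, and the genetic
    variance is their difference (floored at zero).
    """
    var_p = variance(segregating)
    var_e = variance(non_segregating)
    var_g = max(var_p - var_e, 0.0)
    return var_g / var_p


def predicted_offspring_mean(pop_mean, selected_mean, h2):
    """Offspring mean = parental mean + R, with R = h2 * S and S = selected mean - parental mean."""
    selection_differential = selected_mean - pop_mean
    response = h2 * selection_differential
    return pop_mean + response


# Hypothetical yield records (arbitrary units).
f2_yields = [4, 6, 8, 10, 12]   # segregating population: genetic + environmental variation
f1_yields = [6, 8, 10]          # non-segregating population: environmental variation only

h2 = heritability(f2_yields, f1_yields)
mu = mean(f2_yields)            # parental population mean
x_bar = mean([10, 12])          # mean of the selected individuals
y_bar = predicted_offspring_mean(mu, x_bar, h2)

print("h2 =", round(h2, 3))
print("S =", x_bar - mu)
print("predicted offspring mean =", round(y_bar, 3))
```

With h2 set to 0 or 1 in `predicted_offspring_mean`, the predicted mean falls back to the parental mean or moves by the full selection differential, matching the two limiting cases described in the text.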

[Figure 1.1: parental population with the selected individuals (k = 5%, mean $\bar{x}$) and the selection differential $S = \bar{x} - \mu$; progeny populations with means $\bar{y}$ for $h^2 = 0$, $0.5$ and $1.0$, and the response $R = \bar{y} - \mu$.]

Fig. 1.1. Distribution of parental and progeny populations with a selection intensity of 5%. Because the phenotypic values of the selected plants include both a genetic and an environmental component, the progeny means depend on the heritability of the trait selected.

1.5.7 Selection index and selection for multiple traits

In most plant breeding programmes, there is a need to improve more than one trait at a time. For example, a high-yielding cultivar susceptible to a prevalent disease would be of little use to a grower. Recognition that improvement of one trait may cause improvement or deterioration in associated traits serves to emphasize the need for the simultaneous consideration of all traits which are important in a crop species. Three selection methods, which are recognized as appropriate for the simultaneous improvement of two or more traits in a breeding programme, are index selection, independent culling, and tandem selection. Independent culling requires the establishment of minimum levels of merit for each trait. An individual with a phenotypic value below the critical culling level for any trait will be removed from the population. That is, only individuals meeting requirements for all traits will be selected. With tandem selection, one trait is selected until it is improved to a satisfactory level or a critical phenotypic value. Then, in the next generation or programme, selection for a second trait is carried out within the population selected for the first trait, and so on for the third and subsequent traits. A selection index is a single score which reflects the merits and demerits of all target traits. Selection among individuals is based on the relative values of the index scores.
Selection indices provide one method for improving multiple traits in a breeding programme. The use of a selection index in plant breeding was originally proposed by Smith (1936) who acknowledged critical input from Fisher (1936). Subsequently, methods of developing selection indices were modified, subjected to critical evaluation, and compared to other methods of multiple trait selection.
It is generally recognized that a selection index is a linear function of observable phenotypic values of different traits. There are a number of forms of the equations available from index selection for multiple traits in grain. To construct a selection index, the observed value of each trait is weighted by an index coefficient,

$$I = b_1x_1 + b_2x_2 + \cdots + b_nx_n$$

where $I$ is an index of merit of an individual, $x_i$ represents the observed phenotypic value of the $i$th trait, and $b_1, \ldots, b_n$ are weights assigned to phenotypic trait measurements represented as $x_1, \ldots, x_n$. The $b$ values are the products of the inverse of the phenotypic variance-covariance matrix, the genotypic variance-covariance matrix, and a vector of economic weights. A number of variations of this index, most changing the manner of computing the $b$ values, have been developed. These include the base index of Williams (1962), the desired gain index of Pesek and Baker (1969), and retrospective indexes proposed by Johnson et al. (1988) and Bernardo (1991). The emphasis in the retrospective index developments is on quantifying the knowledge experienced breeders have obtained. Baker (1986) summarized all selection indexes in plant breeding developed before that time.

1.5.8 Combining ability

Combining ability is a very important concept in plant breeding and it can be used to compare and investigate how two inbred lines can be combined together to produce a productive hybrid or to breed new inbred lines. Selection and development of parental lines or inbreds with strong combining ability is one of the most important breeding objectives, no matter whether the goal is to create a hybrid with strong vigour or develop a pure-line cultivar with improved characteristics compared to its parental lines. In maize breeding, Sprague and Tatum (1942) partitioned the genetic variability among crosses into effects due to primarily either additive or non-additive effects, which correspond to two categories of combining ability, general combining ability (GCA) and specific combining ability (SCA). The relative importance of GCA and SCA depends on the extent of previous testing of the parents included in the crosses. Although these concepts were developed for breeding maize, an open-pollinated crop, they are generally applicable to self-pollinated crops.
The GCA for an inbred line or a cultivar can be evaluated by the average performance of yield or other economic traits in a set of hybrid combinations. The SCA for a cross combination can be evaluated by the deviation in its performance from the value expected from the GCA of its two parental lines. If the crosses among a set of inbred lines are made in such a way that each line is crossed with several other lines in a systematic manner, the total variation among crosses can be partitioned into two components ascribable to GCA and SCA. The mean performance of a cross ($\bar{x}_{AB}$) between two inbred lines A and B can be represented as

$$\bar{x}_{AB} = GCA_A + GCA_B + SCA_{AB}$$

The $GCA_A$ and $GCA_B$ are the GCA of the parents A and B, respectively, and the cross A × B is expected to have a performance equal to the sum ($GCA_A + GCA_B$) of the GCA of its parents. The actual performance of the cross, however, may be different from the expectation by an amount equivalent to the SCA. Sprague and Tatum (1942) interpreted these combining abilities in terms of type of gene action. The differences due to GCA of lines are the results of additive genetic variance and additive by additive interaction whereas SCA is a reflection of non-additive genetic variances.

1.5.9 Recurrent selection

Recurrent selection can be broadly defined as the systematic selection of desirable individuals from a population followed by recombination of the selected individuals to form a new population. The basic feature of recurrent selection methods is that they are procedures conducted in a repetitive manner, or recycling, including development of a base population with which to begin selection, evaluation of individuals from

the population, and selection of superior individuals as parents that can be crossed to produce a new population for the next cycle of selection, as shown below:

Develop a population -> Evaluate individuals in the population -> Select superior individuals as parents -> (new population for the next cycle)

A cycle of selection is completed each time a new population is formed. The initial population that is developed for a recurrent selection programme is referred to as the base, or cycle 0, population. The population formed after one cycle of selection is called the cycle 1 population; the cycle 2 population is developed from the second cycle of selection, and so on.
Recurrent selection procedures are conducted for primarily quantitatively inherited traits. The objective of recurrent selection is to improve the mean performance of a population of plants by increasing the frequency of favourable alleles in a consistent manner in order to enhance the value of the population and to maintain the genetic variability present in the population as effectively as possible. In addition, separation of the genetic and environmental effects is an important facet of effective recurrent selection methods. The improved populations can be used as a cultivar per se, as parents of a cultivar-cross hybrid and as a source of superior individuals that can be used as inbred lines, pure-line cultivars, clonal cultivars, or parents of a synthetic line. Successful recurrent selection results in an improved population that is superior to the original population in mean performance and in the performance of the best individuals within it. Ideally, the population will be improved without its genetic variability being significantly reduced so that additional selection and improvement can occur in the future. Recurrent selection is complementary to inbred development procedures; in fact the concept of recurrent selection was developed, particularly for outcrossing crops, to rectify limitations in inbred development by continuous selfing that rapidly leads to inbreeding and allele fixation and thus inadequate opportunity for selection. There are two ways by which recurrent selection addresses this limitation in inbred development (Bernardo, 2002). First, recurrent selection increases the frequency of favourable alleles in the population by repeated cycles of selection. Secondly, recurrent selection maintains the degree of genetic variation in the population to allow sustained progress from subsequent cycles of selection. Genetic variation is maintained by recombining a sufficiently large number of individuals to reduce random fluctuations in allele frequencies, i.e. genetic drift.
Since the late 1950s, extensive research has been conducted to determine the relative importance of different genetic effects on the inheritance of quantitative traits for most cultivated plant species. As indicated by Hallauer (2007), quantitative genetic research has provided extensive information to assist plant breeders in developing breeding and selection strategies. Directly and/or indirectly, the principles for the inheritance of quantitative traits are pervasive in developing superior cultivars to meet the worldwide food, feed, fuel and fibre demands. The principles of quantitative genetics will have continued importance in the future.

1.6 The Green Revolution and the Challenges Ahead

The application of science and technology to crop production in the second half of the 20th century resulted in significant yield improvements for rice, wheat and maize in the developed countries, and the final result of these efforts was the Green Revolution, which led to a new type of agriculture, high-input or chemical-genetic agriculture, which replaced the more traditional system. Countries involved in the Green Revolution, a term coined by Borlaug (1972), included Japan, Mexico, India and China among others.

By production and acreage yardsticks, agriculture has been very successful. The application
of scientific knowledge to agriculture has resulted in greatly increased yields per unit land
area for many of our important crops as exemplified by the 92% increase in cereal production
in the developing world between 1961 and 1990. The sharp increase in human populations has
been paralleled by the increase in food supply. However, yield growth rates are stagnating in
some areas and, in a few cases, falling. A slowdown in the rate of yield increase of major
cereals raises concern because increased yields are expected to be the source of increased
food production in the future (Reeves et al., 1999). On the other hand, increased national
wealth resulting from economic development is not necessarily correlated with a decrease in
the rate of population growth. Widespread hunger persists in a world that produces enough
food.

There are many reasons for being concerned about meeting future food demands (Khush, 1999;
Swaminathan, 2007). Expansion of the planet's population creates an increased demand for food
and income. Other issues such as the cost of food, which may represent 60% of income in the
developing world, the 800 million people who are food insecure, the 200 million children who
are malnourished, and the continuing decline of available land for farming and water to
irrigate crops, all indicate the need to use all the technologies available to increase
productivity, assuming they can be employed in harmony with the environment. Plant breeding
has generally accounted for one-half of the increases in productivity of the major crops and
the future will continue to depend on advances in plant breeding. The increase in
productivity has meant that large areas of land can be saved as wildlife habitats or used for
purposes other than agriculture. As the availability of land and water is decreasing and
populations are increasing in size, the 50% increase in food production predicted to be
required over the next 25 years poses an obvious challenge.

The danger of population growth overtaking food supplies was predicted by Malthus in 1817.
The dire predictions of Malthus were forestalled, at least temporarily, by the extensive
cultivation of new land and by the development of a modern agricultural science which enabled
food crops to be produced at far higher yields than Malthus could have ever anticipated.
However, the production of food has still not been optimized in all areas of the world.

Weather and climate profoundly affect crop production and natural events can disrupt normal
climate cycles and affect agriculture. In addition, human-induced climatic change is set to
accelerate during this century and this will also impact on crop production. Much of the
arable land has been used for industrial purposes and land-use patterns indicate an increase
in intensive farming which, however, must be sustainable.

Agricultural products are affected by abiotic and biotic stresses and one of the major
challenges to the future of plant breeding is the development of cultivars and hybrids with
multiple resistances or tolerances to these stresses.

The security of the food supply for an increasing world population largely depends on the
availability of water for agriculture. Increasing the efficiency of water use for our major
crop species is an important target in agricultural research, particularly in light of the
increasing competition for limited supplies of fresh water in many parts of the world.

There are four prerequisites for greater productivity (Poehlman and Quick, 1983): (i) an
improved farming system; (ii) instruction of farmers; (iii) optimization of the supply of
water and fertilizers; and (iv) availability of markets. To increase crop productivity,
planting high-yielding cultivars must be combined with improved practices of irrigation,
fertilization and pest control. Maximum crop yield will only be achieved if the improved crop
cultivar receives and responds to the optimum combination of water, fertilizer and cultural
practices.

1.7 Objectives of Plant Breeding

The aim of plant breeders is to reassemble desirable inherited traits to produce crops
with improved characteristics. Thus far, plant breeders have mainly been concerned with
bringing about a continuous improvement in the productivity of that part of the plant which
is of economic importance, the stability of production through in-built resistance to pests
and diseases and nutritive and organoleptic or other desired quality characters.

Many parameters and selection criteria should be included as breeding objectives. According
to Sinha and Swaminathan (1984) and other sources, the major objectives of plant breeders can
be summarized by the following list:

1. High primary productivity and efficient final production for each unit of cultivation and
solar energy invested: to ensure that all the light that falls on a field is intercepted by
leaves and that photosynthesis itself is as efficient as possible. Greater efficiency in
photosynthesis could perhaps be achieved by reducing photorespiration.
2. High crop yield: plants must be selected which invest a large proportion of their total
primary productivity into those areas which are commercially desirable, e.g. seeds, roots,
leaves or stems.
3. Desirable nutritional value, organoleptic properties and processing qualities: the
proportion of essential amino acids and the total protein in cereal grains, for example,
should be increased to improve their nutritional quality.
4. Biofortifying crops with essential mineral elements that are frequently lacking in the
human diet such as Fe and Zn, vitamins and amino acids (Welch and Graham, 2004; White and
Broadley, 2005; Bekaert et al., 2008; Mayer et al., 2008; Ufaz and Galili, 2008; Naqvi et al.,
2009; Xu et al., 2009a).
5. Modifying crop plants to generate plant-derived pharmaceuticals to supply low-cost drugs
and vaccines to the developing world (Ma et al., 2005).
6. Adaptation to cropping systems: including breeding for contrasting cropping,
intercropping, and sustainable cropping systems (Brummer, 2006).
7. More extensive and efficient nitrogen fixation: breeding cereals that encourage the growth
of increased numbers of nitrogen-fixing microorganisms around their roots to reduce the need
for nitrogen fertilizer.
8. More efficient use of water whether there is a plentiful supply or dearth of water.
9. Stability of crop production by resilience to weather fluctuations, resistance to the
multiple alliance of weeds, pests and pathogens, and tolerance to various abiotic stresses
such as heat, cold, drought, wind, and soil salinity, acidity or aluminium toxicity.
10. Insensitivity to photoperiod and temperature: selection of crop cultivars that are
insensitive to photoperiod or temperature and characterized by a high per-day biomass
production would allow the development of contingency cropping patterns to suit different
weather probabilities.
11. Plant architecture and adaptability to mechanized farming: the number and positioning of
the leaves, branching pattern of the stem, the height of the plant, and the positioning of the
organs to be harvested are all important to crop production and often determine how well
plants can be harvested mechanically.
12. Elimination of toxic compounds.
13. Identification and improvement of hardy plants suitable for sources of biomass and
renewable energy.
14. Multiple uses of a single crop.
15. Environmentally-friendly and stable across environments.

In conclusion, plant breeding has many breeding objectives and each of the objectives can be
addressed in a specific breeding programme. A successful breeding programme consists of a
series of activities as Burton (1981) summarized in six words: variate, isolate, evaluate,
intermate, multiply and disseminate.

1.8 Molecular Breeding

By 2025, the global population will exceed seven billion. In the interim per-capita
availability of arable land and irrigation water will decrease from year to year as biotic
and abiotic stresses increase. Food security, best defined as economic, physical
and social access to a balanced diet and safe drinking water, will be threatened, with a
holistic approach to nutritional and non-nutritional factors needed to achieve success in the
eradication of hunger. Science and technology can play a very important role in stimulating
and sustaining an Evergreen Revolution leading to long-term increases in productivity without
any associated ecological harm (Borlaug, 2001; Swaminathan, 2007). The objectives of the plant
breeder can be realized through conventional breeding integrated with various biotechnology
developments (e.g. Damude and Kinney, 2008; Xu et al., 2009c).

Plant breeding can be defined as an evolving science and technology (Fig. 1.2). It has
gradually been evolving from art to science over the last 10,000 years, starting as an ancient
art to the present molecular design-based science. With the development of molecular tools
which will be discussed further in Chapters 2 and 3, plant breeding is becoming quicker,
easier, more effective and more efficient (Phillips, 2006). Plant breeders will be well
equipped with innovative approaches to identify and/or create genetic variation, to define
the genetic feature of the genes related to the variation (position, function and
relationship with other genes and environments), to understand the structure of breeding
populations, to recombine novel alleles or allele combinations into specific cultivars or
hybrids, and to select the best individuals with desirable genetic features which enable them
to adapt to a wide range of environments.

Fig. 1.2. The steps of evolution of plant breeding. With the availability of more
sophisticated tools, the art of plant breeding became science-based technology, molecular
plant breeding. Stages shown in the diagram: art-based plant breeding (collection of wild
plants for food; selection of wild plants for cultivation; starting from 10,000 years ago);
large-scale breeding activities supported by commercial seed production enterprises,
hybridization combined with selection, and evolution through natural selection (1700s to
1800s); Mendelian genetics, quantitative genetics, mutation, polyploidy and tissue culture
(1900s); gene cloning and direct transfer, and genomics-assisted breeding (2000s and beyond);
molecular plant breeding.

Sequencing data for many plants is now readily available and the GenBank database is doubling
every 15 months. Over 20 plant species including many important crops are in the process of
being sequenced (Phillips, 2008). The next challenge is to determine the function of every
gene and eventually how genes interact to form the basis of complex traits. Fortunately, DNA
chips and other technologies are being developed to study the expression of multiple or even
all genes simultaneously. High throughput robotics and bioinformatics tools will play an
essential role in this endeavour.

New information about our crop species is expanding our capabilities to use molecular
genetics. For example, we did not previously realize how similar broadly related species are
in terms of their gene content and gene order. Since these species cannot usually be crossed,
there was no means of assessing their relatedness. With the advent of DNA-based molecular
markers, the extensive genetic mapping of chromosomes became readily possible for a variety
of species. We learned that the genomes were highly similar and that this similarity allowed
the prediction of gene locations among species. For example, rice has become the model or
reference species for the cereals as many of the gene sequences on the rice chromosomes are
shared with other cereals such as maize, sorghum, sugarcane, millet, oats, wheat and barley
(Xu et al., 2005). Knowing the complete DNA sequence of a model or reference genome allows
genes/traits from this
model to be tracked to other genomes. We have come to realize that the differences between
species of plants are not due to novel genes, but to novel allelic specifications and
interactions.

Since many fundamental aspects of current plant breeding procedures are not well understood,
further data relating to the genetics of crop species may help to shed light on the genetic
gains obtained from plant breeding. For example, in successful plant breeding programmes, the
genetic base often becomes narrower rather than broader. Elite by elite crosses may be the
rule in these programmes. Molecular genetic markers have been widely employed to identify
cryptic and novel genetic variation among cultivars and related species and used to increase
the efficiency of selection for agronomic traits and the pyramiding of genes from different
genetic backgrounds.

Long-term selection programmes would be expected to lead to genetic fixation; however, this
has not been found to be the case so far and variation is still observed. Several mechanisms
for de novo variation have been described, including intragenic recombination, unequal
crossing over among repeated elements, transposon activity, DNA methylation, and
paramutation. Another important feature in plant breeding whose molecular basis is not
understood is heterosis, although it is used as the basis for many seed-producing industries.
Genomics and particularly transcriptomics are now being used to identify the heterotic genes
responsible for increasing crop yields. Comprehensive quantitative trait locus-based
phenotyping (phenomics) combined with genome-wide expression analysis should help to identify
the loci controlling heterotic phenotypes and thus improve the understanding of the role of
heterosis in evolution and the domestication of crop plants (Lippman and Zamir, 2007), and
finally to make it possible to predict hybrid performance.

Messenger RNA transcript profiling is an obvious candidate for functional genomic application
to plant breeding. Although direct selection at the gene transcript level using microarray or
real-time PCR may be a long-term goal, other genomic tools can be used to achieve shorter term
goals with more practical applications (Crosbie et al., 2006). Genetic modification of crops
today involves the interfacing of molecular biology, cell and tissue culture, and
genetics/breeding. The transfer of genes by cellular and molecular means will increase the
available gene pool and lead to second generation biotechnology plant products such as those
with a modified oil, protein, vitamin, or micronutrient content or those that have been
engineered to produce compounds that can be used as vaccines or anti-carcinogens.

While all these new innovations have been useful, practical plant breeding continues to be
based on hybridization and selection with little change in the basic procedures. A more
complete understanding of the mechanisms by which genetic and environmental variation modify
yield and composition is needed so that specific quantitative and qualitative targets can be
identified. To achieve this aim, the expertise of plant genomics (including various omics),
physiology and agronomy, as well as plant modelling techniques must be combined (Wollenweber
et al., 2005) and many logistic and genetic constraints also need to be resolved (Xu and
Crouch, 2008).

2 Molecular Breeding Tools: Markers and Maps

2.1 Genetic Markers

In conventional plant breeding, genetic variation is usually identified by visual selection.
However, with the development of molecular biology, it can now be identified at the molecular
level based on changes in the DNA and their effect on the phenotype. Molecular changes can be
identified by the many techniques that have been used to label and amplify DNA and to
highlight the DNA variation among individuals. Once the DNA has been extracted from plants or
their seeds, variation in samples can be identified using a polymerase chain reaction (PCR)
and/or hybridization process followed by polyacrylamide gel electrophoresis (PAGE) or
capillary electrophoresis (CE) to identify distinct molecules based on their sizes, chemical
compositions and charges. Genetic markers are used to tag and track genetic variation in DNA
samples.

Genetic markers are biological features that are determined by allelic forms and can be used
as experimental probes or tags to keep track of an individual, a tissue, cell, nucleus,
chromosome or gene. In classical genetics, genetic polymorphism represents allelic variation.
In modern genetics, genetic polymorphism is the relative difference at any genetic locus
across a genome. Genetic markers can be used to facilitate studies of inheritance and
variation.

Desirable genetic markers should meet the following criteria: (i) high level of genetic
polymorphism; (ii) co-dominance (so that heterozygotes can be distinguished from
homozygotes); (iii) clear distinct allele features (so that different alleles can be
identified easily); (iv) even distribution on the entire genome; (v) neutral selection
(without pleiotropic effect); (vi) easy detection (so that the whole process can be
automated); (vii) low cost of marker development and genotyping; and (viii) high
duplicability (so that the data can be accumulated and shared between laboratories).

Most molecular markers belong to the so-called anonymous DNA marker type and generally
measure apparently neutral DNA variation. Suitable DNA markers should represent genetic
polymorphism at the DNA level and should be expressed consistently across tissues, organs,
developmental stages and environments; their number should be almost unlimited; there should
be a high level of natural polymorphism; and they should be neutral with no effect on the
expression of the target trait. Finally, most DNA markers are co-dominant or can be converted
into co-dominant markers.
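
The value of co-dominance (criterion ii) can be illustrated with a minimal sketch. It assumes a hypothetical single locus with alleles written "A" and "a", an F2 derived from an Aa x Aa cross, and a dominant marker that produces a band only when allele "A" is present; the function names are illustrative and not part of any marker-scoring software.

```python
from collections import Counter
from itertools import product


def codominant_score(genotype):
    """Alleles visible with a co-dominant marker: a heterozygote shows both."""
    return "/".join(sorted(set(genotype)))


def dominant_score(genotype, band_allele="A"):
    """With a dominant marker only presence or absence of one allele is seen."""
    return "band" if band_allele in genotype else "no band"


# F2 progeny of a hypothetical Aa x Aa cross: one gamete from each parent.
f2_genotypes = ["".join(pair) for pair in product("Aa", "Aa")]

codominant_classes = Counter(codominant_score(g) for g in f2_genotypes)
dominant_classes = Counter(dominant_score(g) for g in f2_genotypes)

print(codominant_classes)  # three classes in a 1:2:1 ratio
print(dominant_classes)    # two classes in a 3:1 ratio
```

With the co-dominant marker the heterozygote forms its own class, whereas the dominant marker pools it with one homozygote, so less genotype information is recovered from each plant scored.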

Yunbi Xu 2010. Molecular Plant Breeding (Yunbi Xu) 21

Table 2.1 lists the major molecular marker technologies that are currently available. Only a
selection of widely-used representative types of markers will be discussed in this section.
Figure 2.1 shows the molecular mechanism of several major DNA markers and the genetic
polymorphisms that can be generated by restriction site or PCR priming site mutation,
insertion, deletion or by changing the number of repeat units between two restriction or PCR
priming sites and nucleotide mutation resulting in a single nucleotide polymorphism (SNP).
There are several comprehensive reviews that cover all the important DNA markers, e.g. Reiter
(2001), Avise (2004), Mohler and Schwarz (2005) and Falque and Santoni (2007). Further
information regarding the application of DNA markers in genetics and breeding can be found in
Lörz and Wenzel (2005). After a brief review of the classical markers, DNA markers will be
discussed in more detail in this section.

2.1.1 Classical markers

Morphological markers

In the 1860s, following his studies on the garden pea (Pisum sativum), G.J. Mendel proposed
two basic rules of genetics,

Table 2.1. DNA markers and related major molecular techniques.

Southern blot-based markers
  Restriction fragment length polymorphism (RFLP)
  Single strand conformation polymorphic RFLP (SSCP-RFLP)
  Denaturing gradient gel electrophoresis RFLP (DGGE-RFLP)
PCR-based markers
  Randomly amplified polymorphic DNA (RAPD)
  Sequence tagged site (STS)
  Sequence characterized amplified region (SCAR)
  Random primer-PCR (RP-PCR)
  Arbitrary primer-PCR (AP-PCR)
  Oligo primer-PCR (OP-PCR)
  Single strand conformation polymorphism-PCR (SSCP-PCR)
  Small oligo DNA analysis (SODA)
  DNA amplification fingerprinting (DAF)
  Amplified fragment length polymorphism (AFLP)
  Sequence-related amplified polymorphism (SRAP)
  Target region amplified polymorphism (TRAP)
  Insertion/deletion polymorphism (Indel)
Repeat sequence-based markers
  Satellite DNA (repeat unit containing several hundred to thousand base pairs (bp))
  Microsatellite DNA (repeat unit containing 2 to 5 bp)
  Minisatellite DNA (repeat unit containing more than 5 bp)
  Simple sequence repeat (SSR) or simple sequence length polymorphism (SSLP)
  Short repeat sequence (SRS)
  Tandem repeat sequence (TRS)
mRNA-based markers
  Differential display (DD)
  Reverse transcription PCR (RT-PCR)
  Differential display reverse transcription PCR (DDRT-PCR)
  Representational difference analysis (RDA)
  Expressed sequence tags (EST)
  Sequence target sites (STS)
  Serial analysis of gene expression (SAGE)
Single nucleotide polymorphism-based markers
  Single nucleotide polymorphism (SNP)

[Figure 2.1 diagram. Panels, with the markers generated by each mechanism:
A. Mutation at enzyme restriction or PCR priming site: RFLP, AFLP, CAPS; RAPD, AP-PCR, DAF, ISSR
B. Insertion between enzyme restriction or PCR priming sites: RFLP, AFLP, CAPS, RAPD, AP-PCR, DAF, ISSR
C. Deletion between enzyme restriction or PCR banding sites: RFLP, AFLP, CAPS, RAPD, AP-PCR, DAF, ISSR
D. Change of tandem repeat units between enzyme restriction or PCR banding sites: SSR, VNTR, ISSR
E. Single nucleotide mutation (GGACTACGT C GTATCATCGTACCG versus GGACTACGT A GTATCATCGTACCG): SNP
Symbols in the key: enzyme restriction site; PCR primer; tandem repeat sequence.]

Fig. 2.1. Molecular basis of major DNA markers. Parts A to E show different ways in which DNA
markers (listed for each part) can be generated. The cross in part A indicates that mutation
has eliminated the priming site. Abbreviations: as defined in Table 2.1; VNTR, variable number
of tandem repeat; CAPS, a DNA marker generated by specific primer PCR combined with RFLP;
ISSR, inter simple sequence repeat.
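
Part E of Fig. 2.1 can be reproduced as a small worked example. The sketch below compares the two top-strand sequences printed in the figure position by position; the function name and the assumption that the two alleles are already aligned and of equal length are illustrative choices, not a description of any SNP-calling pipeline.

```python
def find_snps(allele_1, allele_2):
    """Return (position, base_1, base_2) for every mismatch between two
    aligned sequences of equal length; positions are counted from 1."""
    if len(allele_1) != len(allele_2):
        raise ValueError("sequences must be aligned and of equal length")
    return [
        (index + 1, base_1, base_2)
        for index, (base_1, base_2) in enumerate(zip(allele_1, allele_2))
        if base_1 != base_2
    ]


# Top strands of the two alleles shown in part E of Fig. 2.1.
allele_c = "GGACTACGTCGTATCATCGTACCG"
allele_a = "GGACTACGTAGTATCATCGTACCG"

snps = find_snps(allele_c, allele_a)
print(snps)  # a single C-to-A difference at position 10
```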

which were later known as the Mendelian laws of equal segregation and independent assortment.
Mendel selected individuals which differed in a particular trait and used them as the parental
lines in a cross breeding experiment to determine the phenotype of the offspring with regard
to the selected trait. The term 'phenotype' (derived from Greek) literally means 'the form
that is shown' and is used by both geneticists and breeders. The seven pairs of contrasting
phenotypes studied by Mendel included round versus wrinkled seeds, yellow versus green seeds,
purple versus white petals, inflated versus pinched pods, green versus yellow pods, axial
versus terminal flowers and long versus short stems. The plants in the segregated populations
of the pea, such as F2 and backcross, were classified into two distinct groups depending on
their phenotypes. These contrasting morphological phenotypes are the starting point for any
genetic analysis and can be mapped to particular chromosomes using the Mendelian laws of
inheritance and can thus be used as morphological markers of the genome and the particular
trait.

Morphological markers therefore generally represent genetic polymorphisms which are visible
as differences in appearance, such as the relative difference in plant height and colour,
distinct differences in response to abiotic and biotic stresses, and the presence/absence of
other specific morphological characteristics. A large number of variants showing particular
morphological or physiological phenotypes have been generated by tissue culture and mutation
breeding. Using selection techniques these variants can be genetically stabilized and then
used as morphological markers.

Some genetic stocks contain more than one morphological marker; for example, there are a total
of over 300 morphological markers available for genetic studies in rice (Khush, 1987) and more
are being created for functional genomics. Many morphological marker stocks are also available
for tomato
(http://www.plantpath.wisc.edu/GeminivirusResistantTomatoes/MERC/Tomato/Tomato.html), maize
(Neuffer et al., 1997) and soybean (Palmer and Shoemaker, 1998). Many of these markers have
been linked with other agronomic traits.

Morphological markers are usually mapped by classical two- or three-point linkage tests. The
linkage groups are established and the order of the markers and the relative distance between
any two are determined by their recombination frequencies. Relatively complete linkage maps
have been constructed in many crop species using morphological markers and these maps provide
the fundamental information for the genetic mapping of many physiological and biochemical
traits.

However, it is difficult to construct a relatively saturated genetic map because of the
limitation in the number of morphological markers with distinguishable polymorphisms. In
addition, many morphological markers have deleterious effects on phenotypes and some are
significantly affected by other factors such as environments or maturity, which results in
potential problems when these markers are used for genetics and plant breeding.

Cytological markers

By studying the morphology, number and structure of chromosomes from different species,
particular cytogenetic features can be found, such as various types of aneuploidy, variants of
chromosome structure and abnormal chromosomes. These can be used as genetic markers to locate
other genes on to chromosomes and determine their relative positions, or used for genetic
mapping via chromosome manipulations such as chromosome substitution.

The structural features of chromosomes can be shown by chromosome karyotype and bands. The
banding patterns are indicated by colour, width, order and position, revealing the difference
in distributions of euchromatin and heterochromatin. There are Q bands (produced by quinacrine
hydrochloride), G bands (produced by Giemsa stain) and R bands (reversed Giemsa). These
chromosome landmarks are not only useful for characterizing normal chromosomes but also for
detecting chromosome mutation.
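
Returning briefly to the two- and three-point linkage tests used to map morphological markers: the arithmetic behind them can be sketched as follows. The progeny counts are hypothetical testcross data invented for illustration, the loci are simply labelled A, B and C, and the rule used for ordering (the two loci with the largest pairwise recombination frequency lie at the ends) is a simplification that assumes all three loci are linked.

```python
def recombination_frequency(parental, recombinant):
    """Fraction of testcross progeny that carry a recombinant marker combination."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no progeny scored")
    return recombinant / total


def order_three_loci(rf_ab, rf_bc, rf_ac):
    """Infer the order of loci A, B and C from pairwise recombination
    frequencies: the pair with the largest value flanks the third locus."""
    pairwise = {"AB": rf_ab, "BC": rf_bc, "AC": rf_ac}
    outer_pair = max(pairwise, key=pairwise.get)
    middle = ({"A", "B", "C"} - set(outer_pair)).pop()
    return outer_pair[0] + middle + outer_pair[1]


# Hypothetical counts of (parental, recombinant) progeny for each marker pair.
rf_ab = recombination_frequency(180, 20)
rf_bc = recombination_frequency(170, 30)
rf_ac = recombination_frequency(154, 46)

print(rf_ab, rf_bc, rf_ac)
print(order_three_loci(rf_ab, rf_bc, rf_ac))
```

For small values the recombination frequency multiplied by 100 approximates the map distance in centimorgans, which is why the largest pairwise value identifies the two outer markers.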

Cytological markers have been widely used to identify linkage groups within specific
chromosomes and have been widely applied in physical mapping. However, because of the limited
number and resolution, they have limited applications in genetic diversity analysis, genetic
mapping and marker-assisted selection (MAS).

Protein markers

Isozymes are structural variants of an enzyme and while they differ from the original enzyme
in molecular weight and mobility in an electric field, they have the same catalytic activity.
The difference in enzyme mobility is caused by point mutations resulting in amino acid
substitution such that isozymes reflect the products of different alleles rather than
different genes. Therefore, isozymes can be genetically mapped on to chromosomes and then used
as genetic markers for mapping other genes. Isozyme markers are based on their biochemistry
and thus are also known as biochemical or protein markers.

However, their use as markers is limited. For example a total of 57 isozymes representing
about 100 loci have been identified in plants (Vallegos and Chase, 1991) but for specific
species only 10 to 20 isozymes are available so that they cannot be used to construct a
complete genetic map. Each isozyme can only be identified with a specific stain which also
limits their use in practice.

2.1.2 DNA markers

RFLP

Botstein et al. (1980) first used DNA restriction fragment length polymorphism (RFLP) in human
linkage mapping and this pioneered the utilization of DNA polymorphisms as genetic markers. It
is known that the genomes of all organisms show many sites of neutral variation at the DNA
level. These neutral variant sites do not have any effect on the phenotype. In some cases a
neutral site is nothing more than a single nucleotide difference within a gene or between
genes; and in others it represents the site of a variable number of tandem repeats of 'junk'
DNA present between genes. The development of RFLP markers has accelerated the construction of
molecular linkage maps for many organisms, improved the accuracy of gene location, and reduced
the time required to establish a complete linkage map.

The digestion of purified DNA using restriction enzymes, which cut the DNA strand wherever
there is a recognition site sequence (usually four to eight base pairs), leads to the
formation of RFLPs which yield a molecular fingerprint that may be unique to a particular
individual. If the bases are positioned at random in the genome, an enzyme having a
recognition site with six bases will cleave the DNA at every 4096 bases on average (4^6). A
genome of 10^9 bases could thus produce around 250,000 restriction fragments of variable
length. Gel electrophoresis on such a large number of genomic DNA digestion products produces
a continuous smear image. Particular fragments that are homologous between several
individuals, and possibly allelic, can be separated only by means of molecular probes using
the Southern technique (Southern, 1975). RFLP analysis includes the following steps (Fig.
2.2):

1. DNA isolation: a significant amount of DNA must be isolated from multiple individuals from
target genotypes (parents and segregating populations, germplasm survey, garden blot, etc.)
and purified to a fairly stringent degree as contaminants can often interfere with the
restriction enzyme and inhibit its ability to digest the DNA.
2. Restriction digestion: restriction enzyme is added to purified genomic DNA under buffered
conditions. The enzyme cuts at recognition sites throughout the genome and leaves behind
hundreds of thousands of fragments.
3. Gel electrophoresis: digested products (restriction fragments) are electrophoresed on
agarose gel and when visualized appear as smears because of the large number of fragments.
4. The DNA in the agarose gel is denatured using NaOH solution and then neutralized.
5. The DNA fragments are transferred to a nitrocellulose membrane using Southern blotting.
6. Probe visualization: the membrane-bound genomic DNA is probed by hybridization using a
cloned fragment of the genome of interest or a genome from a relatively close species as the
probe.
7. The membrane is washed to remove non-specifically hybridized DNA.
8. In most cases the sizes of the fragments are determined by radioactive methods. The
probe-restriction enzyme combinations may identify two or more differently sized fragments.
Polymorphism is revealed whenever the recognized fragments are of non-identical lengths.

Fig. 2.2. RFLP workflow from DNA extraction to radio-autograph. Modified from Xu and Zhu
(1994). Stages shown for two alleles, A1 and A2: DNA extraction; restriction fragments;
agarose gel electrophoresis; DNA denaturing; Southern blotting; hybridization; wash;
radioactive autograph.

Differences in size of restriction fragments are due to: (i) base pair changes that result in
gain or loss of restriction sites; and (ii) insertions/deletions at the restriction sites
within the restriction fragments on which the probe sequence is located.

Molecular probes are DNA fragments isolated and individualized by cloning or PCR
amplification. They may originate from fragmented total genomic DNA and thus contain coding or
non-coding sequences, unique or repeated, of nuclear or cytoplasmic origin. They may also be
complementary DNA (cDNA). The standard procedure for developing genomic DNA probes is to
digest total DNA with a methylation-sensitive enzyme (e.g. PstI), thereby enriching the
library for single-copy sequences (Burr et al., 1988). Typically, the digested DNA is size
fractionated on a preparative agarose gel. DNA fragments ranging from 500 to 2000 bp are
excised and eluted for cloning into a plasmid vector (e.g. pUC18). Digests of the plasmids are
screened for inserts and their lengths can be estimated. Southern blots of the inserts can be
probed with total sheared genomic DNA to select clones that hybridize to single- and low-copy
sequences and to eliminate clones that hybridize to medium- and high-copy repeated sequences.
Single- and low-copy probes are screened for RFLPs among a sample of genotypes using genomic
DNAs digested with restriction endonucleases (one per assay). Typically, in species with
moderate to high polymorphism rates, two to four restriction endonucleases with hexanucleotide
recognition sites are tested. EcoRI, EcoRV and HindIII are widely used. In species with low
polymorphism rates, additional restriction endonucleases can be tested to increase the chance
of finding a polymorphism. Both the theory and the techniques for RFLP analysis in plant
genome mapping have been intensively reviewed (Botstein et al., 1980; Tanksley et al., 1988).
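
Two points from the RFLP description lend themselves to a small worked example: the expected number of fragments from a six-base cutter, and the way loss of a single restriction site changes fragment length. The sketch below is an in-silico illustration only. The two allele sequences are invented toy sequences, the helper names are hypothetical, and the expected-count formula assumes random, equally frequent bases as the text does. The EcoRI site (GAATTC, cut after the first G) is used because EcoRI is one of the enzymes named above.

```python
import re


def expected_fragment_count(genome_size, site_length):
    """Expected number of fragments when a site of the given length occurs
    by chance once every 4**site_length bases."""
    return genome_size / 4 ** site_length


def digest(sequence, site, cut_offset):
    """Cut a linear sequence at every occurrence of `site` and return the
    fragment lengths in order; `cut_offset` is the cut position within the site."""
    cuts = [match.start() + cut_offset for match in re.finditer(site, sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [end - start for start, end in zip(bounds, bounds[1:])]


print(round(expected_fragment_count(10**9, 6)))  # about 244,000 fragments

# Toy alleles: allele_2 has a point mutation (C to T) in the second EcoRI site.
allele_1 = "ATGC" * 5 + "GAATTC" + "ATGC" * 10 + "GAATTC" + "ATGC" * 3
allele_2 = "ATGC" * 5 + "GAATTC" + "ATGC" * 10 + "GAATTT" + "ATGC" * 3

fragments_1 = digest(allele_1, "GAATTC", cut_offset=1)
fragments_2 = digest(allele_2, "GAATTC", cut_offset=1)
print(fragments_1)
print(fragments_2)
```

A probe that hybridizes to the region between the two sites would detect the middle fragment in the first allele and a longer fragment in the second, which is the length polymorphism that Southern blotting reveals.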

Most RFLP markers are co-dominant and locus specific. RFLP genotyping is highly
reproducible and the methodology is simple and requires no special instrumentation.
High-throughput markers (e.g. cleaved amplified polymorphic sequence (CAPS) or
insertion/deletion (indel) markers) can be developed from RFLP probe sequences. The CAPS
technique, also known as PCR-RFLP, consists of digesting a PCR-amplified fragment with
one or several restriction enzymes, and detecting the polymorphism by the
presence/absence of restriction sites (Konieczny and Ausubel, 1993).

RFLP markers are powerful tools for comparative and synteny mapping. However, RFLP
analysis requires large amounts of high quality DNA, has low genotyping throughput and
is very difficult to automate. Most genotyping involves radioactive methods so its use
is limited to specific laboratories. RFLP probes must be physically maintained and it is
therefore difficult to share them between laboratories. In addition, the level of
polymorphism detected by RFLP is relatively low and selection for polymorphic parental
lines is a limiting step in the development of a complete RFLP map.

RAPD

Williams et al. (1990) and Welsh and McClelland (1990) independently described the
utilization of a single, random-sequence oligonucleotide primer in a low stringency PCR
(35–45°C) for the simultaneous amplification of several discrete DNA fragments, referred
to as random amplified polymorphic DNA (RAPD) and arbitrary primed PCR (AP-PCR),
respectively. Another related technique is DNA amplification fingerprinting (DAF)
(Caetano-Anollés et al., 1991). These methods differ from one another in primer length,
the stringency of the conditions and the method of separation and detection of the
fragments. All of them can be used to detect RAPDs.

The principle of RAPD consists of a PCR on the DNA of the individual under study using a
short primer, usually ten nucleotides, of arbitrary sequence. The primer, which binds to
many different loci, is used to amplify random sequences from a complex DNA template at
sites that are complementary to it (or include a limited number of mismatches). This
means that the amplified fragments generated by PCR depend on the length and size of
both the primer and the target genome. Ten-base oligomers of varying GC content (ranging
from 40 to 100%) are usually used. If two hybridization sites are close to one another
(within about 3000 bp) and in opposite directions, that is, in a configuration that will
allow the PCR, amplification will take place. The amplified products (of up to 3.0 kb)
are usually separated on agarose gels and visualized using ethidium bromide staining.
The use of a single 10-mer oligonucleotide promotes the generation of several discrete
DNA products and these are considered to originate from different genetic loci.
Polymorphisms result from mutations or rearrangements either at or between the primer
binding sites and are visible in conventional agarose gel electrophoresis as the
presence or absence of a particular RAPD band. RAPDs predominantly provide dominant
markers but homologous allele combinations can sometimes be identified with the help of
detailed pedigree information.

RAPDs have several advantages and for this reason they are widely used (Karp and
Edwards, 1997). (i) Neither DNA probes nor sequence information is required for the
design of specific primers. (ii) The procedure does not involve blotting or
hybridization steps, thus making the technique quick, simple and efficient. (iii) RAPDs
require relatively small amounts of DNA (about 10 ng per reaction) and the procedure can
be automated; they are also capable of detecting higher levels of polymorphism than
RFLPs. (iv) Development of markers is not required and the technology can be applied to
virtually any organism with minimal initial development. (v) The primers can be
universal and one set of primers can be used for any species. In addition, RAPD products
of interest can be cloned, sequenced and converted into other types of PCR-based markers
such as sequence tagged sites (STS), sequence characterized amplified regions (SCAR),
etc.

Reproducibility affects the way in which RAPD bands can be standardized for comparison
across laboratories, samples and trials and whether RAPD marker information can be
accumulated or shared. Due to frequently observed problems with reproducibility of
overall RAPD profiles and specific bands, this marker class is often treated with
reserve. In replication studies by Pérez et al. (1998), mispriming error amounted to
60%. Several factors have been shown to affect the number, size and intensity of bands.
These include PCR buffers, deoxynucleotide triphosphates (dNTPs), Mg²⁺ concentration,
cycling parameters, source of Taq polymerase, condition and concentration of template
DNA and primer concentration. Results obtained by RAPDs are highly prone to user error
and bands obtained can vary considerably between different runs of the same sample. To
correct the problems that may be encountered when carrying out RAPD-PCR, it is important
to bear in mind the following: (i) the concentration of DNA can alter the number of
bands; (ii) RAPD profiles vary depending on the Mg²⁺ concentration, and the PCR buffer
provided by Taq polymerase suppliers may or may not contain Mg²⁺ ions; (iii) there are
different sources of Taq polymerase and there is great variation between profiles
produced using Taq polymerase obtained from different companies; (iv) there are a large
number of alternative cycling times and temperatures which are equally important and
depend on the type of machine used and even the wall thickness of the PCR tubes.

Generally if a PCR does not work there is likely to be something wrong with the template
DNA, primers, Taq polymerase or choice of conditions. Initially it is important to try
and repeat the PCR under the same conditions to ensure that there was not a simple error
that resulted in the failure. In addition it is recommended that both positive and
negative controls are included. A positive control with a template known to amplify well
will ensure that all reagents have been added and that they are all functioning. A
negative control without template DNA will reveal any contamination. In most cases if
the PCR does not work and it is not clear what might be causing the problem, it is worth
starting from the beginning by disposing of all the reagents used and preparing fresh
ones. A careful experiment revealed that reproducibility could be improved and Taberner
et al. (1997) reported that 3396 out of 3422 bands (99.2%) were reproducible.

On the other hand, low reproducibility is a major limitation of RAPD markers,
particularly in ongoing genetic and plant breeding programmes in which accumulated
information, markers and marker data are shared between laboratories and experiments.
RAPD markers may still find their applications in independent genetic diversity and
phylogenetic studies that do not depend on data sharing or accumulation. As RAPD markers
can be converted into other types of markers, they have a unique role in the development
of target markers for crop species that have limited molecular markers available to
cover the whole genome.

To overcome the problem associated with RAPD analysis, Paran and Michelmore (1993)
converted RAPD fragments into simple and robust PCR markers known as SCARs. This
procedure increases the reproducibility of RAPD markers and also avoids the occurrence
of non-homologous markers of equal molecular weight. These specific markers are obtained
by cloning polymorphic RAPD bands, which are then sequenced; specific primers are
designed, usually by extending the original decamer primer sequence by 10–15 bases, so
that only the band of interest is amplified. In general, DNA can be isolated from
agarose gels, cloned and sequenced to produce the starting DNA template for the
development of a variety of PCR-based markers. The cloned and sequenced DNA fragments
can then be used for the development of CAPS, single strand conformation polymorphism
(SSCP) or SNP markers.

AFLP

Amplified fragment length polymorphism (AFLP; Zabeau and Voss, 1993; Vos et al., 1995)
is based on the selective PCR amplification of restriction fragments from a total
double-digest of genomic DNA under high stringency conditions, that is, it combines
polymorphism at restriction sites with hybridization of arbitrary primers; because of
this, AFLP is also called selective restriction fragment amplification (SRFA). It was
perfected by the company Keygene in the Netherlands for initial use in plant improvement
and has been patented. The AFLP technique combines the power of RFLP with the
flexibility of PCR-based markers and provides a universal, multi-locus marker technique
that can be applied to complex genomes from any source. The method is based on the
identification of AFLP using selective PCR amplification of digested/ligated genomic or
cDNA templates separated on a polyacrylamide gel, including restriction–ligation,
pre-amplification and selective amplification (Fig. 2.3). The purified genomic DNA is
first cleaved with one or more restriction endonucleases, typically a 6-cutter (EcoRI,
PstI or HindIII) and a 4-cutter (MseI or TaqI). Adaptors of 18–20 bp and of known
sequence, adapted to the sticky ends of the restriction sites, are then added to the
ends of DNA fragments by a ligation reaction using T4 DNA ligase. DNA amplification is
carried out using primers with the sequence specificity of the adaptor to generate a
subset of fragments of different sizes (up to 1 kb). The primers also contain one or
more bases at their 3' ends that provide amplification selectivity by limiting the
number of perfect sequence matches between the primer and the pool of available
adaptor/DNA templates. The resulting amplification products (50–400 bp size range) are
typically observed by radio-labelling one of the primers followed by fragment separation
on acrylamide gels to identify polymorphisms (changes in restriction sizes).

[Fig. 2.3 flowchart steps: whole genome DNA; restriction with EcoRI and MseI; ligation
of EcoRI and MseI adaptors; pre-amplification with EcoRI primer +A and MseI primer +C;
selective amplification with tagged (*) EcoRI primer +GA and MseI primer +CA;
electrophoresis.]

Fig. 2.3. AFLP flowchart. Adaptor DNA = short double strand DNA molecules, 18–20 bp in
length, representing a mixture of two types of molecules. Each type is compatible with
one restriction enzyme generated DNA end. Pre-amplification uses selective primers,
which contain an adaptor DNA sequence plus one or two random bases at the 3' end for
reading into the genomic fragments. Primers for re-amplification have the
pre-amplification primer sequence plus one or two additional bases at the 3' end. A tag
(*) is attached at the 5' end of one of the re-amplification primers for detecting
amplified molecules.

An AFLP primer is composed of a synthetic adaptor sequence, the restriction endonuclease
recognition sequence and an arbitrary, non-degenerate selective sequence (typically one,
two or three nucleotides). In the first step, 500 ng of genomic DNA is completely
digested with two restriction enzymes, one frequent cutter (4-bp recognition site) and
one rare cutter (6-bp recognition site). Oligonucleotide adaptors are ligated to the
ends of each restriction fragment and serve, together with the restriction site
sequences, as target sites for primer annealing, one end
with a complementary sequence for the rare cutter and the other with the complementary
sequence for the frequent cutter. In this way only fragments which have been cut by both
the frequent cutter and the rare cutter will be amplified. Primers are designed from the
known sequence of the adaptor, plus one to three selective nucleotides which extend into
the fragment sequence. Sequences not matching these selective nucleotides in the primer
will not be amplified, so that the specific amplification of only those fragments
matching the primers is achieved. The option to permute the order of the selective bases
and to recombine the primers with each other will theoretically lead to the gradual
collection of all restriction fragments from a particular enzyme combination that are of
a suitable size for DNA fragment analysis from a genotype. The multiplex ratio of an
AFLP assay is a function of the number of selective nucleotides in the AFLP primer
combination, the selective nucleotide motif, GC content and physical genome size and
complexity. Typically, two selective nucleotides are used for species with small genomes
(1 × 10⁸–5 × 10⁸ bp), e.g. Arabidopsis thaliana L. (1 × 10⁸ bp) and rice (Oryza sativa
L.) (4 × 10⁸ bp), and three selective nucleotides are used for species with large
genomes (5 × 10⁸–6 × 10⁹ bp), e.g. maize, soybean, sunflower and many others. It is
theoretically possible to use several tens of combinations of restriction enzymes at
sites of four to six bases and a large number of combinations of selective bases on the
amplification primers. Thus, as indicated by Falque and Santoni (2007), the
restriction–amplification combinations are nearly infinite.

AFLP products can be separated in high-resolution electrophoresis systems. The number of
bands produced can be manipulated by the number of selective nucleotides and the
nucleotide motifs used. A well-balanced number of amplified restriction fragments ranges
from 50 to 150. A major improvement has been made by switching from radioactive to
fluorescent dye-labelled primers for the detection of fragments in gel-based or
capillary DNA sequencers in which fluorescently labelled fragments pass the detector
near the bottom of the gel/end of the capillary, resulting in a linear spacing of DNA
fragments and therefore increasing the resolution over the whole size range (Schwarz et
al., 2000).

In general, AFLP assays can be carried out using relatively small DNA samples (typically
1–100 ng per individual). AFLP has a very high multiplex ratio and genotyping throughput
and is relatively reproducible across laboratories. Simple off-the-shelf technology can
be applied to virtually any organism with no formal marker development required and, in
addition, a set of primers can be used for different species. However, there are
limitations to the AFLP assay. (i) The maximum polymorphic information content for any
bi-allelic marker is 0.5. (ii) High quality DNA is needed to ensure complete restriction
enzyme digestion. Rapid methods for isolating DNA may not produce sufficiently clean
template DNA for AFLP analysis. (iii) Proprietary technology is needed to score
heterozygotes and ++ homozygotes; otherwise AFLPs must be dominantly scored. (iv) AFLP
markers often cluster densely in centromeric regions in species with large genomes, e.g.
barley (Qi et al., 1998) and sunflower (Gedil et al., 2001). (v) Developing
locus-specific markers from individual fragments can be difficult. (vi) AFLP primer
screening is often necessary to identify optimal primer specificities and combinations;
otherwise the assays can be carried out using off-the-shelf technology. (vii) There are
relatively high technical demands in AFLP analysis including radio-labelling and skilled
manpower. (viii) Marker development is complicated and not cost-effective. (ix)
Reproducibility is relatively low compared to RFLP and simple sequence repeat (SSR)
markers but better than RAPD markers, as AFLP reveals large numbers of bands and not all
the bands will be comparable across laboratories or trials due to potential false
positives, false negatives and complicated gel backgrounds.

The AFLP technique can be modified so that one primer is obtained from a known
multi-copy sequence to detect sequence-specific amplification polymorphisms. This
approach was used successfully to generate
genome-wide Bare-1 retrotransposon-like markers in barley (Waugh et al., 1997) and
diploid Avena (Yu and Wise, 2000) as well as in lucerne by making use of consensus
sequences from long terminal repeats (LTRs) of the Tms1 retrotransposon (Porceddu et
al., 2002). The cDNA-AFLP technique (Bachem et al., 1996), which applies the standard
AFLP protocol to a cDNA template, was used to display transcripts whose expression was
rapidly altered during race-specific resistance reactions, for the isolation of
differentially expressed genes from a specific chromosome region using aneuploids and
for the construction of genome-wide transcription maps (as reviewed by Mohler and
Schwarz, 2005). In addition, there are several modified AFLP techniques based on the use
of endonucleases such as single endonuclease (MspI) AFLP (Boumedine and Rodolakis,
1998), three endonuclease-AFLP (van der Wurff et al., 2000), and second digestion AFLP
(Knox and Ellis, 2001). Developments in the detection of AFLP include the replacement of
radioactive detection with silver staining, fluorescent AFLP or agarose gels for single
endonuclease AFLP. Recent studies have addressed specific areas of the AFLP technique
including comparison with other genotyping methods, assessment of errors, homoplasy,
phylogenetic signal and appropriate analysis techniques. The study by Meudt and Clarke
(2007) provides a synthesis of these areas and explores new directions for the AFLP
technique in the genomic era.

SSR

Microsatellites, also known as SSRs, short tandem repeats (STRs) or sequence-tagged
microsatellite sites (STMS), are tandemly repeated units of short nucleotide motifs that
are 1–6 bp long. Di-, tri- and tetranucleotide repeats such as (CA)n, (AAT)n and (GATA)n
are widely distributed throughout the genomes of plants and animals (Tautz and Renz,
1984). One of the most important attributes of microsatellite loci is their high level
of allelic variation, making them valuable as genetic markers.

The unique sequences bordering the SSR motifs provide templates for specific primers to
amplify the SSR alleles via PCR. The resulting polymorphisms, referred to as simple
sequence length polymorphisms (SSLPs), pertain to the number of repeat units that
constitute the microsatellite sequence. The rates of mutation of SSRs are about 4 × 10⁻⁴
to 5 × 10⁻⁶ per allele and per generation (Primmer et al., 1996). The predominant
mutation mechanism in microsatellite tracts is slipped-strand mispairing (Levinson and
Gutman, 1987). When slipped-strand mispairing occurs within a microsatellite array
during DNA synthesis, it can result in the gain or loss of one or more repeat units
depending on whether the newly synthesized DNA chain or the template chain loops out.
The relative propensity for either chain to loop out seems to depend in part on the
sequences making up the array and in part on whether the event occurs on the leading
(continuous DNA synthesis) or lagging (discontinuous DNA synthesis) strand (Freudenreich
et al., 1997). SSR loci are individually amplified by PCR using pairs of oligonucleotide
primers specific to unique DNA sequences flanking the SSR sequence.

Microsatellites may be obtained by screening sequences in databases or by screening
libraries of clones. If no sequence is available, microsatellite markers can be
developed in the following steps: construct an enriched or unenriched small-insert clone
library; screen it by hybridizing a labelled oligo (with the SSR motif of interest);
sequence positive clones; design primers in single copy regions flanking SSR repeats
such that the amplified fragments will be > 50 bp and < 350 bp; and identify size
polymorphism on PAGE gels. For multiplexing, design primers with similar melting
temperature (Tm) and a range of expected amplicon sizes to have non-overlapping groups
of markers on a gel. In rice, both an enzyme-digested (Chen, X. et al., 1997) and a
physically-sheared library (Panaud et al., 1996) were constructed from cultivar IR36
based on size-selected DNA in the 300–800-bp range. These libraries were screened for
the presence of (GA)n microsatellites by plaque and colony hybridization. A
pre-sequencing screening step was used
to eliminate clones where the microsatellite repeat was too near one of the cloning
sites to permit accurate design of primers and to determine which end should be
sequenced with priority. The basic steps include:

• PCR amplification of clone inserts and determination of their lengths before
  sequencing. Short and long insert clones are usually discarded.
• Selected clones are sequenced and searched for SSRs.
• Sequences within motif classes are grouped and aligned using sequence alignment
  software to identify redundant sequences.
• Oligonucleotide primers are designed for unique DNA sequences flanking non-redundant
  SSRs.
• Primers are tested and genotypes are screened for SSR length polymorphisms.

An alternative source of SSRs is to utilize expressed sequence tag (EST) and other
sequence databases (e.g. Kantety et al., 2002). SSRs can be identified computationally,
using a BLAST query (see Simple Sequence Repeat Identification Tool available at
www.gramene.org) and available genomic or EST sequences. Using this method, a total of
2414 new di-, tri- and tetra-nucleotide non-redundant SSR primer pairs, representing
2240 unique marker loci, were developed and experimentally validated in rice (McCouch et
al., 2002). SSR-containing sequences that consisted of perfect repeat motifs (> 24 bp in
length) flanked by 100 bp of unique sequence on either side of the SSR were chosen from
GenBank. Primer pairs containing 18–24 nucleotides devoid of secondary structure or
consecutive tracts of a single nucleotide, with a GC content of around 50% (Tm
approximately 60°C) and preferably G- or C-rich at the 3' end were automatically
designed. Using electronic PCR (e-PCR) to align these designed primer pairs against 3284
publicly sequenced rice BAC and PAC clones (representing about 83% of the total rice
genome), 65% of the SSR markers hit a BAC or PAC clone containing at least one
genetically mapped marker and could be mapped by proxy. Additional information based on
genetic mapping and nearest marker information provided the basis for locating a total
of 1825 designed markers along rice chromosomes.

Compared with library-derived SSRs, EST-derived SSRs are expected to display slightly
fewer polymorphisms as there is pressure for sequence conservation in the coding regions
(Scott, 2001). However, the availability of SSR markers from the expressed portion of
the genome might facilitate their transferability across genera compared to the low
efficiency of SSR markers that have been retrieved from gene-poor areas (Peakall et al.,
1998). This approach could be used in plant species with minimal resources and research
expenditure.

Once a plant species has been completely sequenced, the entire set of available SSRs in
the genome can be easily accessed through online databases. For example, the
International Rice Genome Sequencing Project identified 18,828 di-, tri- and
tetra-nucleotide SSRs that were over 20 bp in length and developed flanking primers for
use as SSR markers (IRGSP, 2005). The locations of these SSRs on the physical map of
rice in relation to other genetic markers can be found using the online Gramene Genome
Browser (http://www.gramene.org/Oryza_sativa_japonica/index.html).

The usual method of SSR genotyping is to separate PCR products by denaturing or
non-denaturing PAGE, with detection by radio-labelling, silver staining, or ethidium
bromide or SYBR staining, although distinguishing SSRs on agarose gels is sometimes
possible (Fig. 2.4). These assays can usually distinguish alleles which differ by 2–4 bp
or more.

Semi-automated SSR genotyping can be carried out by assaying fluorescently labelled PCR
products for length variants on an automated DNA sequencer (e.g. Applied Biosystems and
Li-Cor) (Fig. 2.4). One drawback of fluorescent SSR genotyping is the cost of
end-labelling primers with the necessary fluorophores, e.g. 6-carboxyfluorescein (FAM),
hexachloro-6-carboxyfluorescein (HEX) or tetrachloro-6-carboxyfluorescein (TET). SSR
length
polymorphisms can also be assayed using non-denaturing high pressure liquid
chromatography (HPLC). SSR alleles differing by several repeat units can often be
distinguished on agarose gels (Fig. 2.4).

SSRs assayed on polyacrylamide gels typically show characteristic stuttering. Stutter
bands are artefacts produced by DNA polymerase slippage. Typically, the most prominent
stutter bands are +1 and −1 repeats (e.g. + or − 2 bp for a di-nucleotide repeat), and,
if visible, the next most prominent stutter bands are +2 and −2 repeats. Stuttering
reduces the resolution between alleles such that 2- or possibly 4-bp differences between
alleles cannot be sharply or unequivocally distinguished on polyacrylamide gels. Figure
2.4 shows examples of different genotyping systems used for SSR analysis including
multiplexing and stutter bands.

Fig. 2.4. Examples of genotyping systems used for SSR analysis. Panels: agarose
gel-based SSR genotyping; PAGE gel-based SSR genotyping; semi-automated SSR genotyping;
automated SSR genotyping using fluorescent labelling (allele peaks at 156 and 158);
stutter bands and multiple alleles.

Another source of noise is the incomplete addition of non-templated adenine to PCR
products, thereby producing adenylated (+A) and non-adenylated (−A) DNA fragments
(Magnuson et al., 1996). Adding a pigtail sequence (e.g. GTTTCTT) to the 5' end of the
reverse primer promotes the adenylation of the 3' end of the forward strand (Brownstein
et al., 1996), thereby virtually eliminating the −A products and producing a more
homogeneous set of fragments.

SSR markers are characterized by their hypervariability, reproducibility, co-dominant
nature, locus specificity and random dispersion throughout most genomes. In addition,
SSRs are reported to be more variable than RFLPs or RAPDs. The advantages of SSRs are
that they can be readily analysed by PCR and are easily detected on polyacrylamide gels.
SSLPs with large size differences can also be detected on agarose gels. SSR markers can
be multiplexed, either functionally by pooling independent PCR products or by true
multiplex-PCR. Their genotyping throughput is high and can be automated. In addition,
start-up costs are low for manual assay methods (once the markers have been developed)
and SSR assays require only very small DNA samples (100 ng per individual).

The disadvantages of SSRs are the labour-intensive development process, particularly
when this involves screening genomic DNA
libraries enriched for one or more repeat motifs (although SSR-enriched libraries can be
commercially purchased) and the high start-up costs for automated methods.

SNP

A single nucleotide polymorphism or SNP (pronounced 'snip') is an individual nucleotide
base difference between two DNA sequences. SNPs can be categorized according to
nucleotide substitution as either transitions (C/T or G/A) or transversions (C/G, A/T,
C/A or T/G). For example, sequenced DNA fragments from two different individuals,
AAGCCTA and AAGCTTA, contain a single nucleotide difference. In this case there are two
alleles: C and T. C/T transitions constitute 67% of the SNPs observed in humans, and
about the same rate was also found in plants (Edwards et al., 2007a). In practice,
single base variants in cDNA (mRNA) are considered to be SNPs, as are single base
insertions and deletions (indels) in the genome. As a nucleotide base is the smallest
unit of inheritance, SNPs provide the ultimate form of molecular marker.

For a variation to be considered a SNP, it must occur in at least 1% of the population.
SNPs make up about 90% of all human genetic variation and occur every 100–300 bases. Two
of every three SNPs involve the replacement of cytosine (C) with thymine (T). This is
supported by a genome-wide analysis in rice. A polymorphism database constructed to
define polymorphisms between cultivars Nipponbare (from subspecies japonica) and 93-11
(from subspecies indica) contains 1,703,176 SNPs and 479,406 indels (Shen et al., 2004),
which equates to approximately 1 SNP/268 bp in the rice genome. Using alignments of the
improved whole-genome shotgun sequences for japonica and indica rice, SNP frequencies
varied from 3 SNPs/kb in coding sequences to 27.6 SNPs/kb in the transposable elements
with a genome-wide measure of 15 SNPs/kb or 1 SNP/66 bp (Yu et al., 2005). Based on
partial genomic sequence information, SNP frequencies have been revealed in many crops,
including barley, soybean, sugarbeet, maize, cassava and potato; typical SNP frequencies
are also in the range of one SNP every 100–300 bp in plants (see Edwards et al., 2007a
for a review).

SNPs may fall within coding sequences of genes, non-coding regions of genes or in the
intergenic regions between genes at different frequencies in different chromosome
regions. In Arabidopsis the distribution of SNPs was found to be even across the five
chromosomes with the exception of centromeric regions which contain few transcribed
genes (Schmid et al., 2003). SNPs within a coding sequence will not necessarily change
the amino acid sequence of the protein that is produced, due to redundancy in the
genetic code. A SNP in which both forms lead to the same polypeptide sequence is termed
synonymous, while if a different polypeptide sequence is produced it is non-synonymous.
SNPs that are not in protein coding regions may still have consequences for gene
splicing, transcription factor binding or the sequence of non-coding RNA. Of the 3–17
million SNPs found in the human genome, 5% are expected to occur within genes.
Therefore, each gene may be expected to contain about 6 SNPs.

A variety of approaches have been adopted for discovery of novel SNPs in a wide range of
organisms including plants. These fall into three general categories (Edwards et al.,
2007b): (i) in vitro discovery, where new sequence data is generated; (ii) in silico
methods that rely on the analysis of available sequence data; and (iii) indirect
discovery, where the base sequence of the polymorphism remains unknown. On the other
hand, a large number of different SNP genotyping methods and chemistries have been
developed based on various methods of allelic discrimination and detection platforms. A
convenient method for detecting SNPs is RFLP (SNP-RFLP) or the CAPS marker technique. If
one allele contains a recognition site for a restriction enzyme while the other does
not, digestion of the two alleles will give rise to fragments of different length. A
simple procedure is to analyse the sequence data stored in the
major databases and identify SNPs. Four alleles can be identified when the complete base sequence of a segment of DNA is considered and these are represented by A, T, G and C at each SNP locus in that segment.

Sobrino et al. (2005) assigned the majority of SNP genotyping assays to one of four groups based on the molecular mechanisms: allele-specific hybridization, primer extension, oligonucleotide ligation and invasive cleavage. These four are described below. Chagné et al. (2007) added three methods to this list (sequencing, allele-specific PCR amplification and DNA conformation methods) and also generalized the enzymatic cleavage method to include the Invader assay as well as dCAPS and targeting induced local lesions in genomes (TILLING).

1. Allele-specific hybridization (ASH), also known as allele-specific oligonucleotide hybridization, is based on distinguishing by hybridization between two DNA targets differing at one nucleotide position (Wallace et al., 1979). Allelic discrimination can be achieved using two allele-specific probes labelled with a probe-specific fluorescent dye and a generic quencher that reduces fluorescence in the intact probe. During amplification of the sequence surrounding the SNP, probes complementary to the DNA target are cleaved by the 5' exonuclease activity of Taq polymerase. Spatial separation of the dye and quencher results in an increase in probe-specific fluorescence which can be detected with a plate reader.

Under optimized assay conditions, the SNP can be detected by the difference in Tm of the two probe–template hybrids as only the perfectly matched probe–target hybrids are stable and those with one-base mismatch are unstable. To increase the reliability of SNP genotyping the probes should be as short as possible. Originally, ASH used the dot blot format in which probes are hybridized to membrane-bound genomic DNA or PCR fragments. However, the more advanced PCR-based dynamic allele-specific hybridization (DASH) method uses a microtitre plate format (Howell et al., 1999). Since one of the PCR primers is biotinylated at the 5' end, the PCR products can be bound to streptavidin-coated wells and denatured under alkaline conditions. An oligonucleotide probe complementary to one allele is added to the single-strand target DNA molecules. The differences in melting curves are measured by slowly heating and observing the changes in fluorescence of a double-strand-specific, intercalating dye. The 5' nuclease or TaqMan assay, molecular beacon and the scorpion assays are all examples of ASH SNP genotyping technologies. Large-scale scanning of SNPs in a vast number of loci using allele-specific hybridization can be carried out on high-density oligonucleotide chips.

2. The Invader assay, also known as flap endonuclease discrimination, is based on the specificity of recognition and cleavage by a three-dimensional flap endonuclease which is formed when two overlapping oligonucleotides hybridize perfectly to a target DNA (Lyamichev et al., 1999). The cleaved fragment may be labelled with a probe-specific fluorescent dye which fluoresces following probe cleavage due to spatial separation from the quencher. Alternatively, the flap may act as the invader probe in a secondary reaction to amplify the fluorescent signal (Invader squared) (Hall et al., 2000). Third Wave Technologies Inc. (http://www.twt.com) has manufactured an Invader assay for flap endonuclease discrimination which can be carried out in solid phase using oligonucleotide-bound streptavidin-coated particles (Wilkins-Stevens et al., 2001).

3. Primer extension is a term used to describe mini-sequencing, single-base extension or the GOOD assay (Sauer et al., 2002). A popular method which was designed specifically for genotyping SNPs is the mini-sequencing technique (Syvänen, 1999; Syvänen et al., 1990). The method forms the basis of a number of methods for allelic discrimination. The robust detection of known mutations employs oligonucleotides which anneal immediately upstream of the query SNP and are then extended by a single dideoxynucleotide triphosphate (ddNTP) in cycle sequencing reactions. The fidelity of thermostable proof-reading DNA polymerases guarantees that only the complementary ddNTP is incorporated. Several
detection methods have been described for the discrimination of primer extension (PEX) products. Most popular is the use of ddNTP terminators that are labelled with different fluorescent dyes. The differentially dye-labelled PEX products can readily be detected on charge coupled device camera-based DNA sequencing instruments.

In the case of a single base extension (SBE), a primer is annealed adjacent to a SNP and extended to incorporate a ddNTP at the polymorphic site. SNaPshot (Applied Biosystems) uses differential fluorescent labelling of the four ddNTPs in an SBE reaction allowing fluorescent detection of the incorporated nucleotide. SNP-IT (Orchid Biosciences) is also based on fluorescent SBE and uses solid phase capture and detection of extension products. The GOOD assay involves extension of a primer modified near the 3' end with a charged tag to increase sensitivity to mass spectrometry detection.

Alternatives to SBE include pyrosequencing, allele-specific primer extension and the amplification refractory mutation system. Real-time monitoring of PEX relies on the bioluminometric detection of inorganic pyrophosphate released upon incorporation of dNTP (Ahmadian et al., 2000).

4. The oligonucleotide ligation assay (OLA) for SNP typing is based on the ability of ligase to covalently join two oligonucleotides when they hybridize next to one another on a DNA template (Landegren et al., 1988). Both primers must have perfect base pair complementarity at the ligation site which makes it possible to discriminate two alleles at a SNP site. The OLA has been modified to exploit a thermostable DNA ligase, interrogate PCR templates and utilize a dual-colour detection system. OLA also gave rise to another technique, Padlock probes (Nilsson et al., 1994), which uses oligonucleotide probes that ligate into circles upon target recognition and isothermal rolling-circle amplification. As reviewed by Chagné et al. (2007), there are several applications which have been developed to detect SNP variation using OLA, including colorimetric assays in ELISA plates, separation of the ligated oligonucleotides that have been labelled with a fluorescent dye on an automated sequencer and rolling-circle amplification with one of the ligation probes bound to a microarray surface.

DETECTION SYSTEMS. There are several detection methods for analysing the products of each type of allelic discrimination reaction: gel electrophoresis, fluorescence resonance energy transfer (FRET), fluorescence polarization, arrays or chips, luminescence, mass spectrometry, chromatography, etc. Fig. 2.5 summarizes the enzyme chemistry, demultiplexing and detection options in SNP genotyping.

Fluorescence is the most widely applied detection method currently employed for high-throughput genotyping in general. The use of fluorescence has been teamed with a number of different detection systems including plate readers, capillary electrophoresis and DNA arrays. In addition to fluorescence detection, mass spectrometry and light detection represent novel applications of established technology for high-throughput genotyping of SNPs.

PLATE READERS. There are many fluorescent plate readers capable of detecting fluorescence in a 96- or 384-well format (Jenkins and Gibson, 2002). Most models use a light source and narrow band-pass filters to select the excitation and emission wavelengths and enable semi-quantitative steady state fluorescence intensity readings to be made. This technology has been applied to genotyping with TaqMan, Invader and rolling-circle amplification. Fluorescence plate readers are also available which allow measurement of additional fluorescence parameters including polarization, lifetime and time-resolved fluorescence and FRET.

DNA ARRAY. Oligonucleotide arrays bound to a solid support have been proposed as the future detection platform for high-throughput genotyping. Two distinct approaches have been adopted involving ASH whereby the oligonucleotide directly probes the target and tag arrays that capture solution phase reaction products via hybridization to their anti-tag sequences.
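
The tag-array idea can be illustrated with a minimal sketch, offered under stated assumptions rather than as any vendor's method: each solution-phase product carries a tag, the array carries the complementary anti-tag, and a product is assigned to the spot whose anti-tag is the reverse complement of its tag. The tag sequences, SNP labels and function names below are hypothetical.

```python
# Minimal sketch of tag/anti-tag sorting. All sequences and names are
# hypothetical; real tag sets are designed to minimize cross-hybridization.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def build_array(tags):
    """Map each anti-tag (the sequence spotted on the array) to its SNP label."""
    return {reverse_complement(tag): snp for snp, tag in tags.items()}


def sort_products(products, array):
    """Assign each (tag, allele) product to the spot that complements its tag."""
    calls = {}
    for tag, allele in products:
        snp = array.get(reverse_complement(tag))
        if snp is not None:  # products with unknown tags are ignored
            calls[snp] = allele
    return calls


tags = {"snp_A": "ATCGGCTAAG", "snp_B": "GGTACATTCA"}
array = build_array(tags)
products = [("ATCGGCTAAG", "G"), ("GGTACATTCA", "T")]
calls = sort_products(products, array)
print(calls)
```

Because the array depends only on the tags, the same spotted array could in principle be reused for any set of SNPs, which is the sense in which tag arrays are generic.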

[Figure: diagram linking four columns headed 'Enzyme chemistry', 'Demultiplexing', 'Detection method' and 'Platform/company'. Platforms shown include Illumina BeadArray, Luminex 100 flow cytometry, Sequenom iPlex, ABI SNPlex, microarray minisequencing, ABI TaqMan 5'-nuclease, ABI SNaPshot, DASH, amplicon Tm and Perkin-Elmer FP-TDI.]

Fig. 2.5. Chemistry, demultiplexing, detection options in SNP genotyping. From Syvänen (2001) reprinted by permission from Macmillan Publishers Ltd.
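
For the allele-specific hybridization chemistry in Fig. 2.5, a rough feel for probe melting temperatures can be had from the rule of thumb Tm ≈ 2 °C × (A + T) + 4 °C × (G + C), often called the Wallace rule for short oligonucleotides. The rule is not given in this chapter and is only an approximation; the flanking sequences and function name below are hypothetical. The sketch shows that the two allele-specific probes for an A/G SNP differ slightly in estimated Tm, which matters when a single hybridization temperature has to serve both.

```python
# Rough Tm estimate for short probes using the 2 + 4 rule of thumb
# (an assumption brought in for illustration; not stated in the text).
def wallace_tm(probe):
    """Estimate Tm in degrees C as 2*(A+T) + 4*(G+C)."""
    probe = probe.upper()
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2 * at + 4 * gc


# Hypothetical flanking sequence around an A/G SNP.
left_flank, right_flank = "GCTAGCTA", "GTCCGATC"
probe_a = left_flank + "A" + right_flank
probe_g = left_flank + "G" + right_flank

tm_a = wallace_tm(probe_a)
tm_g = wallace_tm(probe_g)
print(probe_a, tm_a)
print(probe_g, tm_g)
```

The rule says nothing about how much a single mismatch destabilizes a hybrid; it only estimates the Tm of each perfectly matched probe.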

The Affymetrix Genome-Wide Human SNP Array 6.0 features more than 1.8 million markers for genetic variation, including more than 906,600 SNPs and more than 946,000 probes for the detection of copy number variation. The SNP Array 6.0 enables high-performance, high-powered and low-cost genotyping (http://www.affymetrix.com). Luminex has developed a panel of 100 bead sets with unique fluorescent labels, identifiable by flow analyser. The bead sets can be derivatized with allele-specific oligonucleotides to create a bead-based array for multiplex genotyping by ASH.

Tag arrays are generic assemblies of oligonucleotides that are used to sort or deconvolute mixtures of oligos by hybridization to the anti-tag sequences. The current Affymetrix GeneChip Universal Tag Arrays are available in 3, 5, 10 or 25 K configurations and contain novel, bioinformatically designed tag sequences that result in minimal potential for cross-hybridization.

MASS SPECTROMETRY. Many genotyping techniques involve the allele-specific incorporation of two alternative nucleotides into an oligonucleotide probe. Due to the inherent molecular weight difference of DNA bases, mass spectrometry can be used to determine which variant nucleotide has been incorporated by measuring the mass of the extended primers and this approach has been applied primarily to genotyping by primer extension using the MALDI-TOF (matrix assisted laser desorption/ionization-time of flight) mass spectrometry approach. The MALDI-TOF method is particularly advantageous for detection of PEX products in multiplex.

The polyanionic nature of oligonucleotides results in low signal to noise ratios, particularly for longer (> 40 mer) fragments. This has been addressed by specifically cleaving long probes by acidolysis of P3'-N5' phosphoramidate bonds and by a combined approach whereby the probe is digested to a very short fragment which has been derivatized to lower its charge to a single positive or negative charge.
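
The principle of calling a base from the mass of an extended primer can be sketched as follows. This is an illustration only: the mass increments are approximate values assumed for unmodified dideoxynucleotide additions, the primer mass and tolerance are hypothetical, and real assays often use mass-modified terminators and calibrated spectra.

```python
# Sketch of base calling by mass shift after single-base extension.
# Approximate mass added by each incorporated ddNMP, in Da (assumed values).
DDNMP_MASS = {"C": 273.2, "T": 288.2, "A": 297.2, "G": 313.2}


def call_base(primer_mass, observed_mass, tolerance=3.0):
    """Return the base whose expected mass shift is closest to the observed
    shift, or None if no base falls within the tolerance (in Da)."""
    shift = observed_mass - primer_mass
    base, expected = min(DDNMP_MASS.items(), key=lambda item: abs(item[1] - shift))
    if abs(expected - shift) > tolerance:
        return None
    return base


primer_mass = 5000.0  # hypothetical unextended primer mass in Da
print(call_base(primer_mass, 5297.5))
print(call_base(primer_mass, 5313.0))
print(call_base(primer_mass, 5350.0))
```

With these assumed increments the closest pair of bases (A and T) differs by only about 9 Da, which illustrates why mass accuracy and short, low-charge fragments matter for this detection method.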

LIGHT DETECTION. Pyrosequencing involves hybridization of a sequencing primer to a single stranded template and sequential addition of individual dNTPs. Incorporation of a dNTP into a primer releases pyrophosphate which triggers a luciferase-catalysed reaction. The genotype of a SNP is determined by the sequential addition (and degradation) of nucleotides. The light produced is detected by a charge coupled device camera and each light signal is proportional to the number of nucleotides incorporated (http://www.pyrosequencing.com), for which reason pyrosequencing is suitable for the quantitative estimation of allele frequencies in pooled DNA samples. Furthermore, pyrosequencing proved to be an appropriate method for genotyping SNPs in polyploid plant genomes such as potato because all possible allelic states of binary SNP could be accurately distinguished (Rickert et al., 2002).

There are various SNP detection systems which differ in their chemistry, detection platform, multiplex level and application; some of these will be discussed below. The reader is also referred to Bagge and Lübberstedt (2008) for further information.

The TaqMan SNP Genotyping Assay (Applied Biosystems, Foster City, USA) is a single-tube PCR assay that exploits the 5' exonuclease activity of AmpliTaq Gold DNA polymerase. The assay kit includes two locus-specific PCR primers that flank the SNP of interest and two allele-specific oligonucleotide TaqMan probes. These probes have a fluorescent reporter dye at the 5' end and a non-fluorescent quencher with a minor groove binder at the 3' end. Upon cleavage by the 5' exonuclease activity of Taq polymerase during PCR, the reporter dye will fluoresce as it is no longer quenched and the intensity of the emitted light can be measured. Modified probes such as locked nucleic acids, a modified nucleic acid analogue, showed better hybridization properties than standard TaqMan probes (Kennedy et al., 2006). TaqMan is a simple assay, since all the reagents are added to the microtitre well at the same time in a 96- or 384-well format. Although the assay can be carried out at the monoplex or duplex level, the multiple steps can be assembled and automated so that one laboratory technician can produce 10,000 data points per day. The TaqMan platform is highly suitable for genetically modified organism tests and MAS using a few markers for a large number of samples.

The SNaPshot Multiplex Assay (Applied Biosystems, Foster City, USA) is based on mini-sequencing, i.e. a single-base extension using fluorescent labelled ddNTPs. The system's multiplex-ready reaction mix enables robust multiplex SNP interrogation of PCR-generated templates. Multiplexing can be accomplished by representing multiple SNP products spatially. This is achieved by tailing the 5' end of the unlabelled SNaPshot primers with different lengths of non-complementary oligonucleotide sequences that serve as mobility modifiers. The reactions may be carried out in 5- to 10-plex using capillary electrophoresis for data detection in a 96-well format so that one individual can generate over 10,000 data points per day. SNaPshot is suitable for MAS of several traits simultaneously and if multiple sets of 10-plex are combined, it can be used for rough mapping and marker-assisted backcrossing with several hundreds of samples and markers involved.

The SNPlex Genotyping System (Applied Biosystems, Foster City, USA) uses OLA/PCR technology for allelic discrimination and ligation product amplification. Genotype information is then encoded into a universal set of dye-labelled, mobility modified fragments known as Zipchute Mobility Modifiers, for rapid detection by capillary electrophoresis. The same set of Zipchute Mobility Modifiers can be used for every SNPlex pool regardless of which SNPs are chosen. The SNPlex System allows for multiplexed genotyping of up to 48 SNPs simultaneously against a single sample with the ability to detect up to 4500 SNPs in parallel in 15 min. This integrated system delivers cost-efficient, medium- to high-throughput genotyping and is suitable for various genetic and breeding applications including fingerprinting, gene mapping and MAS for both foreground and
background. Both SNaPshot and SNPlex can be used with capillary electrophoresis systems as the genotyping platform which can also be used for SSR genotyping.

MassARRAY iPLEX Gold (SEQUENOM, San Diego, USA) combines the benefits of the simple and robust single-base primer extension biochemistry with the sensitivity and accuracy of MALDI-TOF mass spectrometry (see Chapter 3) detection. It uses a single termination mix and universal reaction conditions for all SNPs. The primer is extended, dependent upon the template sequence, resulting in an allele-specific difference in mass between extension products. The assays can be multiplexed up to 40 SNPs in a 384-well format allowing for throughput levels of up to 150,000 genotypes per instrument per day. MassARRAY is flexible and suitable for generating both small and large marker numbers for each sample so that it can be used for a variety of genetic and breeding purposes.

There are two major chip-based high-throughput genotyping systems, DNA microarrays developed by Affymetrix (Santa Clara, USA) and a high-density biochip assay by Illumina Inc. (San Diego, USA), both of which offer different levels of multiplexing, up to several thousand plexes or more (Yan et al., 2009). As an increasing number of sets of these chips become available, outsourced genotyping through companies or service centres becomes one of the options for genotyping large numbers of samples using the same set of markers (e.g. fingerprinting) to achieve high efficiency and low cost per data point.

THE FUTURE OF SNP TECHNOLOGY. A key technical obstacle in the development of microarray-based methods for genome-wide SNP genotyping is the PCR amplification step which is required to reduce the complexity and improve the sensitivity of genotyping SNPs in large, diploid genomes. The level of complexity that can be achieved in PCR does not match that of current microarray-based methods thus making PCR the limiting step in these assays (Syvänen, 2005). Highly multiplexed microarray systems have recently been developed by combining well-known reaction principles for DNA amplification and SNP genotyping.

Identification of a specific single-base change among up to billions of bases that constitute a plant species is a challenging task. PCR offers a means of reducing the complexity of a genome and increasing the copy number of the DNA templates to the levels required for the specific and sensitive detection of single-base changes. However, the design of robust PCR assays with multiplexing levels exceeding 10–20 amplicons has proven to be more difficult than initially anticipated because in multiplex PCR the number of undesired interactions between the PCR primers increases exponentially as the number of primers included in the reaction mixture increases. This interaction usually results in preferential amplification of unwanted primer–dimer artefacts instead of the intended DNA templates (amplicons). Another problem in multiplex PCR is the sequence-dependent differences in PCR efficiency between the amplicons. The problems of multiplexing can be reduced to some extent by using PCR primers that are as similar to one another as possible.

The multiplexing level that can be readily achieved in standard PCRs is less than that offered by current technology for producing high-density DNA microarrays. Simultaneous analysis of a reasonable amount of genomic DNA with the current detection sensitivity of microarray scanners requires an amplification step. The PCR step complicates the molecular reactions underlying the assays and introduces multiple laboratory steps into the procedures and is therefore the chief obstacle to highly multiplexed SNP genotyping.

Diversity array technology

Diversity array technology (DArT) is a novel type of DNA marker which employs a microarray hybridization-based technique developed by CAMBIA (http://www.diversityarrays.com) that enables the simultaneous genotyping of several hundred polymorphic loci spread over the genome (Jaccoud et al., 2001; Wenzl et al., 2004). DArT can be
used to construct medium-density genetic linkage maps in species of various genome sizes. Two steps are involved: generating the array and genotyping the sample. For each sample, representative genomic DNA is prepared by restriction enzyme digestion followed by ligation of the restriction fragments to adaptors. The genome complexity is then reduced by PCR primers with complementary sequences to the adaptor and selective overhangs. Restriction generated fragments representing the diversity of the gene pool are cloned. The outcome is known as a representation (typically 0.1–10% of the genome). Polymorphic clones in the library are identified by arraying inserts from a random set of clones. Cloned inserts are amplified using vector-specific primers, purified and arrayed on to a solid support (microarray) (Fig. 2.6A). To genotype a sample, the representation (DNA) of the sample is fluorescently labelled and hybridized against the discovery array. The array is then scanned and the hybridization signal is measured for each array spot. By using multiple labels, a representation from one sample is contrasted with that from another or with a control probe (Jaccoud et al., 2001; http://www.cambia.org; http://www.diversityarrays.com). Polymorphic clones (DArT markers) show variable hybridization signal intensities for different individuals. These clones are subsequently assembled into a genotyping array for routine genotyping (Fig. 2.6B).

[Figure: two flow diagrams.
(A) DNAs of interest (Gx, Gy, ..., Gn) -> pool genomes -> use complexity reduction method, e.g. RE digest, adaptor ligation, PCR amplification -> clone fragments from the representation (library) -> pick individual clones and PCR amplify -> purified PCR products are arrayed.
(B) Choose two genomes to analyse (Gx, Gy) -> same complexity reduction as used to make the diversity panel (cut, ligate adaptors and PCR amplify) -> label one genomic subset red and the other green -> hybridize to chip.]

Fig. 2.6. Procedure of diversity array technology (DArT). (A) Preparing the array. RE, restriction enzyme. (B) Genotyping a sample.
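
The scoring step in Fig. 2.6B can be sketched in a simplified form: each clone is scored as present (1) or absent (0) in an individual from its hybridization signal relative to a reference, and a clone counts as a candidate marker only if both states occur among the individuals. This is a hedged illustration, not the DArTsoft procedure; the cutoff, the intensity values and all names are hypothetical.

```python
# Simplified present/absent scoring of array spots (illustrative only).
def score_clone(signal, reference, present_cutoff=0.5):
    """Return 1 (present) if signal/reference reaches the cutoff, else 0."""
    return 1 if signal / reference >= present_cutoff else 0


def polymorphic_clones(intensities, reference, present_cutoff=0.5):
    """Keep clones whose scores differ among individuals.

    intensities: {clone: {individual: signal}}
    reference:   {clone: reference signal for that spot}
    """
    markers = {}
    for clone, per_individual in intensities.items():
        scores = {
            individual: score_clone(signal, reference[clone], present_cutoff)
            for individual, signal in per_individual.items()
        }
        if len(set(scores.values())) > 1:  # both 0 and 1 observed
            markers[clone] = scores
    return markers


intensities = {
    "clone1": {"Gx": 980.0, "Gy": 120.0},
    "clone2": {"Gx": 900.0, "Gy": 870.0},
}
reference = {"clone1": 1000.0, "clone2": 1000.0}
markers = polymorphic_clones(intensities, reference)
print(markers)
```

A single fixed cutoff is the crudest possible choice; it is used here only to make the dominant (present versus absent) scoring concrete.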

DArT markers are biallelic and behave in a dominant (present versus absent) or co-dominant (two doses versus one dose versus absent) manner. DArT detects single-base changes as well as indels. It is a good alternative to currently used techniques including RFLP, AFLP, SSR and SNP in terms of cost and speed of marker discovery and analysis for whole-genome fingerprinting. It is a cost-effective, sequence-independent, non-gel based technology that is amenable to high-throughput automation and the discovery of hundreds of high quality markers in a single assay. An open source software package, DArTsoft, is available for automatic data extraction and analysis. The weaknesses of this technology include marker dominance and its technically demanding nature. Also there is some concern as to whether DArT markers are randomly distributed across the whole genome, as DArT markers in barley appear to have a moderate tendency to be located in hypomethylated, gene-rich regions in distal chromosome areas (Wenzl et al., 2006).

DArT technology has been successfully developed for Arabidopsis, cassava, barley, rice, wheat, sorghum, ryegrass, tomato and pigeon pea, while work is in progress to establish DArT in chickpea, sugarcane, lupins, quinoa, banana and coconut (http://www.diversityarrays.com). For example, a genetic map with 385 unique DArT markers spanning the 1137 cM barley genome (Wenzl et al., 2004) was constructed, DArT markers along with AFLP and SSR markers were mapped on the wheat genome (Semagn et al., 2006), and a cassava DArT genotyping array containing approximately 1000 polymorphic clones (Xia, L. et al., 2005) is now available.

Genic and functional markers

DNA markers can be classified into random markers (RMs) (also known as anonymous or neutral markers), gene targeted markers (GTMs) (also known as candidate gene markers) and functional markers (FMs) (Andersen and Lübberstedt, 2003). RMs are derived at random from polymorphic sites across the genome whereas GTMs are derived from polymorphisms within genes. FMs are derived from polymorphic sites within genes that are causally associated with phenotypic trait variation and are superior to RMs as a result of their complete linkage with trait locus alleles and functional motifs (Andersen and Lübberstedt, 2003). The major drawback of the RMs is that their predictive value depends on the known linkage phase between marker and target locus alleles (Lübberstedt et al., 1998b).

Genetic diversity at or below the species level has mostly been characterized by molecular markers that more or less randomly sampled genetic variation in the genome. RM is a very effective tool among others for the establishment of a breeding system, the study of gene flow among natural populations, and the determination of the genetic structure of GeneBank collections (Chapter 5; Xu et al., 2005). RM systems are still the systems of choice for marker-assisted breeding (Xu, Y., 2003). However, users of biodiversity are often not interested in random variation but rather in variation that might affect the evolutionary potential of a species or the performance of an individual genotype. Such functional variation can be tagged with neutral molecular markers using quantitative trait loci (QTL) and linkage disequilibrium mapping approaches. Alternatively, DNA-profiling techniques may be used that specifically target genetic variation in functional parts of the genome.

GENIC MARKERS. A wealth of DNA sequence information from many fully characterized genes and full-length cDNA clones has been generated and deposited in online databases for an increasing number of plant species and the sequence data for ESTs, genes and cDNA clones can be downloaded from GenBank and scanned for identification of SSRs. Subsequently, locus-specific primers flanking EST- or genic SSRs can be designed to amplify the microsatellite loci present in the genes. In maize, for example, SSR markers that have been developed from genes, together with their primer sequences, are available at www.maizeGDB.org. Genic SSRs have some intrinsic advantages over
genomic SSRs because they can be obtained quickly by electronic sorting, are present in expressed regions of the genome and are expected to be transferable across species (when the primers are designed from more conserved coding regions; Varshney et al., 2005a). The potential use of EST-SSRs developed for barley and wheat has been demonstrated for comparative mapping in wheat, rye and rice (Yu et al., 2004; Varshney et al., 2005a). These studies suggested that EST-SSR markers could be used in related species for which little information is available on SSRs or ESTs. In addition, the genic SSRs are good candidates for the development of conserved orthologous markers for the genetics and breeding of different species. For example, a set of 12 barley EST-SSRs was identified that showed significant homology with the ESTs of four monocotyledonous species (wheat, maize, sorghum and rice) and two dicotyledonous species (Arabidopsis and Medicago) which could potentially be used across these species (Varshney et al., 2005a).

Kumpatla and Mukopadhyay (2005) examined the abundance of SSR in more than 1.54 million ESTs belonging to 55 dicotyledonous species. They found that the frequency of ESTs containing SSR among species ranged from 2.65 to 16.82%, with dinucleotide repeats being most abundant followed by tri- or mononucleotide repeats, thus demonstrating the potential of in silico mining of ESTs for the rapid development of SSR markers for genetic analysis and application to dicotyledonous crops. EST-SSRs produce high quality markers, but these are often less polymorphic than genomic SSRs (Cho et al., 2000; Eujayl et al., 2002; Thiel et al., 2003). EST resources are also being used to mine SNPs (Picoult-Newberg et al., 1999; Kota et al., 2003). ESTs provide a quantitative method of measuring specific transcripts within a cDNA library and represent a powerful tool for gene discovery, gene expression, gene mapping and the generation of gene profiles. The National Center for Biotechnology Information (NCBI) database, dbEST 0900409 (http://www.ncbi.nlm.nih.gov/dbEST_summary.html) contains the largest collection of ESTs in rice, wheat, barley, maize, soybean, sorghum and potato.

Novel markers can be developed from the transcriptome and specific genes. As summarized by Gupta and Rustgi (2004), these include EST polymorphisms (developed using EST databases); conserved orthologue set markers (developed by comparing the sequences of target genomes with sequences of the closely related species); amplified consensus genetic markers (based on the known genes from model species); gene-specific tags (with primers designed using gene sequences); resistance gene analogues (with primers designed to identify consensus domains conferring resistance); exon–retrotransposon amplification polymorphism (with primers designed to combine with a long terminal repeat retrotransposon-specific primer or a randomly selected microsatellite-containing oligonucleotide); and PCR-based markers targeting exons, introns and promoter regions of known genes with high specificity.

Target region amplification polymorphism (TRAP) markers are derived from a rapid and efficient PCR-based technique which uses bioinformatics tools and EST database information to generate polymorphic markers around targeted candidate gene sequences (Hu and Vick, 2003). This TRAP technique uses two primers of 18 nucleotides to generate markers. TRAP markers are amplified by one fixed primer designed from a target EST sequence in the database and a second primer of arbitrary sequence except for AT- or GC-rich cores that anneal with introns and exons, respectively. The TRAP technique should be useful in genotyping germplasm collections and in tagging genes with beneficial traits in crop plants.

FUNCTIONAL MARKERS. Functional markers (FMs) are derived from polymorphic sites within genes causally affecting phenotypic variation. The development of FMs requires allele-specific sequences of functionally characterized genes from which polymorphic, functional motifs affecting plant phenotype can be identified. Some theoretical and application issues relevant to functional markers in wheat have been addressed (Bagge et al., 2007; Bagge and Lübberstedt, 2008).
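
The in silico mining of ESTs for SSRs discussed earlier under genic markers can be sketched with a regular-expression scan for perfect mono-, di- and trinucleotide repeats. This is a minimal illustration, not the pipeline used in the cited studies: the minimum repeat numbers and the example sequence are hypothetical, and dedicated SSR-mining tools handle compound and imperfect repeats that this sketch ignores.

```python
import re

# Illustrative thresholds: unit length -> minimum number of repeats.
DEFAULT_MIN_REPEATS = {1: 10, 2: 6, 3: 5}


def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    if min_repeats is None:
        min_repeats = DEFAULT_MIN_REPEATS
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in min_repeats.items():
        # Capture one repeat unit, then require it to recur via a backreference.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(2)
            if unit_len > 1 and len(set(motif)) == 1:
                continue  # a homopolymer run is not a di- or trinucleotide SSR
            hits.append((match.start(), motif, len(match.group(1)) // unit_len))
    return sorted(hits)


# Hypothetical EST fragment containing (AG)8 and (CTT)5.
est = "GGCT" + "AG" * 8 + "TTCA" + "CTT" * 5 + "GA"
ssrs = find_ssrs(est)
print(ssrs)
```

Primers would then be designed in the sequence flanking each reported start position, as described for EST- and genic SSRs.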

FM development requires allele response to gibberellin and consequently

sequences of functionally characterized lead to decreased plant height. Thus, bial-
genes from which polymorphic, functional lelic (gibberellin sensitive and insensitive)
motifs affecting plant phenotype can be iden- FMs can be derived for targeted and rapid
tified. In contrast to RMs, FMs can be used as cultivar breeding aimed at increasing lodg-
markers in populations without prior map- ing tolerance.
ping, in mapped populations without risk In this section, several widely used DNA
of information loss owing to recombination markers have been discussed along with
and to better represent the genetic variation an overview of classical genetic markers.
in natural or breeding populations. Once DNA markers have gained wide acceptance
genetic effects have been assigned to func- because of their genome-wide coverage and
tional sequence motifs, FMs derived from increasingly simple and easy genotyping. It
such motifs can be used to fix gene alleles can be expected that SNP markers, as the
(defined by one or several FM alleles) in ultimate form of genetic polymorphism, will
several genetic backgrounds without addi- largely replace other types of markers when
tional calibration. This would be a major whole DNA sequences become available for
advance in the application of markers, paran increasing number of plant species (e.g.
ticularly in plant breeding, for the selection Lu et al., 2009; Xu et al., 2009b). However,
of parental materials to produce segregating populations for example, as well as the subsequent selection of inbred lines (Andersen and Lübberstedt, 2003). Depending on the mode of FM characterization, they can also be used for the combination of target alleles in hybrid and synthetic breeding and cultivar testing based on the presence or absence of specific alleles at morphological trait loci. In population breeding and recurrent selection programmes, FMs can be used to avoid genetic drift at characterized loci.

A typical example is Dwarf8 in maize which encodes a gibberellin response modulator from which FMs can be developed for plant height and flowering time. For example, nine sequence motifs in the Dwarf8 gene of maize were shown to be associated with variation in flowering time and one particular 6-bp deletion accounted for 7–11 days difference in flowering time between inbreds (Thornsberry et al., 2001). Since Dwarf8 is a pleiotropic gene (also affecting plant height), FMs from additional flowering time genes should be identified in addition to using the Dwarf8-derived FM. Orthologues to Dwarf8 have been identified in wheat (Rht1) (Peng et al., 1999), rice (SLR1) (Ikeda et al., 2001), and barley (sln1) (Chandler et al., 2002), and such genes have been bred into the high-yielding wheat and rice cultivars of the Green Revolution (Hedden, 2003). Altered function of alleles in these orthologous genes can reduce the

the choice of DNA markers in genetics and breeding is still highly dependent on the accessibility of geneticists and breeders to various genetic resources including the availability of DNA markers and the time and cost involved. Table 2.2 compares the five most widely used DNA markers.

2.2 Molecular Maps

The order and relative distance of genetic features that are associated with genetic variation or polymorphisms can be determined by genetic mapping. Genetic maps constructed using molecular markers can also be used to locate major genes which can then also be used as genetic markers.

2.2.1 Chromosome theory and linkage

During meiosis the parental diploid (2n) cell divides to produce four haploid (n) gametes. During the first meiotic division, the homologous chromosomes align and stick together in a process called synapsis. This allows spindle fibres to attach to the synapsed homologues (tetrads) and to move them as a group to the equator of the cell. As anaphase begins, the homologues can then be oriented such that they are pulled apart to opposite poles of the cell. Following
Table 2.2. Comparison of the five widely used DNA markers in plants.

| Feature | RFLP | RAPD | AFLP | SSR | SNP |
|---|---|---|---|---|---|
| Genomic coverage | Low copy coding region | Whole genome | Whole genome | Whole genome | Whole genome |
| Amount of DNA required | 50–10 µg | 1–100 ng | 1–100 ng | 50–120 ng | 50 ng |
| Quality of DNA required | High | Low | High | Medium to high | High |
| Type of polymorphism | Single base changes, indels | Single base changes, indels | Single base changes, indels | Changes in length of repeats | Single base changes, indels |
| Level of polymorphism | Medium | High | High | High | High |
| Effective multiplex ratio | Low | Medium | High | High | Medium to high |
| Inheritance | Co-dominant | Dominant | Dominant/co-dominant | Co-dominant | Co-dominant |
| Type of probes/primers | Low copy DNA or cDNA clones | Usually 10 bp random nucleotides | Specific sequence | Specific sequence | Allele-specific PCR primers |
| Technically demanding | High | Low | Medium | Low | High |
| Radioactive detection | Usually yes | No | Usually yes | Usually no | No |
| Reproducibility | High | Low to medium | High | High | High |
| Time demanding | High | Low | Medium | Low | Low |
| Automation | Low | Medium | High | High | High |
| Development/start-up cost | High | Low | Medium | High | High |
| Proprietary rights required | No | Yes and licensed | Yes and licensed | Yes and some licensed | Yes and some licensed |
| Suitable utility in diversity, genetics and breeding | Genetics | Diversity | Diversity and genetics | All purposes | All purposes |

telophase and cytokinesis, two new daughter cells are formed. Each of these daughter cells has half the chromosomes (n) of the parental cell (2n). The second meiotic division closely resembles mitosis with each of the nuclei generated during the first meiotic division splitting to form two more nuclei. Thus, four haploid gametes are produced.

Crossing over is the process by which homologous chromosomes exchange portions of their chromatids during meiosis, resulting in new combinations of genetic information and thus affecting inheritance and increasing genetic diversity. Genes that are present together on the same chromosome tend to be inherited together and are referred to as linked. Genes that are normally linked may be inherited independently during crossing over.

The proportion of recombinant gametes depends on the rate of crossover during meiosis and is known as the recombination frequency (r). The maximum proportion of recombinant gametes is 50% and in this case crossover between two genetic loci has occurred in all the cells. This is equivalent to the case of non-linked genes, i.e. the two loci are inherited independently.

The recombination frequency depends on the rate of crossovers which in turn depends on the linear distance between two genetic loci. Recombination frequencies range from 0 (complete linkage) to 0.5 (complete independent inheritance).

2.2.2 Genetic linkage mapping

In order to utilize the genetic information provided by molecular markers more efficiently, it is important to know the locations and relative positions of molecular markers on chromosomes. The construction of genetic linkage maps using molecular markers is based on the same principles as those used in the preparation of classical genetic maps: selection of molecular markers and genotyping system; selection of parental lines from the germplasm collection that are highly polymorphic at marker loci; development of a population or its derived lines with an increased number of molecular markers in the segregated population; genotyping each individual/line using molecular markers; and constructing linkage maps from the marker data.

The recombination frequency between two linked genetic markers can be defined in units of genetic distance known as centiMorgans (cM) or map units. If two markers are found to be separated in one of 100 progeny, those two markers are 1 cM apart. However, 1 cM does not always correspond to the same length of physical distance or the same amount of DNA. The amount of DNA per cM is referred to as the physical to genetic distance. Areas in the genome where recombination is frequent are known as recombination hot spots; there is relatively little DNA per cM in these hot spots and it can be as low as 200 kb/cM. In other areas recombination may be suppressed and 1 cM will represent more DNA and in some regions the physical to genetic distance can be up to 1500 kb/cM.

Developing mapping populations

In population development, several factors should be taken into consideration including the selection of parental lines and population types and the determination of population size.

CHOICE OF PARENTAL LINES. Four factors should be considered in selecting appropriate parental lines (Xu and Zhu, 1994):

1. DNA polymorphism: genetic polymorphism between parental lines usually depends on how closely related they are, which can be determined by criteria such as geographical distribution and morphological and isozyme polymorphisms. In general, DNA polymorphism is greater in open-pollinated species than in self-pollinated species. For example, RFLP polymorphism is very high among maize lines so that a population derived from any two inbred lines would be desirable for RFLP mapping. Genetic polymorphism is very low in tomato so that only interspecific populations are sufficiently polymorphic to allow for RFLP

mapping. The level of polymorphism in rice is intermediate. In plant breeding, many novel traits have been transferred from wild species to cultivated species and such wild-cultivated crosses usually show high levels of DNA polymorphism. Several mapping populations may be needed because genetic polymorphisms that cannot be found in one population may be identified in another.

2. Purity: in the case of self-pollinated plants, the parental lines to be used for development of mapping populations should be breeding-true, i.e. homozygous at almost all of the genetic loci. Purification through further inbreeding may be needed before hybridization is carried out. Breeding-true inbred lines can be used as parents in cross-pollinated plants. For plants for which true-breeding is not possible, genetic mapping can be based on the populations derived from two heterogeneous parental lines.

3. Fertility: hybrid fertility determines whether a large segregating population can be obtained. Distant crosses are usually accompanied by abnormal chromosome pairing and recombination, segregation distortion and reduced recombination frequencies. Some distant hybrids may be partially or completely sterile so that it becomes difficult to obtain a segregating population. In this case, backcrossing populations can be used for mapping as partially sterile hybrids can be rescued by backcrossing to one of the parents.

4. Cytological features: cytological examination may be necessary in order to exclude individuals containing translocations and polyploid species containing monosomes or partial chromosomes from being used as mapping parents.

CHOICE OF POPULATION TYPES. There are many types of populations that can be used for genetic mapping. Figure 2.7 shows the relationship between populations derived from two or multiple parental lines. Some of these populations are discussed in detail in Chapter 4. Their use in genetic mapping is discussed below.

[Figure 2.7: diagram not reproduced. It links the parental lines (P1, P2, P3, P4, ...) through the F1 to the derived populations F2, F3, BC1, DH, RIL, BIL/NIL, IM, F2-IM, TC, TTC, DH-TC and RIL-TC.]

Fig. 2.7. Examples of mapping populations and their relationship. Modified from Xu and Zhu (1994). AC, anther culture; BC, backcross population; BIL, backcross inbred line; DH, double haploid; IM, intermating; NIL, near-isogenic line; RIL, recombinant inbred line; TC, testcross; TTC, triple testcross.

F2 populations are used most frequently in linkage mapping because they are easy to develop. At each marker locus, however, 50% of the individuals in an F2 population are heterozygous. For dominant markers, dominant homozygotes cannot be distinguished from the heterozygotes and the accuracy of mapping is therefore reduced. In order to improve the accuracy, more F2 individuals will be needed unless co-dominant markers can be used. Another disadvantage of F2 populations is that their genetic constitution will change during sexual reproduction so that their genetic structure is difficult to maintain. Vegetative reproduction is one method of prolonging the life of a population as exemplified by ratooning in some grass species. Tissue culture (see Chapters 4 and 12) is another method that can be used to regenerate a population without changing its constitution. Using bulked DNA from F3 families, which are derived from F2 individuals, is an alternative approach to prolonging the life of a population because in some crops such as rice and maize, one F2 plant produces a large number of seeds which are sufficient for multiple plantings. By random mating within each F3 family, an F3 population can also be maintained.

Backcross populations (e.g. BC1) are also frequently used in genetic mapping. BC1 populations have only two genotypes at each marker locus which represent the corresponding gametes produced in the F1 hybrid, an advantage over F2 populations.

If reciprocal BC1 populations, A × (A × B) and (A × B) × A, are obtained from a cross by using the F1 hybrid as male and female parents, respectively, the difference in recombination frequencies between male and female gametes can be compared and the former indicates the recombination frequency of male gametes while the latter indicates that for female gametes. Like F2 populations, the genetic constitution of BC populations will change with selfing and they need to be conserved in the same way as F2 populations. For many crop species, false hybrids may pose a problem which contributes to the inaccuracy of genetic mapping. When distant crosses are used however, backcross populations are the only populations that can be developed because of high sterility among the F1 hybrids.

Permanent populations such as doubled haploid (DHs), recombinant inbred lines (RILs) and backcross inbred lines (BILs), which are fully discussed in Xu and Zhu (1994) and Chapter 4, provide a continuous supply of genetic material leading to the accumulation of genetic information produced in different laboratories and experiments. For major crops, there are many permanent populations available that are shared internationally with the continuous accumulation of genetic marker and phenotypic data. During population development careful attention should be paid to selection factors that could affect the segregation patterns (Xu et al., 1997; Chapter 4). In some cases, distorted segregation could become very severe if selection pressure is high.

POPULATION SIZE. Achieving the maximum resolution and accuracy from genetic maps largely depends on the size of the mapping population: the larger the mapping population, the greater the accuracy of the genetic map. The research objectives dictate the size of the population. For example, the construction of marker maps requires a much smaller population than the fine mapping of QTL (Chapters 6 and 7). Construction of a high density marker map can be achieved with as few as 200 plants but fine mapping a population in order to clone a gene usually requires over 1000 plants. One alternative is to produce a relatively large population containing about 500 or more individuals from which a subset (n ≈ 150) can then be used for the construction of a framework map as the initial step in genetic mapping. When fine mapping of a specific chromosome region is required, all the individuals in the population can be used.

With regard to the mapping power, the population size required depends on the maximum map distance that can be distinguished from random assortment and the minimum map distance at which recombination can be detected between two genetic markers (Fig. 2.8). Using a large mapping population, it is possible to map very small genetic distances between markers and also to identify weak genetic linkages. For example, one recombinant represents a 1% recombinant frequency (≈ 1 cM) for a population of 100 individuals, a 2% recombinant frequency (≈ 2 cM) for a population of 50 individuals but only a 0.1% recombination frequency (≈ 0.1 cM) for a population of 1000 individuals.

The maximum map distance that can be distinguished, max, can be determined as follows

$$\text{max} = r + t_{0.01,\,n-2}\,SE_r < 0.50\ \text{cM}$$

where n is the population size, t is Student's t parameter for a significant probability of 0.01, n − 2 is the degrees of freedom, $SE_r$ is the standard error of r and r is a point estimate of recombination frequency.

The population size required also depends on the type of population. For example, more individuals from F2 populations are required compared with BC or DH populations, because the F2 population contains more marker genotypes and to guarantee detection of each genotype, a greater number of individuals is required. In general, F2 population size should be doubled compared to BC in order to obtain the same mapping accuracy. Therefore, BC or DH populations are more suitable than F2 for genetic marker mapping. The mapping power of RIL populations is in between that of F2 and BC (DH) populations. Maximum detectable map distance and minimum resolvable map distance for F2 and BC populations are shown in Fig. 2.9.
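The arithmetic behind these population-size statements can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the procedure used by the authors: the function names are hypothetical, the standard error uses the simple binomial form sqrt(r(1 − r)/n), which is appropriate only when every individual contributes one informative gamete (as in BC or DH populations), and the Student's t quantile is supplied by the caller (for example from a t table) instead of being computed.

```python
import math


def frequency_per_recombinant(n):
    """Recombination frequency represented by one recombinant among n plants."""
    return 1.0 / n


def binomial_se(r, n):
    """Binomial standard error of r (assumes one informative gamete per plant)."""
    return math.sqrt(r * (1.0 - r) / n)


def upper_bound(r, n, t_value):
    """r + t * SE_r, where t_value is the t quantile chosen by the caller."""
    return r + t_value * binomial_se(r, n)


for size in (50, 100, 1000):
    pct = 100.0 * frequency_per_recombinant(size)
    print(f"n = {size:4d}: one recombinant = {pct:.1f}% recombination")
```

Linkage between two markers would be declared only when `upper_bound(r, n, t_value)` stays below the value expected for independent assortment.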

[Figure 2.8: plot not reproduced. The y-axis is distance between markers (cM, 0 to 50) and the x-axis is number of markers (0 to 2400), with one curve for the maximum and one for the average distance between markers.]

Fig. 2.8. Average and maximum distance expected between markers on a linkage map depending on number of random markers mapped for a genome with 1200 cM, e.g. 12 chromosomes of 100 cM each. The maximum distance curve is for 95% confidence level. From Tanksley et al. (1988) with kind permission of Springer Science and Business Media.

[Figure 2.9: plot not reproduced. The y-axis is map distance (cM, 0 to 50) and the x-axis is population size (0 to 140), with curves for the maximum detectable and minimum resolvable distances in F2 and BC populations.]

Fig. 2.9. Maximum detectable and minimum resolvable map distances between markers utilizing backcross (BC) and F2 populations. Curves are for 99% confidence level. From Tanksley et al. (1988) with kind permission of Springer Science and Business Media.

Interference and mapping functions

As the genetic distance between two markers increases, the chance of double crossing over within a marker interval increases. For three linked genes, A, B and C with $r_1$ and $r_2$ as single crossover frequencies between A-B and B-C, the double crossover frequency between A and C can be estimated as $r_1 r_2$ if the two single crossovers occur independently. However, the observed frequency of double crossovers is usually lower than that expected by calculating $r_1 r_2$ which means that a single crossover occurring in a particular chromosome region will reduce the probability of a second single crossover occurring in its flanking regions. This phenomenon is called crossover interference. The degree of interference can be measured by the coefficient of coincidence (C),

$$C = \frac{\text{Observed double crossover}}{\text{Expected double crossover}} = \frac{\text{Observed double crossover}}{r_1 r_2 n}$$

where n is the total number of individuals observed (including both recombinants and non-recombinants). When C = 0, there is complete interference and no double crossovers; this usually means that the involved chromosome region is very short. When C = 1, there is no interference, indicating that the involved chromosome region is long so that the single crossovers can occur independently.

The genetic distance estimated from the recombinant frequency will be smaller than the real distance by $2C r_1 r_2$ if the double

crossover is not taken into account. When the genetic distance between two markers is relatively large, the adverse effect of double or multiple crossovers on the estimation of recombination frequency should be corrected. The correction can help establish a reliable function between genetic distance and recombinant frequency and this correction function is known as a mapping function.

The number of (odd) crossovers (k) in an interval defined by two genetic markers has a Poisson distribution with mean θ.

$$\Pr(\text{recombination}) = \sum_{k\ \text{odd}} \frac{\theta^{k} e^{-\theta}}{k!} = e^{-\theta}\left(\frac{\theta}{1!} + \frac{\theta^{3}}{3!} + \dots + \frac{\theta^{k}}{k!}\right) = \frac{e^{-\theta}\left(e^{\theta} - e^{-\theta}\right)}{2} = \frac{1}{2}\left(1 - e^{-2\theta}\right)$$

This probability is represented by r and has the following limits: 0 ≤ r ≤ 1/2. θ is the number of map units (M) between two markers. Assuming that C = 1, Haldane (1919) derived the relationship between the map distance (cM) and recombinant frequency r by solving the equation for θ:

$$\theta = -\frac{1}{2}\ln(1 - 2r)$$

which is known as Haldane's map function. When r = 0, θ = 0 (complete linkage). When r = 1/2, θ = ∞ (markers are unlinked), suggesting that the markers are either on the same chromosome but distant from one another or are located on different chromosomes. When r = 22%, θ = 29 cM.

Kosambi (1944) derived a mapping function that takes the crossover interference into account:

$$\theta = \frac{1}{4}\ln\frac{1 + 2r}{1 - 2r}$$

When r = 0.22, θ = 23.6 cM. As two loci become further apart, the amount of interference allowed by the Kosambi map function decreases. For very small values of recombination (r), both Haldane and Kosambi map functions give θ ≈ r.

Segregation and linkage tests

With co-dominance and complete dominance models, populations F2, BC and DH (RIL) have the following segregation ratios at locus M with two alleles M1 and M2:

| Population | F2 | BC | DH (RIL) |
|---|---|---|---|
| Co-dominance | 1 M1M1 : 2 M1M2 : 1 M2M2 | 1 M1M2 : 1 M2M2 | 1 M1M1 : 1 M2M2 |
| M1 is dominant | 3 M1_ : 1 M2M2 | 1 M1M2 : 1 M2M2 | 1 M1M1 : 1 M2M2 |

Assuming two genetic loci M and N each with two alleles, M1, M2 and N1, N2 and a recombinant frequency r, genotypes and frequencies in an F2 population derived from two parental lines, P1 (M1M1N1N1) and P2 (M2M2N2N2) will be as shown in Fig. 2.10.

There are three types of locus combinations between two loci M and N, depending on the dominance: (1:2:1)-(1:2:1), (3:1)-(1:2:1) and (3:1)-(3:1).

By combining the genotypes listed in Fig. 2.10, nine genotypes and their frequencies can be obtained. Similarly, we can obtain genotypes and their frequencies for (3:1)-(1:2:1) and (3:1)-(3:1) linkage combinations (Fig. 2.11).

Linkage can be determined by comparing the observed frequency for each genotype with the theoretical frequency expected from Mendelian ratios. If there are n individuals, the genotypes/phenotypes listed in Fig. 2.11 can then be identified from top to bottom as n1 to n9 for (1:2:1)-(1:2:1), n1 to n6 for (3:1)-(1:2:1) and n1 to n4 for (3:1)-(3:1); linkage can be determined from these observations.

Linkage detection depends on the normal segregation of the genetic loci involved, thus each locus should be tested to ensure

that it fits Mendelian segregation.

| F2 gamete (frequency) | M1N1, (1 − r)/2 | M1N2, r/2 | M2N1, r/2 | M2N2, (1 − r)/2 |
|---|---|---|---|---|
| M1N1, (1 − r)/2 | M1M1N1N1: (1 − r)²/4 | M1M1N1N2: r(1 − r)/4 | M1M2N1N1: r(1 − r)/4 | M1M2N1N2: (1 − r)²/4 |
| M1N2, r/2 | M1M1N1N2: r(1 − r)/4 | M1M1N2N2: r²/4 | M1M2N1N2: r²/4 | M1M2N2N2: r(1 − r)/4 |
| M2N1, r/2 | M1M2N1N1: r(1 − r)/4 | M1M2N1N2: r²/4 | M2M2N1N1: r²/4 | M2M2N1N2: r(1 − r)/4 |
| M2N2, (1 − r)/2 | M1M2N1N2: (1 − r)²/4 | M1M2N2N2: r(1 − r)/4 | M2M2N1N2: r(1 − r)/4 | M2M2N2N2: (1 − r)²/4 |

Fig. 2.10. Theoretical ratios in an F2 population derived from two parents M1M1N1N1 and M2M2N2N2 with recombinant frequency r.

| (1:2:1)-(1:2:1) genotype | Frequency | (1:2:1)-(3:1) genotype | Frequency | (3:1)-(3:1) genotype | Frequency |
|---|---|---|---|---|---|
| M1M1N1N1 | (1 − r)² | M1M1N1_ | 1 − r² | M1_N1_ | 3 − 2r + r² |
| M1M1N1N2 | 2r(1 − r) | M1M1N2N2 | r² | M1_N2N2 | 2r − r² |
| M1M1N2N2 | r² | M1M2N1_ | 2(1 − r + r²) | M2M2N1_ | 2r − r² |
| M1M2N1N1 | 2r(1 − r) | M1M2N2N2 | 2r(1 − r) | M2M2N2N2 | 1 − 2r + r² |
| M1M2N1N2 | 2(1 − 2r + 2r²) | M2M2N1_ | 2r − r² | | |
| M1M2N2N2 | 2r(1 − r) | M2M2N2N2 | (1 − r)² | | |
| M2M2N1N1 | r² | | | | |
| M2M2N1N2 | 2r(1 − r) | | | | |
| M2M2N2N2 | (1 − r)² | | | | |

Fig. 2.11. Genotypes and their frequencies for three linkage combinations at two loci in F2 populations (each frequency divided by 4).

For each of the three linkage combinations listed above, four χ² tests can be constructed:

$\chi^2_T$: general test
$\chi^2_M$: test to determine whether the segregation of M1 and M2 fits the Mendelian ratio
$\chi^2_N$: test to determine whether the segregation of N1 and N2 fits the Mendelian ratio
$\chi^2_L$: test to determine whether M and N loci are linked

Therefore

$$\chi^2_T = \chi^2_M + \chi^2_N + \chi^2_L$$

For linkage combination (1:2:1)-(1:2:1)

$$\chi^2_T = \frac{4}{n}\left\{n_5^2 + 2\left(n_2^2 + n_4^2 + n_6^2 + n_8^2\right) + 4\left(n_1^2 + n_3^2 + n_7^2 + n_9^2\right)\right\} - n, \qquad df_T = 8$$

$$\chi^2_M = \frac{2}{n}\left\{2(n_1 + n_2 + n_3)^2 + (n_4 + n_5 + n_6)^2 + 2(n_7 + n_8 + n_9)^2\right\} - n, \qquad df_M = 2$$

$$\chi^2_N = \frac{2}{n}\left\{2(n_1 + n_4 + n_7)^2 + (n_2 + n_5 + n_8)^2 + 2(n_3 + n_6 + n_9)^2\right\} - n, \qquad df_N = 2$$

$$\chi^2_L = \chi^2_T - \chi^2_M - \chi^2_N, \qquad df_L = 4$$

For linkage combination (3:1)-(1:2:1), with n1 to n6 taken in the order listed in Fig. 2.11 (M denotes the locus segregating 3:1 and N the locus segregating 1:2:1)

$$\chi^2_T = \frac{8}{3n}\left(2n_1^2 + n_3^2 + 2n_5^2 + 6n_2^2 + 3n_4^2 + 6n_6^2\right) - n, \qquad df_T = 5$$

$$\chi^2_M = \frac{4}{3n}\left((n_1 + n_3 + n_5)^2 + 3(n_2 + n_4 + n_6)^2\right) - n, \qquad df_M = 1$$

$$\chi^2_N = \frac{2}{n}\left(2(n_1 + n_2)^2 + (n_3 + n_4)^2 + 2(n_5 + n_6)^2\right) - n, \qquad df_N = 2$$

$$\chi^2_L = \chi^2_T - \chi^2_M - \chi^2_N, \qquad df_L = 2$$

For linkage combination (3:1)-(3:1)

$$\chi^2_M = \frac{1}{3n}\left(n_1 + n_2 - 3n_3 - 3n_4\right)^2, \qquad df_M = 1$$

$$\chi^2_N = \frac{1}{3n}\left(n_1 - 3n_2 + n_3 - 3n_4\right)^2, \qquad df_N = 1$$

$$\chi^2_L = \frac{1}{9n}\left(n_1 - 3n_2 - 3n_3 + 9n_4\right)^2, \qquad df_L = 1$$

$$\chi^2_T = \chi^2_M + \chi^2_N + \chi^2_L, \qquad df_T = 3$$

Similarly, three linkage combinations for BC or DH (RIL) populations can be constructed.

For example, linkage for (1:2:1)-(1:2:1) in an F2 population as shown in Fig. 2.12 can be tested as follows

$$\chi^2_T = \frac{4}{132}\left\{56^2 + 2\left(6^2 + 5^2 + 4^2 + 3^2\right) + 4\left(27^2 + 1^2 + 0^2 + 30^2\right)\right\} - 132 = 165.818$$

$$\chi^2_M = \frac{2}{132}\left\{2(27 + 6 + 1)^2 + (5 + 56 + 4)^2 + 2(0 + 3 + 30)^2\right\} - 132 = 0.045$$

$$\chi^2_N = \frac{2}{132}\left\{2(27 + 5 + 0)^2 + (6 + 56 + 3)^2 + 2(1 + 4 + 30)^2\right\} - 132 = 0.167$$

$$\chi^2_L = 165.818 - 0.045 - 0.167 = 165.606$$

| | M1M1 | M1M2 | M2M2 | Subtotal |
|---|---|---|---|---|
| N1N1 | 27 | 5 | 0 | 32 |
| N1N2 | 6 | 56 | 3 | 65 |
| N2N2 | 1 | 4 | 30 | 35 |
| Subtotal | 34 | 65 | 33 | 132 |

Fig. 2.12. Data example used for test of linkage for (1:2:1)-(1:2:1) in an F2 population.

We have

$$\chi^2_T > \chi^2_{0.05(8)} = 15.5$$
$$\chi^2_M < \chi^2_{0.05(2)} = 5.99$$
$$\chi^2_N < \chi^2_{0.05(2)} = 5.99$$
$$\chi^2_L > \chi^2_{0.05(4)} = 9.49$$

which indicates that both loci M and N show normal Mendelian segregation and are linked.

Maximum likelihood estimation (MLE) of recombinant frequency

To simplify, we take the linkage combination (3:1)-(3:1) (one of the alleles at each locus shows complete dominance) as an example to show how to obtain the MLE for recombination frequency. From Fig. 2.11, there are four types of phenotypes, M1_N1_, M1_N2N2, M2M2N1_ and M2M2N2N2 with theoretical frequencies $p_i$ (i = 1, 2, 3, 4). $p_i$ is a function of r, a parameter to be estimated, and f is a function of frequency:

$$p_i = f(r)$$

We have $p_1$ (M1_N1_) = (3 − 2r + r²)/4, $p_2$ (M1_N2N2) = $p_3$ (M2M2N1_) = (2r − r²)/4, $p_4$ (M2M2N2N2) = (1 − 2r + r²)/4, and $\sum p_i = 1$. Considering the number of individuals observed for each category, n1, n2, n3 and n4, and $\sum n_i = n$, they have a probability distribution of $(p_1 + p_2 + p_3 + p_4)^n$. For a specific set of observations (n1, n2, n3 and n4), the likelihood function is:

$$L(r) = \frac{n!}{n_1!\,n_2!\,n_3!\,n_4!}\,(p_1)^{n_1}(p_2)^{n_2}(p_3)^{n_3}(p_4)^{n_4} = \frac{n!}{n_1!\,n_2!\,n_3!\,n_4!}\,(1/4)^{n}\,(3 - 2r + r^2)^{n_1}(2r - r^2)^{n_2 + n_3}(1 - 2r + r^2)^{n_4}$$

The MLE of r is the value that maximizes L(r); it can be obtained by setting the derivative to zero and solving the resulting equation

$$\frac{dL(r)}{dr} = 0$$

The natural logarithm of L(r) is called support or log-likelihood. Here we have

$$\ln L(r) = C + n_1 \ln(3 - 2r + r^2) + (n_2 + n_3)\ln(2r - r^2) + n_4 \ln(1 - 2r + r^2)$$

where

$$C = \ln\frac{n!}{n_1!\,n_2!\,n_3!\,n_4!} + n\ln(1/4)$$

is a constant.

The first partial derivative is the slope of a function. The slope is zero at a stationary point (a global or local maximum, or a minimum). The partial derivative with respect to r is therefore set to zero

$$\frac{d \ln L(r)}{dr} = 0$$

The partial derivative of ln L(r) is usually denoted as score or S

$$S = -n_1\frac{2(1 - r)}{3 - 2r + r^2} + (n_2 + n_3)\frac{2(1 - r)}{2r - r^2} - n_4\frac{2(1 - r)}{1 - 2r + r^2} = 0$$

That is

$$-\frac{n_1}{3 - 2r + r^2} + \frac{n_2 + n_3}{2r - r^2} - \frac{n_4}{1 - 2r + r^2} = 0$$

$$-\frac{n_1}{2 + (1 - r)^2} + \frac{n_2 + n_3}{1 - (1 - r)^2} - \frac{n_4}{(1 - r)^2} = 0$$

If $(1 - r)^2 = k$, then

$$-\frac{n_1}{2 + k} + \frac{n_2 + n_3}{1 - k} - \frac{n_4}{k} = 0$$

therefore

$$nk^2 + (2n - 3n_1 - n_4)k - 2n_4 = 0 \qquad (n = n_1 + n_2 + n_3 + n_4)$$

$$k = \frac{-(2n - 3n_1 - n_4) + \sqrt{(2n - 3n_1 - n_4)^2 + 8nn_4}}{2n}$$

and the MLE is

$$\hat{r} = 1 - \sqrt{k}$$

According to the Rao-Cramér inequality, the sampling variance of r is

$$\frac{1}{V_r} = -E\left\{\frac{d^2[\ln L(r)]}{dr^2}\right\} = I$$

where $\frac{d^2[\ln L(r)]}{dr^2}$ is the second derivative of ln L(r) with respect to r, E is expectation, and

$$\frac{d^2[\ln L(r)]}{dr^2} = -\sum_i^k \frac{n_i}{p_i^2}\left(\frac{dp_i}{dr}\right)^2 + \sum_i^k \frac{n_i}{p_i}\frac{d^2 p_i}{dr^2}$$

$$E\left\{\frac{d^2[\ln L(r)]}{dr^2}\right\} = -n\sum_i^k \frac{1}{p_i}\left(\frac{dp_i}{dr}\right)^2 + n\sum_i^k \frac{d^2 p_i}{dr^2} = -n\sum_i^k \frac{1}{p_i}\left(\frac{dp_i}{dr}\right)^2$$

Because $\sum_i^k \frac{d^2 p_i}{dr^2} = \frac{d^2}{dr^2}\sum_i^k p_i = 0$,

$$\frac{1}{V_r} = n\sum_i^k \frac{1}{p_i}\left(\frac{dp_i}{dr}\right)^2 = n\sum_i^k i_i = I$$

where I is the total information content and $\sum_i i_i = I/n$ is the information derived from a single observation.

From the above formula, the variance of r can be calculated using the information provided in Table 2.3.

To estimate k, the values of $n_i$ listed in the table are used in the formula:

$$k = \frac{1927 + \sqrt{1927^2 + 8 \times 6952 \times 1338}}{2 \times 6952} = 0.7743$$

$$\hat{r} = 1 - \sqrt{0.7743} = 0.1201$$

$$V_r = 1.76702 \times 10^{-5}$$

Thus,

$$\hat{r} = 0.1201 \pm \sqrt{1.76702 \times 10^{-5}} = 12.01\% \pm 0.42\%$$
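The closed-form estimate and its variance can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, it covers only the (3:1)-(3:1) F2 case derived above, and it uses the identities 3 − 2r + r² = 2 + k and 2r − r² = 1 − k with k = (1 − r)². The two map-function helpers simply restate the Haldane and Kosambi formulas given earlier, scaled to centiMorgans.

```python
import math


def mle_r_dominant_f2(n1, n2, n3, n4):
    """MLE of r and its variance for two dominant (3:1) loci in an F2.

    n1..n4 are the counts of M1_N1_, M1_N2N2, M2M2N1_ and M2M2N2N2.
    Solves n*k^2 + (2n - 3*n1 - n4)*k - 2*n4 = 0 for k = (1 - r)^2.
    """
    n = n1 + n2 + n3 + n4
    b = 2 * n - 3 * n1 - n4
    k = (-b + math.sqrt(b * b + 8 * n * n4)) / (2 * n)
    r_hat = 1.0 - math.sqrt(k)
    # Information from one observation: sum over classes of (dp/dr)^2 / p.
    info_per_obs = k / (2 + k) + 2 * k / (1 - k) + 1
    variance = 1.0 / (n * info_per_obs)
    return r_hat, variance


def haldane_cm(r):
    """Haldane map distance in cM (no interference assumed)."""
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map distance in cM (allows for interference)."""
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


r_hat, var_r = mle_r_dominant_f2(4831, 390, 393, 1338)
print(f"r = {r_hat:.4f}, SE = {math.sqrt(var_r):.4f}")
print(f"Haldane: {haldane_cm(r_hat):.1f} cM, Kosambi: {kosambi_cm(r_hat):.1f} cM")
```

Small differences in the last digits relative to the values quoted in the text are to be expected, since the text rounds k before taking the square root.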

Table 2.3. Calculation of the variance of recombinant frequency for two linked loci each with complete dominance.

| Group | $n_i$ | $p_i$ | $dp_i/dr$ | $i_i = \frac{1}{p_i}\left(\frac{dp_i}{dr}\right)^2$ |
|---|---|---|---|---|
| M1_N1_ | 4831 | (3 − 2r + r²)/4 | −2(1 − r)/4 | $i_1 = \frac{(1 - r)^2}{3 - 2r + r^2}$ |
| M1_N2N2 | 390 | (2r − r²)/4 | 2(1 − r)/4 | $i_2 = \frac{(1 - r)^2}{2r - r^2}$ |
| M2M2N1_ | 393 | (2r − r²)/4 | 2(1 − r)/4 | $i_3 = \frac{(1 - r)^2}{2r - r^2}$ |
| M2M2N2N2 | 1338 | (1 − r)²/4 | −2(1 − r)/4 | $i_4 = \frac{4(1 - r)^2}{4(1 - r)^2} = 1$ |
| Total | 6952 = n | 1 | 0 | $\sum i_i = \frac{(1 - r)^2}{3 - 2r + r^2} + \frac{2(1 - r)^2}{2r - r^2} + 1$ |

This is an example of (3:1)-(3:1) linkage combination. Allard (1956) derived formulas for r and $V_r$ for almost all possible linkage combinations and for different populations.

Likelihood ratio and linkage test

In human genetics the linkage phase (repulsion or coupling) is usually unknown thus making it impossible to calculate recombinant frequency based on the observable recombinants. As a result, likelihood ratios or odds ratios (Fisher, 1935; Haldane and Smith, 1947; Morton, 1955) have been used for linkage testing. The method is based on the comparison of the probability that observed data follow an hypothesis, for example two linked loci, and the alternative hypothesis, two independent loci. The ratio of the two probabilities L(r)/L(1/2) is tested as follows: r = 1/2 is entered into the likelihood function, which gives equation (a):

$$L(1/2) = \frac{n!}{n_1!\,n_2!\,n_3!\,n_4!}\,(1/4)^{n}\,(2.25)^{n_1}(0.75)^{n_2 + n_3}(0.25)^{n_4} \qquad \text{(a)}$$

To simplify the calculation, the log base 10 of the ratio L(r)/L(1/2), known as LOD, is used

$$LOD = \log_{10}\frac{L(r)}{L(1/2)}$$

With n = 6952, n1 = 4831, n2 = 390, n3 = 393, and n4 = 1338, logarithm of odds (LOD) scores can be calculated for different r values as shown in (b):

| r | 0.05 | 0.10 | 0.12 | 0.15 | 0.20 | 0.25 | 0.30 |
|---|---|---|---|---|---|---|---|
| LOD | 586.42 | 682.51 | 688.04 | 678.52 | 632.01 | 560.79 | 472.54 |

(b)

The result indicates that LOD scores vary with r and reach the maximum when r = 0.12.

If M and N are linked, L(r)/L(1/2) > 1, and thus LOD is positive. When L(r)/L(1/2) < 1, LOD is negative.

In human genetics the likelihood ratio should be greater than 1000:1, i.e. LOD > 3 in order to establish linkage unequivocally. The concept of the likelihood ratio is now widely used in genetic mapping of other organisms including plant species to judge the reliability of linkage estimation and to verify its existence.
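The LOD values for this example can be reproduced with a short script. This is a minimal sketch with a hypothetical function name: it takes the ratio of the (3:1)-(3:1) likelihood at a trial r to the likelihood at r = 1/2, in which the multinomial coefficient and the factor (1/4)^n cancel, and works directly in base-10 logarithms to avoid overflow.

```python
import math


def lod_dominant_f2(r, n1, n2, n3, n4):
    """LOD score, log10[L(r) / L(1/2)], for two dominant loci in an F2.

    n1..n4 are the counts of M1_N1_, M1_N2N2, M2M2N1_ and M2M2N2N2.
    At r = 1/2 the three class terms equal 2.25, 0.75 and 0.25.
    """
    return (
        n1 * math.log10((3 - 2 * r + r * r) / 2.25)
        + (n2 + n3) * math.log10((2 * r - r * r) / 0.75)
        + n4 * math.log10((1 - r) ** 2 / 0.25)
    )


counts = (4831, 390, 393, 1338)
trial_r = [0.05, 0.10, 0.12, 0.15, 0.20, 0.25, 0.30]
scores = {r: lod_dominant_f2(r, *counts) for r in trial_r}

for r, lod in scores.items():
    print(f"r = {r:.2f}  LOD = {lod:.2f}")

best_r = max(scores, key=scores.get)
print("largest LOD on this grid at r =", best_r)
```

Scanning a grid of r values in this way is a simple alternative to solving the score equation analytically, and the grid maximum should agree with the MLE to within the grid spacing.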
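For comparison with the likelihood-ratio approach, the χ² partition applied earlier to the F2 data of Fig. 2.12 can also be reproduced numerically. The sketch below is an assumption-laden illustration, not the authors' code: the function name is hypothetical, it handles only the (1:2:1)-(1:2:1) case, and it computes each statistic as the sum of observed²/expected minus n, which is algebraically the same as the formulas given for that combination.

```python
MENDELIAN = (0.25, 0.5, 0.25)  # expected 1:2:1 ratio at a co-dominant locus


def chi_square_partition(counts):
    """Return (chi_T, chi_M, chi_N, chi_L) for a 3 x 3 F2 table.

    counts[i][j] is the number of plants with M genotype i and N genotype j,
    both ordered as homozygote 1, heterozygote, homozygote 2.
    """
    n = sum(sum(row) for row in counts)
    m_totals = [sum(row) for row in counts]
    n_totals = [sum(counts[i][j] for i in range(3)) for j in range(3)]

    chi_t = sum(
        counts[i][j] ** 2 / (n * MENDELIAN[i] * MENDELIAN[j])
        for i in range(3)
        for j in range(3)
    ) - n
    chi_m = sum(m_totals[i] ** 2 / (n * MENDELIAN[i]) for i in range(3)) - n
    chi_n = sum(n_totals[j] ** 2 / (n * MENDELIAN[j]) for j in range(3)) - n
    chi_l = chi_t - chi_m - chi_n  # remainder attributed to linkage
    return chi_t, chi_m, chi_n, chi_l


# Fig. 2.12 data, rows M1M1, M1M2, M2M2; columns N1N1, N1N2, N2N2.
fig_2_12 = [
    [27, 6, 1],
    [5, 56, 4],
    [0, 3, 30],
]
chi_t, chi_m, chi_n, chi_l = chi_square_partition(fig_2_12)
print(f"T = {chi_t:.3f}, M = {chi_m:.3f}, N = {chi_n:.3f}, L = {chi_l:.3f}")
```

The two single-locus statistics stay well below their critical values while the linkage component is very large, which matches the conclusion drawn from the LOD scores.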

Multi-point analysis and ordering a set of markers

The methods discussed above are all based on two-point analysis using two markers at a time. However, when more than two markers from one chromosome are considered, they can theoretically be arranged in many different orders but only one particular order will match the genetic order on the chromosome and this particular order can be determined by multi-point analysis.

Consider genetic markers M1, M2, . . . , Mm, indexed by their real locations on a chromosome. For m genetic markers, there are a total of m!/2 possible orders. Assume the recombinant frequency between two flanking markers, $M_i$ and $M_{i+1}$, is $r_i$. The objective is to find $r_1, r_2, \dots, r_{m-1}$ to maximize the likelihood L(r),

$$L(r) \propto p_1(r_1, r_2, \dots, r_{m-1})^{n_1}\, p_2(r_1, r_2, \dots, r_{m-1})^{n_2} \cdots p_m(r_1, r_2, \dots, r_{m-1})^{n_m}$$

Using the natural logarithm, the partial derivative is then set with respect to $r_1, r_2, \dots, r_{m-1}$. The EM algorithm (Dempster et al., 1977) can be used to obtain the MLE for $r_1, r_2, \dots, r_{m-1}$, which involves multiple iteration steps of Expectation (E) and Maximization (M). The multiple steps include: (i) providing an initial set of estimates, $r^{old} = (r_1, r_2, \dots, r_{m-1})$; (ii) using the initial estimates as the estimates of recombinant frequencies to obtain the E, i.e. the expected numbers of recombinants and non-recombinants in each marker interval; (iii) using these expected values as true values to obtain the MLE for $r^{new} = (r_1, r_2, \dots, r_{m-1})$; (iv) repeating steps (ii) and (iii) until the MLE has converged to its maximum.

Lander and Green (1987) provided an example of the EM method for multi-point linkage analysis. Using 15 marker intervals on human chromosome 7 determined by 16 markers and initial recombinant frequencies of $r_i$ = 0.05, the log-likelihood was found to be −351.45. To reduce the difference of log-likelihoods between two consecutive iterations to less than a given critical value (tolerance value, T = 0.01), 12 iterations were needed which resulted in convergence at log-likelihood −303.28. The probability of the observed data at the converged iteration is $10^{-303.28 - (-351.45)} \approx 10^{48}$ times higher than that for the initial $r_i$ = 0.05.

Linkage mapping in the presence of genotyping errors

As generating marker data is time consuming and expensive, maximum use should be made of the information generated. Without accounting for genotyping errors, each error in a non-terminal marker causes two apparent recombinations in the dataset. Thus every 1% error rate in a marker adds 2 cM of inflated distance to the map. If there is an average of one marker every 2 cM, then an average of a 1% error rate will double the size of the map. There will be large distances between adjacent markers with very high error rates. These cases can be detected, either manually or automatically, and the markers removed. Such genotyping errors can be identified by simply sorting the marker data by a given linkage order to determine whether there are a large number of crossovers involved.

For the markers with low error levels that cannot be detected easily, the best strategy is to integrate error detection with the map-building procedure. Cartwright et al. (2007) extended the traditional likelihood model used for genetic mapping to include the possibility of genotyping errors. Each individual marker is assigned an error rate which is inferred from the data as are the genetic distances. A software package, TMAP, was developed to use this model to identify maximum-likelihood maps for phase-known pedigrees. The methods were tested using a data set in Vitis and a simulated data set, which confirmed that the method dramatically reduced the inflationary effect caused by increasing the number of markers and resulted in more accurate orders.

Molecular maps in plants

Table 2.4 lists some representative molecular maps that have been developed for major crop plants including legumes, cereals and clonal crops, which vary in marker density,
Table 2.4. Representative genetic maps in plants.

| Crop | Marker and mapping population | Map information | Reference |
|---|---|---|---|
| Azuki bean | SSR, RFLP, AFLP; 187 BC1F1 (JP81481 × Vigna nepalensis) | 486 markers mapped into 11 linkage groups spanning 832.1 cM with an average marker distance of 1.85 cM, 95% genome coverage | Han et al. (2005) |
| Barley | AFLP, SSR, STS, and vrs1; 95 RILs (Russia 6 × H.E.S. 4) | 1172 markers with a total distance of 1595.7 cM, and average marker density of 1.4 cM per locus | Hori et al. (2003) |
| Barley | SNP, SSR, RFLP, AFLP; three DH populations | 1237 markers, based on three mapping populations consisted of 1237 loci, with a total map length of 1211 cM and an average marker density of 1 locus per cM | Rostoks et al. (2005) |
| Lettuce | AFLP, RFLP, SSR, RAPD; seven inter- and intraspecific populations | 2744 markers assigned to nine linkage groups that spanned a total of 1505 cM. The mean interval between markers is 0.7 cM | Truco et al. (2007) |
| Maize | SSR markers; one intermated RIL (IBM) and two immortalized F2s | The IBM map: 748 SSR and 184 RFLP markers with a total map length of 4906 cM; two immortalized F2 maps: 457 and 288 SSR markers with total map length of 1830 and 1716, respectively | Sharopova et al. (2002) |
| Maize | cDNA probes; two RIL populations: IBM (B37 × Mo17) and LHRF (F2 × F252) | Framework maps: 237 and 271 loci in IBM and LHRF populations, that both maps contain 1454 loci (1056 on IBM_Gnp2004 and 398 on LHRF-Gnp2004) corresponding to 954 cDNA probes | Falque et al. (2005) |
| Oat | RFLP, AFLP, RAPD, STS, SSR, isozyme, morphological; 136 F6:7 RIL (Ogle × TAM O-301) | 426 loci (with 243 loci each) spanning 2049 cM of the oat genome | Portyanko et al. (2001) |
| Pearl millet | RFLP and SSR; four populations | A consensus genetic map: 353 RFLP and 65 SSR markers, marker density in four maps ranged from 1.49 cM to 5.8 cM | Qi et al. (2004) |
| Potato | AFLP markers; heterozygous diploid potato | > 10,000 AFLP loci, with marker density proportional to physical distance and independent of recombination frequency | van Os et al. (2006) |
| Rice | 726 markers; 113 BC1 (BS125 × WL02) × BS125 | 726 markers with a total distance of 1491 cM and average marker density of 4.0 cM on the framework map, and 2.0 cM overall | Causse et al. (1994) |
| Rice | 2275 markers; 186 (Nipponbare × Kasalath) F2 | 2275 markers with a total distance of 1521.6 cM, and average marker density of 0.67 cM per locus | Harushima et al. (1998) |
Sorghum 2590 PCR-based markers and 137 RIL The 1713 cM map encompassed 2926 loci Menz et al. (2002)
(BTx623 IS3620C)
RFLP probes; 65 F2 (Sorghum bicolor The S. bicolor S. propinquum map is composed of 2512 loci, Bowers et al. (2003a)
Sorghum propinquum) spanning 1059.2 cM, a marker per 0.4 cM
Sweet potato AFLP; (Tanzania Bikilamaliya) 632 (Tanzania) and 435 (Bikilamaliya) AFLP markers, with Kriegner et al. (2003)
F2 population a total of 3655.6 cM and 3011.5 cM, and a marker per 5.8 cM
and 6.9 cM, respectively
Wheat SSR and DArT markers; 152 RILs from a 14 linkage groups, 690 loci (197 SSR and 493 DArT markers), Peleg et al. (2008)
cross between durum wheat and wild spanning 2317 cM, a marker per 7.5 cM
emmer wheat
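
The EM iteration outlined before Table 2.4 (an expectation step, a maximization step, and repetition until the log-likelihood changes by less than a tolerance) can be illustrated on a much smaller problem than the multi-point analysis of Lander and Green (1987). The sketch below is only an illustration under stated assumptions: it estimates a single recombination fraction r between two codominant markers in an F2 population, where double heterozygotes are the "missing data" because they may carry either zero or two recombinant gametes. The function name, the default starting value and the example counts are hypothetical and not taken from the text.

```python
import math


def em_recombination_f2(n0, n1, n2, nd, r_init=0.25, tol=1e-8, max_iter=200):
    """EM estimate of the recombination fraction r for two codominant
    markers in an F2 population (illustrative two-point case only).

    n0: individuals carrying 0 recombinant gametes (AABB, aabb)
    n1: individuals carrying 1 recombinant gamete (AABb, AaBB, Aabb, aaBb)
    n2: individuals carrying 2 recombinant gametes (AAbb, aaBB)
    nd: double heterozygotes (AaBb), carrying either 0 or 2
    Assumes 0 < r < 1 throughout, so the logarithms are defined.
    """
    n_gametes = 2 * (n0 + n1 + n2 + nd)

    def log_lik(r):
        # Log-likelihood up to an additive constant.
        return ((2 * n0 + n1) * math.log(1 - r)
                + (n1 + 2 * n2) * math.log(r)
                + nd * math.log((1 - r) ** 2 + r ** 2))

    r = r_init
    ll = log_lik(r)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # E-step: probability that a double heterozygote carries two
        # recombinant gametes, given the current estimate of r.
        p_two = r ** 2 / (r ** 2 + (1 - r) ** 2)
        expected_recombinants = n1 + 2 * n2 + 2 * nd * p_two
        # M-step: treat the expected count as if it had been observed.
        r = expected_recombinants / n_gametes
        new_ll = log_lik(r)
        converged = abs(new_ll - ll) < tol
        ll = new_ll
        if converged:
            break
    return r, ll, iteration


# Hypothetical counts for 1000 F2 individuals, chosen to equal the
# expected class sizes when r = 0.1.
r_hat, log_likelihood, n_iter = em_recombination_f2(405, 180, 5, 410)
print(r_hat, log_likelihood, n_iter)
```

As in the example described in the text, the stopping rule compares the log-likelihoods of two consecutive iterations with a tolerance value; only the scale of the problem differs.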

and genomic coverage. For example, crops such as barley, maize, potato, rice, sorghum and wheat have high-density genetic maps while cassava, Musa, oat, pearl millet, sweet potato and yam have less saturated maps. The large variation in map length results from differences in the number of chromosomes and total size of the genomes as well as from the use of different numbers of markers (increasing the number of markers will generally give a larger total map length up to a certain threshold), the inclusion of skewed markers (that tend to exaggerate map distances) and the use of different mapping software (which vary in estimates of genetic distances). In addition, many published maps report more linkage groups than the basic chromosome number of that species. This is frequently the result of insufficient marker density as most saturated maps can be directly aligned with the basic chromosome complement (Tekeoglu et al., 2002).

The sophistication of molecular map construction has developed from the RFLP maps of the 1980s to PCR-based markers of the 1990s to more integrated maps, as a result of the use of different types of molecular markers including genic markers, over the past decade. Linkage maps have been used in gene mapping for major genes and QTL (Chapters 6 and 7), MAS (Chapters 8 and 9) and map-based gene cloning (Chapter 11).

2.2.3 Integration of genetic maps

Integration of conventional and molecular maps

During the period 1980-1990 molecular maps were developed for many plant species. The first generation of molecular maps has been integrated with conventional genetic maps constructed using morphological and isozyme markers through cytological markers and markers shared by different maps. The 12 molecular linkage groups in rice (McCouch et al., 1988) were assigned to classical linkage groups using trisomics for each of the 12 rice chromosomes. Shared markers and those which segregate in the population can be integrated with the molecular linkage map by using the same population for both conventional and molecular markers. As only very few morphological markers can segregate simultaneously in one population, integration of many of these markers requires multiple populations each with an available preliminary molecular map. If a complete linkage map for morphological markers is available, the positions of these markers relative to molecular markers can be inferred from the linkage relationship revealed by both morphological and molecular markers. In addition, morphological markers, including some traits of agronomic importance, can be mapped much more precisely if they are integrated with a dense molecular map and this has now become an integral step in trait and gene mapping. Integration of conventional and molecular maps has been very successful for crop plants for which relatively complete genetic linkage maps are available as a result of the use of morphological markers.

Some representative examples of such maps include rice, maize, tomato and soybean. In rice, 39 morphological markers and 82 RFLP markers were mapped together based on the segregation analysis of 19 F2 populations derived from the crosses between indica cultivar IR24 and japonica lines with different morphological markers (Ideta et al., 1996). In tomato, a number of morphological and isozyme markers were mapped with respect to RFLP markers by orienting the molecular linkage map to both morphological and cytological maps. An integrated high-density RFLP-AFLP map of tomato based on two independent Lycopersicon esculentum × Lycopersicon pennellii F2 populations was constructed (Haanstra et al., 1999), which spanned 1482 cM and contained 67 RFLP and 1175 AFLP markers. Integrated maps were also developed for maize (Neuffer et al., 1997; Lee et al., 2002) and soybean (Cregan et al., 1999).

Integration of multiple molecular maps

For many crop plants, several molecular maps have been constructed using different populations. These populations are of

variable size and structure and maps have been created using different numbers and types of markers. To build an integrated reference or consensus map, the order and genetic distance between specific markers is compared across populations and maps. Stam (1993) developed a computer program, JOINMAP, for the construction of genetic linkage maps for several types of mapping populations: BC1, F2, RILs, DHs and outbreeder full-sib family. JOINMAP can be used to combine (join) data derived from several sources into an integrated map.

For each crop all the molecular maps developed from different populations will finally be integrated into a consensus map. This process has been very successful for several major crops and it can be expected that it will be extended to all crops when sufficient maps become available. In wheat, an SSR consensus map was constructed by fusing several genetic maps to maximize the integration of genetic mapping information from different sources (Somers et al., 2004). In cotton, chromosome identities were assigned to 15 linkage groups in the RFLP joinmap developed from four intraspecific cotton (Gossypium hirsutum L.) populations with different genetic backgrounds (Ulloa et al., 2005). In maize, two populations of intermated RILs (IRILs) were used to build a consensus map, the first panel (IBM) was derived from B73 × Mo17 and the second panel (LHRF) from F2 × F252. Framework maps of 237 loci were built from the IBM panel and 271 loci from the LHRF panel. Both maps were used to locate 1454 loci (1056 on map IBM_Gnp2004 and 398 on map LHRF_Gnp2004) that corresponded to 954 previously unmapped cDNA probes (Falque et al., 2005). In barley, Wenzl et al. (2006) built a high-density consensus linkage map from the combined data sets of ten populations, most of which were simultaneously typed with DArT and SSR, RFLP and/or STS markers. The map comprised 2935 loci (2085 DArT, 850 other loci), spanned 1161 cM and contained a total of 1629 bins (unique loci). The arrangement of loci was very similar to, and almost as optimal as, the arrangement of loci in component maps created for individual populations.

Integration of genetic and physical maps

Integrated genetic and physical genome maps are extremely valuable for map-based gene isolation, comparative genome analysis and as sources of sequence-ready clones for genome sequencing projects. A well-defined correlation between the physical and genetic maps will greatly facilitate molecular breeding efforts through associating candidate genes with important biological or agronomic traits, positional cloning and comparative analysis across populations and species, and whole genome sequences, which will in turn facilitate the development of various molecular breeding tools.

Various methods have been developed for assembling physical maps of complex genomes and integrating them with genetic maps. To create an integrated genetic and physical map resource for maize, a comprehensive approach was used that included three core components (Cone et al., 2002). The first was a high-resolution genetic map that provided essential genetic anchor points for ordering the physical map and for utilizing comparative information from other smaller genome plants. The physical map component consisted of contigs (sets of overlapping fingerprint clones) assembled from clones from three deep-coverage genomic libraries. The third core component was a set of informatics tools designed to analyse, search and display the mapping data. In rice, most of the genome (90.6%) was anchored genetically by overgo hybridization, DNA gel blot hybridization and in silico anchoring (Chen et al., 2002). In wheat, the genetic-physical map relationship of microsatellite markers was established using the deletion bin system (Sourdille et al., 2004). In sorghum, Klein et al. (2000) developed a high-throughput PCR-based method for building bacterial artificial chromosome (BAC) contigs and locating BAC clones on the genetic map in order to construct an integrated genetic and physical map. It was found that 30% of the overlapping BACs aligned by AFLP analysis provided information for merging contigs and singletons that could not

be joined using fingerprint data alone. In the grasses Lolium perenne and Festuca pratensis, the physical map was integrated, using genomic in situ hybridization, with a genetic map composed of 104 F. pratensis-specific AFLPs. The integrated map demonstrated the large-scale analysis of the physical distribution of AFLPs and variation in the relationship between genetic and physical distance from one part of the F. pratensis chromosome to another (King et al., 2002).

An integrated genetic and physical mapping tool has been developed by the Maize Mapping Project, Columbia, Missouri, USA (http://www.maizemap.org/iMapDB/iMap.html). Contigs that were assembled by fingerprinting and the automated matching of BACs were then anchored on to IBM2 and IBM2 neighbour maps. In the Gramene database, a web-based tool, CMAP, was developed to allow users to view comparisons of genetic and physical maps (Ware et al., 2002). In addition, an integrated bioinformatic tool, the Comparative Map and Trait Viewer (CMTV), was developed to construct consensus maps and compare QTL and functional genomics data across genomes and experiments (Sawkins et al., 2004). All these tools can be used to build integrated maps based on shared markers and a reference map to initiate the process. The integration of genetic, cytological and physical maps is illustrated in the example shown in Fig. 3.6.
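
The idea of integrating maps through shared markers and a reference map can be made concrete with a very small sketch. The code below is not how JOINMAP, CMAP or CMTV work internally; it is only an assumed, simplified illustration in which a marker mapped on a component map is placed on the reference map by linear interpolation between the two flanking markers that both maps share. The function name and all positions are hypothetical.

```python
from bisect import bisect_right


def project_to_reference(position_cm, anchors):
    """Place a component-map position on a reference map.

    anchors: list of (component_cM, reference_cM) pairs for markers shared
    by both maps, sorted in ascending order with distinct component
    positions and the same marker order on both maps (an assumption).
    Returns None when the position lies outside the anchored region.
    """
    component_positions = [comp for comp, _ in anchors]
    if not component_positions[0] <= position_cm <= component_positions[-1]:
        return None
    # Index of the right-hand flanking anchor.
    right = min(bisect_right(component_positions, position_cm),
                len(anchors) - 1)
    (c0, r0), (c1, r1) = anchors[right - 1], anchors[right]
    fraction = (position_cm - c0) / (c1 - c0)
    return r0 + fraction * (r1 - r0)


# Hypothetical shared markers: (position on component map, position on
# reference map), both in cM.
shared = [(0.0, 0.0), (10.0, 12.0), (30.0, 30.0)]
print(project_to_reference(5.0, shared))
print(project_to_reference(20.0, shared))
```

Real consensus mapping must also deal with conflicting marker orders and population-specific recombination rates, which this interpolation ignores.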
3
Molecular Breeding Tools:
Omics and Arrays

The success of molecular breeding depends upon the various tools that can be used for the efficient manipulation of genetic variation. All kinds of omics, arrays and high-throughput technologies make it possible to carry out more large-scale genetic analyses and breeding experiments than ever before. These technologies have been incorporated into many novel genetic and breeding processes, some of which were described in Chapter 2. In this chapter, microarrays, high-throughput technologies and several aspects of genomics will be briefly discussed to provide some of the fundamental knowledge required for molecular breeding.

3.1 Molecular Techniques in Omics

Developments in molecular techniques have contributed to the various fields of omics, which include genomics, transcriptomics, proteomics, metabolomics and phenomics. These underlying developments include advanced gel, hybridization and expression systems, cell imaging by light and electron microscopy, high density microarrays and array experiments, and genetic readout experiments.

Using proteomics as an example, classical techniques used in proteomics involve the use of two-dimensional gel electrophoresis (2DE). The proteins can be identified by excising the spot from the gel, digesting the polypeptide into smaller peptide fragments with specific proteases, and sequencing the peptides directly or analysing them by mass spectrometry (MS). Although this method is still useful and widely used, it is limited in sensitivity, resolution, and the range of abundance of the different proteins in the sample (Zhu et al., 2003; Baginsky and Gruissem, 2004). For example, abundant proteins in the sample dominate the gel whereas less abundant proteins might not be visible. New approaches involve both improved separation methods and advanced detection equipment, and several other new technologies are available for use in proteomic research (Kersten et al., 2002; Zhu et al., 2003; De Hoog and Mann, 2004). New detection methods and proteomic technologies are also being developed in an array format, which is increasingly being focused on protein-protein interactions, post-transcriptional modification, and elucidation of three-dimensional protein structure.

3.1.1 2-Dimensional gel electrophoresis

2DE is a form of gel electrophoresis commonly used to analyse proteins. Mixtures of

Yunbi Xu 2010. Molecular Plant Breeding (Yunbi Xu) 59


proteins are separated by two properties in two dimensions in 2DE. During the early years of proteomics and until recently, profiling of protein expression relied primarily on the use of two-dimensional polyacrylamide gel electrophoresis (2D PAGE), which was later combined with MS. The basic procedure is to solubilize the protein contents of an entire cell population, tissue or biological fluid, followed by separation of the protein components in the lysate using 2DE and visualization of the separated proteins with silver staining. This approach allows only a limited display of the total protein content and can identify only the relatively abundant proteins.

2DE begins with one-dimensional electrophoresis and then separates the molecules by a second property in a direction at 90° to the first. In this technique proteins are separated in one dimension by isoelectric point and in the second dimension by mass. In one-dimensional electrophoresis, proteins (or other molecules) are separated in one dimension, so that all the proteins/molecules in one lane will be separated from one another according to the differences in a particular property (e.g. isoelectric point) between each component. The result is a gel with proteins separated out on its surface (Fig. 3.1a). The proteins can then be visualized by a variety of staining methods; the most commonly used stains are silver nitrate and Coomassie blue. By combining electrophoresis with MS, individual proteins can be profiled (Fig. 3.1b, c) and theoretical and acquired MS profiles can be matched by a database search.

An important development in 2D PAGE is the use of immobilized pH gradients

[Figure 3.1 image: (a) two-dimensional gel, pI (10 to 3) against molecular weight, followed by trypsin digestion and separation of peptides over time; (b) peptide chromatography and ESI, with q1 and q2; (c) MS spectrum, intensity (arbitrary units) against m/z; (d) MS/MS spectrum of the peptide LLEAAAQSTK, 516.27 (2+), with a, b and y ions labelled.]

Fig. 3.1. Standard protein analysis by two-dimensional electrophoresis followed by mass spectrometry proteomics. (a) Protein is separated by two-dimensional electrophoresis: in one dimension by isoelectric point (pI) and in the second dimension by mass (molecular weight). Individual peptides are obtained using trypsin to cleave peptide chains. (b) Peptides are separated by chromatography and then ionized using electrospray ionization (ESI): they pass through the first quadrupole (q1) and collision chamber (q2). (c) Individual ions are separated based on their mass-to-charge ratio (m/z) by a mass analyser. (d) From the MS spectrum, an individual peptide ion (516.27 (2+)) is selected for MS/MS analysis to produce peptide ion fragmentation patterns. Letters S, Q, A, A, E, L and L represent amino acids in the selected peptide and a2, b2, y3, etc. represent different ions.
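
The relation between a peptide sequence and the m/z values in Fig. 3.1 can be checked with a short calculation. The sketch below is an illustration only: it uses standard monoisotopic residue masses (rounded to five decimals, and listed only for the residues needed), adds one water for the intact peptide and one proton per charge, and computes the m/z of the doubly charged peptide LLEAAAQSTK and of a singly charged y ion. The function names are hypothetical. The precursor value comes out at about 516.29, close to the 516.27 labelled in the figure.

```python
# Monoisotopic residue masses (Da) for the residues in LLEAAAQSTK only.
MONOISOTOPIC_RESIDUE_MASS = {
    "A": 71.03711, "E": 129.04259, "K": 128.09496, "L": 113.08406,
    "Q": 128.05858, "S": 87.03203, "T": 101.04768,
}
WATER = 18.01056   # mass of H2O added to the residue sum
PROTON = 1.00728   # mass of each charge-carrying proton


def peptide_mz(sequence, charge):
    """m/z of an intact peptide carrying `charge` protons."""
    neutral = sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral + charge * PROTON) / charge


def y_ion_mz(sequence, n_residues):
    """m/z of the singly charged y ion made of the last n residues."""
    fragment = sequence[-n_residues:]
    residues = sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in fragment)
    return residues + WATER + PROTON


peptide = "LLEAAAQSTK"
print(round(peptide_mz(peptide, 2), 2))   # doubly charged precursor
print(round(y_ion_mz(peptide, 3), 2))     # y3 ion (residues S, T, K)
```

Database search engines perform the same kind of comparison between theoretical and acquired masses, but across every predicted peptide of a proteome.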

(IPGs) in which a pH gradient is fixed within the acrylamide matrix (Görg et al., 1999). Because a wide or narrow pH range can be fixed within the gel, IPGs can be used to detect thousands of spots on a single gel with high reproducibility. A variation on this theme is the use of so-called zoom gels in which the protein content of an individual sample is first fractionated into narrow pH ranges under low resolution and then each fraction is subjected to high-resolution separation by 2D PAGE. Another innovation in 2DE is differential in-gel electrophoresis (DIGE; Ünlü et al., 1997) in which two pools of proteins are labelled with different fluorescent dyes. The labelled proteins are mixed and separated in the same 2DE.

Some of the main challenges facing expression proteomics, be it using 2D PAGE or any other approach, include the great dynamic range of protein abundance and a wide range of protein properties including mass, isoelectric point, extent of hydrophobicity and post-translational modifications (Hanash, 2003). Reducing sample complexity prior to analysis, for example by analysing protein subsets and subcellular organelles separately, improves the reach of 2DE or other separation techniques for the quantitative analysis of low-abundance proteins. The isolation of sub-proteome components may be combined with protein tagging to further enhance sensitivity. For example, protein tagging technologies have been implemented for the comprehensive analysis of the cell-surface proteome (Shin et al., 2003).

Even with all the improvements that could be introduced, 2DE will probably remain a rather low-throughput approach that requires a relatively large amount of sample for analysis. The latter is particularly problematic when the samples to be analysed are of limited availability (Hanash, 2003). In particular, the use of laser-capture microdissection, which allows defined cell types to be isolated from tissues, yields a very small amount of protein that is difficult to reconcile with the large amounts needed for 2DE.

3.1.2 Mass spectrometry

MS is an analytical technique used to determine the composition of a physical sample by measuring the mass-to-charge ratio of the ions. It has become the method of choice for analysis of complex protein samples (Han et al., 2008). MS-based proteomics has established itself as an indispensable technology for interpreting the information encoded in genomes; this has been made possible by technical and conceptual advances in many areas, most notably the discovery and development of protein ionization methods as recognized by the award of the Nobel prize for chemistry to John B. Fenn and Koichi Tanaka in 2002. Mass spectrometry instrumentation has made strides in recent years in terms of dynamic range and sensitivity (Blow, 2008).

Mass spectrometric measurements are carried out in the gas phase on ionized analytes. Mass spectrometers consist of three essential parts; the first, an ionization source, converts molecules into gas-phase ions. Once ions are created, individual ions are separated based on their mass-to-charge ratio (m/z) by a second device, a mass analyser, and transferred by magnetic or electric fields to the third, an ion detector (Fig. 3.1b, c and d). The mass analyser is central to the technology. It uses a physical property to separate ions of a particular m/z value that then strike the ion detector. The magnitude of the current that is produced at the detector as a function of time (i.e. the physical field in the mass analyser is changed as a function of time) is used to determine the m/z value of the ion. In the context of proteomics, its key parameters are sensitivity, resolution, mass accuracy and the ability to generate information-rich ion mass spectra from peptide fragments. The technique has several applications, including identifying unknown compounds by the mass of the compound molecules or their fragments, determining the isotopic composition of an element and its structure by observing the fragmentation, quantifying the amount of a compound in a sample using carefully designed methods and studying the fundamentals of gas phase ion chemistry.
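
Two of the key parameters named above, mass accuracy and resolution, are usually reported as simple ratios. The sketch below uses the common conventions of expressing mass error in parts per million (ppm) and resolving power as m/z divided by peak width; these conventions are not spelled out in the text. The function names and the input values are hypothetical, and the theoretical m/z of 516.29 is assumed for illustration.

```python
def mass_error_ppm(observed_mz, theoretical_mz):
    """Mass error in parts per million (negative if observed is lower)."""
    return (observed_mz - theoretical_mz) / theoretical_mz * 1e6


def resolving_power(mz, peak_width):
    """Resolving power as m/z divided by the peak width at that m/z."""
    return mz / peak_width


# Hypothetical example: an ion observed at m/z 516.27 against an assumed
# theoretical value of 516.29, and a peak 0.05 m/z units wide at m/z 500.
print(mass_error_ppm(516.27, 516.29))
print(resolving_power(500.0, 0.05))
```

Expressing error relative to m/z makes instruments comparable across the mass range, which is why ppm rather than absolute error is normally quoted.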

There are many types of mass analysers which use static or dynamic fields and magnetic or electric fields. Each analyser type has its strengths and weaknesses. Four basic types of mass analyser used in proteomic research are: ion trap, time-of-flight (TOF), quadrupole and Fourier transform mass spectrometry (FT-MS) analyser. In ion-trap analysers, the ions are first captured or trapped for a certain time interval and are then subjected to MS or tandem MS (MS/MS) analysis. Ion traps are robust, sensitive and relatively inexpensive. A disadvantage is their relatively low mass accuracy, due in part to the limited number of ions that can be accumulated at their point-like centre before space-charging distorts their distribution and thus the accuracy of the mass measurement. The linear or two-dimensional ion trap is a recent development where ions are stored in a cylindrical volume that is considerably larger than that of the traditional, three-dimensional ion traps, allowing increased sensitivity, resolution and mass accuracy. The FT-MS instrument is also a trapping mass spectrometer, although it captures the ions under high vacuum in a high magnetic field. It measures mass by detecting the image current produced by ions cyclotroning in the presence of a magnetic field. Its strengths are high sensitivity, mass accuracy, resolution and dynamic range. In spite of the enormous potential, the expense, operational complexity and low peptide-fragmentation efficiency of FT-MS instruments have limited their routine use in proteomic research (Aebersold and Mann, 2003). The TOF analyser uses an electric field to accelerate the ions through the same potential and then measures the time they take to reach the detector.

Techniques for ionization have been key to determining what types of samples can be analysed by MS. Electrospray ionization (ESI; Fenn et al., 1989) and matrix-assisted laser desorption/ionization (MALDI; Karas and Hillenkamp, 1988) are the two techniques most commonly used to volatilize and ionize proteins or peptides for MS analysis while inductively coupled plasma sources are used primarily for metal analysis on a wide array of sample types. MALDI is usually coupled to TOF analysers that measure the mass of intact peptides, whereas ESI has mostly been coupled to ion traps and triple quadrupole instruments and used to generate fragment ion spectra (collision-induced spectra) of selected precursor ions (Aebersold and Goodlett, 2001). ESI creates ions by application of a potential to a flowing liquid causing the liquid to charge and subsequently spray. The electrospray creates very small droplets of solvent-containing analyte. Solvent is removed by heat or some other form of energy (e.g. energetic collisions with a gas) as the droplets enter the mass spectrometer and multiply-charged ions are formed in the process. ESI ionizes the analytes out of a solution and is therefore readily coupled to liquid-based (for example, chromatographic and electrophoretic) separation tools (Fig. 3.1). MALDI creates ions by excitation of molecules that are isolated from the energy of the laser by an energy-absorbing matrix. The laser energy strikes the crystalline matrix to cause rapid excitation of the matrix and subsequent ejection of matrix and analyte ions into the gas phase. MALDI-MS is normally used to analyse relatively simple peptide mixtures, whereas integrated liquid-chromatography ESI-MS systems (LC-MS) are preferred for the analysis of complex samples.

Key developments leading to improved detection of proteins include TOF MS and relatively non-destructive methods for converting proteins into volatile ions (Zhu et al., 2003). MALDI and ESI have made it possible to analyse large molecules such as peptides and proteins. Although MALDI-TOF MS is a relatively high-throughput method compared with ESI, the latter is more easily coupled with separation techniques such as LC or high pressure LC (HPLC) (Zhu et al., 2003). This has provided an attractive alternative to 2DE, because even low-abundance proteins and insoluble transmembrane proteins can be detected (Ferro et al., 2002; Koller et al., 2002). Other MS techniques include gas chromatography-mass spectrometry (GC-MS), and ion mobility spectrometry/mass spectrometry (IMS/MS). All MS-based techniques require a substantial and searchable database of predicted proteins, ideally

representing the entire genome. Protein identification is possible by comparing the deduced masses of the resolved peptide fragments with the theoretical masses of predicted peptides in the database.

Mass spectrometers are restricted in the number of ions that can be detected at any point in time. Pre-fractionation of proteins on the basis of isolation of specific cell types or subcellular organelles is often necessary to reduce the complexity (Lonosky et al., 2004). Another method of fractionating a complex sample is to introduce a chromatographic technique before MS analysis. This method, referred to as multidimensional protein identification technology (MudPIT) (Whitelegge, 2002), has been used to conduct a shotgun survey of metabolic pathways in the leaves, roots and developing seeds of rice (Koller et al., 2002). Compared with 2DE-MS, each method identifies unique proteins, supporting the complementary nature of the different proteomic technologies.

3.1.3 Yeast two-hybrid system

The yeast two-hybrid assay (Fields and Song, 1989) provides a genetic approach to the identification and analysis of protein-protein interactions. Yeast two-hybrid (Y2H) systems detect not only members of known complexes but also weak or transient interactions (Jansen et al., 2005). The Y2H assay makes use of the molecular organization found in many transcription factors that have a DNA-binding domain and activation domains that can function independently, but when these domains are fused to two proteins that interact, the ability of the domains to control transcriptional activity is reconstituted. In this assay hybrid proteins are generated that fuse a protein X to the DNA-binding domain and protein Y to the activation domain of a transcription factor (Fig. 3.2a). Interaction between X and Y reconstitutes the activity of the transcription factor and leads to expression of a reporter gene with a recognition site for the DNA-binding domain. In the typical practice of this method, a protein of interest fused to the DNA-binding domain (the so-called bait) is screened against a library of activation-domain hybrids (prey) to select interaction partners (Phizicky et al., 2003).

The key advantages of the Y2H assay are its sensitivity and flexibility (Phizicky et al., 2003). The sensitivity derives in part from overproduction of protein in vivo, their designed direction to the nuclear compartment where interactions are monitored, the large number of variable inserts of the interacting proteins that can be examined at once, and the potency of the genetic selections. This sensitivity leads to the detection of interactions with dissociation constants around 10⁻⁷ M, which is in the range of most weak protein interactions found in the cell and is more sensitive than co-purification. It also allows detection of certain transient interactions that might affect only a subpopulation of the hybrid proteins. Flexibility of the assay is provided by calibration to detect interactions of varying affinity by altering the expression levels of the hybrid proteins, the number and nature of the DNA-binding sites and the composition of the selection media.

Some disadvantages of the Y2H assay include the unavoidable occurrence of false negatives and false positives (Phizicky et al., 2003). False negatives include proteins such as membrane proteins and secretory proteins that are not usually amenable to nuclear-based detection systems, proteins that failed to fold correctly and interactions dependent on domains occluded in the fusions or on post-translational modifications. False positives include colonies not resulting from a bona fide protein interaction, as well as colonies resulting from a protein interaction not indicative of an association that occurs in vivo.

There are several variations of the Y2H system. In the reverse Y2H system, induced URA3 expression leads to 5-FOA being converted into the toxic substance 5-fluorouracil by Ura3p, leading to growth prohibition. Mutated or fragmented genes are created and then subjected to analysis and only loss-of-interaction mutants are able to grow in the presence of 5-FOA. In the one-hybrid system, the bait is a target DNA fragment fused to a reporter gene. Preys that are able to bind to the DNA fragment-reporter fusion will

[Figure 3.2 image: panels (a) to (e), showing protein X (or pools X1 to X96) screened against Y, against Y1, Y2, ..., Yn, or against Y1 to Y96.]

Fig. 3.2. Yeast two-hybrid approaches. (a) The yeast two-hybrid system. DNA binding and activation
domains (circles) are fused to two proteins X and Y, the interaction of X and Y leads to reporter gene
expression (arrow). (b) A standard two-hybrid search. Protein X, present as a DNA binding domain hybrid,
is screened against a complex library of random inserts in the activation domain vector (shown in square
brackets). (c) A two-hybrid array approach. Protein X is screened against a complete set of full length open
reading frames (ORFs) present as activation domain hybrids (shown as yeast transformant spotted on to
microtitre plates). (d) A two-hybrid search using a library of full length ORFs. The set of ORFs as activation-
domain hybrids (microtitre plates in square brackets) is combined to form a low-complexity library.
(e) A two-hybrid pooling strategy. Pools of ORFs as both DNA-binding domain and activation domain
hybrids (in square brackets) are screened against each other. From Phizicky et al. (2003) reprinted by
permission from Macmillan Publishers Ltd.
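
The practical appeal of the pooling strategy in Fig. 3.2e is easiest to see by counting experiments. The arithmetic below is a rough sketch under stated assumptions: a hypothetical set of 6000 ORFs (a number chosen for illustration, not taken from the text) and pools of 96, as suggested by the X1 to X96 and Y1 to Y96 labels in the figure. The function name is hypothetical.

```python
import math


def screen_counts(n_orfs, pool_size):
    """Return (pairwise tests, number of pools, pool-by-pool screens)."""
    pairwise_tests = n_orfs * n_orfs            # every bait against every prey
    n_pools = math.ceil(n_orfs / pool_size)     # last pool may be partly filled
    pooled_screens = n_pools * n_pools          # every bait pool against every prey pool
    return pairwise_tests, n_pools, pooled_screens


pairwise, pools, pooled = screen_counts(6000, 96)
print(pairwise, pools, pooled)
```

The saving is not free: each positive pool-by-pool screen still has to be resolved to identify which bait and prey within the two pools interact.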

lead to activation of the reporter genes (lacZ, HIS3 and URA3). In the repressed transactivator system, the interaction of bait–DNA binding domain fusion proteins and the prey–repressor domain fusion proteins can be detected by repression of the reporter URA3. The interaction of bait and prey enables cells to grow in the presence of 5-FOA, whereas non-interactors are sensitive to 5-FOA as a result of Ura3p production. In the three-hybrid system, the interaction of bait and prey proteins requires the presence of a third interacting molecule to form a complex. The third interacting molecule can be a protein used with a nuclear localization acting as a bridge between bait and prey to cause transcriptional activation.

Different genome-wide two-hybrid strategies have been used to analyse protein interactions in Saccharomyces cerevisiae. One approach involved screening a large number of individual proteins against a
comprehensive library of randomly generated fragments (Fig. 3.2b). A second approach used systematic one-by-one testing of every possible protein combination using a mating assay with a comprehensive array of strains (Fig. 3.2c). A third approach used a one-by-many mating strategy in which each member of a nearly complete set of strains expressing yeast open reading frames (ORFs) as DNA-binding domain hybrids was mated to a library of strains containing activation-domain fusions of full-length yeast ORFs (Fig. 3.2d). A fourth variation involved mating of defined pools of strain arrays (Fig. 3.2e). Suter et al. (2008) reviewed the current applications of Y2H and variant technologies in yeast and mammalian systems. Y2H methods will continue to play a dominant role in the assessment of protein interactomes.

3.1.4 Serial analysis of gene expression

Serial analysis of gene expression (SAGE) is a method for the comprehensive analysis of gene expression patterns. SAGE is used to produce a snapshot of the mRNA population in a sample of interest (Velculescu et al., 1995). Several variants have since been developed, most notably a more robust version, LongSAGE (Saha et al., 2002), and the most recent SuperSAGE (Matsumura et al., 2005), which enables very precise annotation of existing genes and discovery of new genes within genomes because of an increased tag length of 25–27 bp. Three principles underlie the SAGE methodology: (i) a short sequence tag (originally 10–14 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript; (ii) sequence tags can be linked together to form long serial molecules that can be cloned and sequenced; and (iii) the number of times a particular tag is observed provides the expression level of the corresponding transcript.

The principle of the technique is shown in Fig. 3.3: mRNAs are isolated from a tissue, and double-stranded cDNAs are synthesized from biotinylated oligo-dT primers. The DNA is cut with a frequent-cutting restriction enzyme (NlaIII), and the 3' extremities of the double-stranded DNAs are isolated using streptavidin (which binds biotin). The double-stranded DNA is divided into two groups, the 5' extremities of which are ligated to primers A or B. These primers contain a restriction site recognized by the enzyme BsmFI which cuts 20 nucleotides away from its recognition site. The two populations are then combined, ligated, amplified and sequenced. The four-nucleotide sequence CATG (recognized by NlaIII) allows the identification of each amplified region. The sequences obtained allow each gene to be identified uniquely: although the tag is very short (of the order of a dozen nucleotides), it is sufficient to identify the specific gene from which it derives by comparison with sequence databases.

SAGE can be used to identify the collection of genes transcribed in a given tissue or developmental stage. It also provides an estimate of the frequency of transcription of each identified gene because it is proportional to the frequency of the sequence in the total collection of sequences obtained. The study by Velculescu et al. (1995) indicated: (i) that just nine base pairs of DNA sequence are sufficient to distinguish 262,144 genes if the sequence is from a defined position in the gene; (ii) if the 9-bp sequences are placed end-to-end (concatenated) and separated by punctuation then they can be sequenced serially (analogous to the mechanism by which a computer transmits data); and (iii) a single sequencing reaction can yield information on 10–50 genes.

3.1.5 Quantitative real-time PCR

Real-time reverse-transcriptase PCR (RT-PCR), also known as quantitative real-time PCR (QRT-PCR), measures PCR amplification in real time (via fluorescence) during amplification. It enables both detection and quantification (as absolute number of copies or relative amount when normalized to DNA input or additional normalizing genes) of a
[Figure 3.3. Workflow diagram, steps as labelled: cleave with anchoring enzyme (AE); bind to streptavidin beads; divide in half and ligate to linkers (A + B); cleave with tagging enzyme (TE) and blunt end; ligate and amplify with primers A and B to form ditags; cleave with anchoring enzyme, isolate ditags, concatenate and clone. Bar chart: "SAGE profile of wild type Arabidopsis and the Arabidopsis-Pti4 line"; y-axis: number of tags; x-axis: genes (Ca/b, Pti4, PDF1.2, Di19, Lhcb5, TIP, catalase, oxygen-evolving protein, Germin1, TF MYB60, BAC clone T18N14, ATPase, Chrom. 5 clone, peroxidase).]

Fig. 3.3. Serial analysis of gene expression (SAGE).
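
As a rough illustration of the tag-counting idea behind SAGE, the following Python sketch takes the 9-bp tag that follows the 3'-most CATG (NlaIII) site of each transcript and tallies how often each tag occurs. The transcript sequences, the function names and the choice of a 9-bp tag are hypothetical placeholders; real SAGE data come from sequenced ditag concatemers rather than from full transcripts.

```python
from collections import Counter

ANCHOR = "CATG"  # NlaIII recognition site (anchoring enzyme)
TAG_LEN = 9      # illustrative tag length, matching the 9-bp estimate in the text


def possible_tags(tag_len: int = TAG_LEN) -> int:
    """Number of distinct tags of a given length (four bases per position)."""
    return 4 ** tag_len


def extract_tag(cdna: str, tag_len: int = TAG_LEN):
    """Return the tag next to the 3'-most anchor site, or None if unavailable."""
    pos = cdna.rfind(ANCHOR)
    if pos == -1:
        return None  # transcript has no anchoring-enzyme site
    start = pos + len(ANCHOR)
    tag = cdna[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def count_tags(cdnas):
    """Tally tags; each count is a proxy for transcript abundance."""
    tags = (extract_tag(seq) for seq in cdnas)
    return Counter(tag for tag in tags if tag is not None)


# Hypothetical transcripts: the first two carry the same tag.
transcripts = [
    "GGACATGAAACCCGGGTTTAAAA",
    "TTCATGAAACCCGGGAAAAAA",
    "CCCATGTTTGGGAAACCAAAA",
    "ACGTACGTAAAA",  # no CATG site, so no tag
]
profile = count_tags(transcripts)
```

Here `possible_tags()` reproduces the 262,144 figure quoted for 9-bp tags, and `profile` maps each tag to the number of transcripts that produced it.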

specific sequence in a DNA sample. The procedure follows the general principle of PCR; its key feature is that the amplified DNA is quantified as it accumulates in the reaction in real time after each amplification cycle. Two common methods of quantification are the use of fluorescent dyes that intercalate with double-stranded DNA, and modified DNA oligonucleotide probes that fluoresce when hybridized with a complementary DNA (cDNA).

Real-time RT-PCR uses fluorophores in order to detect levels of gene expression. As mRNA becomes translated at the ribosome to produce functional proteins, mRNA levels tend to roughly correlate with protein expression. In order to adapt PCR to the measurement of RNA, the RNA sample first needs to be reverse transcribed to cDNA via an enzyme known as a reverse transcriptase. The original RT-PCR technique required extensive optimization of the number of PCR cycles, so as to obtain results during logarithmic DNA amplification, before it starts to plateau. Development of PCR technology that uses fluorophores to measure DNA amplification in real time allows researchers to bypass the extensive optimization associated with normal RT-PCR.

In real-time RT-PCR, the amplified product is measured at the end of each cycle. These data can be analysed by computer software to calculate relative gene expression between several samples or mRNA copy number based on a standard curve (Fig. 3.4). By comparing cycles of linear amplification among target cDNAs/genes, the relative fold difference in expression can be measured as $2^{(\text{cycles } x - \text{cycles } y)}$. For example, comparing sample x linear at 37 and sample y linear at 27, we have $2^{37-27} = 2^{10}$, which means a 1024-fold difference in mRNA accumulation between x and y (sample x, which needs ten more cycles, contains the less abundant transcript), assuming that the sequences amplify with equal efficiency (Fig. 3.4).

[Figure 3.4. Panel A: agarose gel of conventional PCR products at increasing cycle numbers, above an amplification curve obtained with the LightCycler (fluorescence versus cycle number). Panel B: fluorescence versus amplification cycles for different starting copy numbers, with the annotation 2^10 = 1024 between 27 and 37 cycles.]

Fig. 3.4. Quantitative real-time PCR. (A) Agarose gel to show the amplification results in conventional PCR of different cycles (above) and amplification curve obtained with the LightCycler to show relative gene expression (below). (B) Relative gene expression for different mRNA copies, by which a relative fold difference in gene expression can be measured.

3.1.6 Suppression subtractive hybridization

Suppression subtractive hybridization (SSH) (Diatchenko et al., 1996) is a technique that uses PCR to quickly compare the expression of mRNA from different samples and show the relative difference in the concentration of these molecules. It can be used to enrich for differentially expressed genes. Subtracted cDNA libraries are hybridization and PCR based and result in normalization of the sample. They can be combined with full-length cDNA libraries.

SSH includes the following procedures: (i) prepare cDNAs from two stages/conditions; (ii) separately digest tester (from
the same source as sample to be tested) and driver cDNA (from a normal sample) to obtain shorter fragments; (iii) divide tester cDNA into two portions and ligate each to a different adaptor, while driver cDNA has no adaptors; (iv) hybridization kinetics lead to equalization and enrichment of differentially expressed sequences among single-strand tester molecules; and (v) ultimately generate templates for PCR amplification from differentially expressed sequences. As a result, only differentially expressed sequences are amplified exponentially.

3.1.7 In situ hybridization

In situ hybridization (ISH) is a type of hybridization that uses a labelled cDNA or RNA strand (i.e. probe) to localize a specific DNA or RNA sequence in a portion or section of tissue (in situ) or in the entire tissue. DNA ISH can be used to determine the structure of chromosomes. Fluorescent DNA ISH (FISH) can be used to assess chromosomal integrity. RNA ISH (hybridization histochemistry) is used to measure and localize mRNAs and other transcripts within tissue sections or whole mounts.

For hybridization histochemistry, sample cells and tissues are usually treated to fix the target transcripts in place and to increase access of the probe. The probe is either a labelled cDNA or, more commonly, a cRNA (riboprobe). The probe hybridizes to the target sequence at elevated temperature and the excess probe is then washed away (after prior hydrolysis using RNase in the case of unhybridized, excess RNA probe). Solution parameters such as temperature, salt and/or detergent concentration can be manipulated to remove any non-identical interactions (i.e. only exact sequence matches will remain bound). Then, the probe that was labelled with either radio-, fluorescent- or antigen-labelled bases (e.g. digoxigenin) is localized and quantitated in the tissue using autoradiography, fluorescence microscopy or immunohistochemistry. ISH can also use two or more probes labelled with radioactivity or the other non-radioactive labels to simultaneously detect two or more transcripts.

Transcriptional analysis may also be carried out by inserting a reporter gene such as lacZ or GFP (green fluorescent protein) downstream from the promoter under study. lacZ encodes β-galactosidase and its expression is detected by the blue colour obtained in the presence of X-Gal. GFP is a protein containing a chromophore which fluoresces green when excited by ultraviolet to blue light (major excitation peak near 395 nm). These reporters are used to evaluate the expression levels and identify the tissues in which the normal gene is expressed under the chosen promoter.

3.2 Structural Genomics

Genomics is a term coined by Thomas Roderick in 1986 and refers to a new scientific discipline of mapping, sequencing and analysing genomes. Genomics is now, however, undergoing a transition or expansion from the mapping and sequencing of genomes to an emphasis on genome function. To reflect this shift, genome analysis may now be divided into structural genomics and functional genomics. Structural genomics represents an initial phase of genome analysis and has a clear end point: the construction of high-resolution genetic, physical and transcript maps of an organism. The ultimate physical map of an organism is its complete DNA sequence.

There are an increasing number of terms ending in -ome and -omics. Some examples include cytomics, epigenomics, genomics, immunomics, interactome, metabolomics, ORFeome, phenomics, proteomics, secretome, transcriptomics, transgenomics, etc. Genome organization, physical mapping and sequencing will be discussed in this section. For further details, readers are referred to Primrose (1995), Borevitz and Ecker (2004), Choisne et al. (2007) and Lewin (2007).

3.2.1 Genome organization

Major differences among various genomes

Eukaryotes have large genomes, linear chromosomes with centromeres and telomeres, low gene density disrupted by introns and
highly repetitive sequences, while prokaryotes have small genomes, single and circular chromosomes (few linear) with no centromere or telomere, high gene density without introns and very few or no repetitive sequences. The genome size refers to the haploid genome since different cells within a single organism can be of different ploidy. Germ cells are usually haploid and somatic cells diploid. The size of the genome is known as the C-value and is measured by re-association kinetics. After denaturation, the rate of re-association is dependent on genome size. The larger the genome, the more repeated DNA sequences and the longer time to re-anneal, the higher the C-value. C₀t₁/₂ is the product of the DNA concentration and the time required for re-association to proceed halfway. It is directly related to the amount of DNA in the genome.

The DNA content of haploid genomes ranges from 5 × 10³ bp for viruses to 10¹¹ bp for flowering plants. Within mammals, there is only a twofold difference between the largest and smallest C-value. However, there is up to a 100-fold variation in size within flowering plants. The minimum genome size found in each phylum increases from prokaryotes to mammals (Fig. 3.5).

Among the most important food crops, rice has the smallest genome (389 Mb) (IRGSP, 2005) and wheat the largest (15,966 Mb). According to Arumuganathan

[Figure 3.5. Upper panel, plant genome sizes on a scale of 10⁹ to 10¹¹ bp: Arabidopsis thaliana (145 Mb), Oryza sativa (389 Mb), Phaseolus vulgaris (673 Mb), Sorghum bicolor (760 Mb), Musa sp. (873 Mb), Lycopersicon esculentum (953 Mb), Glycine max (1,115 Mb), Zea mays (2,504 Mb), Capsicum annuum (2,702 Mb), Hordeum vulgare (4,873 Mb), Avena sativa (11,315 Mb), Allium cepa (15,290 Mb), Triticum aestivum (15,966 Mb). Lower panel, range of DNA content (bp, 10³ to 10¹¹) for flowering plants, birds, mammals, reptiles, amphibians, bony fish, cartilaginous fish, echinoderms, crustaceans, insects, molluscs, worms, fungi, algae, bacteria, mycoplasmas and viruses.]

Fig. 3.5. DNA contents of organisms. Modified from Primrose (1995) and Arumuganathan and Earle (1991).
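
To put the size range in Fig. 3.5 into numbers, the short Python sketch below computes the fold difference between the largest and smallest plant genomes listed in the figure. The genome sizes are those given in the figure; the dictionary and function names are illustrative only.

```python
# Haploid genome sizes in Mb, as listed in Fig. 3.5.
genome_mb = {
    "Arabidopsis thaliana": 145,
    "Oryza sativa": 389,
    "Phaseolus vulgaris": 673,
    "Sorghum bicolor": 760,
    "Musa sp.": 873,
    "Lycopersicon esculentum": 953,
    "Glycine max": 1115,
    "Zea mays": 2504,
    "Capsicum annuum": 2702,
    "Hordeum vulgare": 4873,
    "Avena sativa": 11315,
    "Allium cepa": 15290,
    "Triticum aestivum": 15966,
}


def fold_range(sizes):
    """Ratio of the largest to the smallest genome size in a collection."""
    values = list(sizes.values())
    return max(values) / min(values)


smallest = min(genome_mb, key=genome_mb.get)
largest = max(genome_mb, key=genome_mb.get)
fold = fold_range(genome_mb)  # wheat versus Arabidopsis
```

For these species the ratio comes out at roughly 110, in line with the approximately 100-fold variation quoted for flowering plants.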
and Earle (1991), other crops can be grouped into seven classes: Musa, cowpea and yam (873 Mb); sorghum, bean, chickpea and pigeonpea (673–818 Mb); soybean (1115 Mb); potato and sweet potato (1597–1862 Mb); maize, pearl millet and groundnut (2352–2813 Mb); pea and barley (4397–5361 Mb); and oat (11,315 Mb).

Genome size is often correlated with plant growth and ecology and extremely large genomes may be limited both ecologically and evolutionarily. The manifold cellular and physiological effects of large genomes may be a function of selection of the major components that contribute to genome size such as transposable elements and gene duplication (Gaut and Ross-Ibarra, 2008).

Sequence complexity

Within a phylum, the number of genes in each organism is quite similar although the genome size has a 100-fold difference. It is estimated that the number of genes in flowering plants is 30,000–50,000 but the genome size variation is about 100 times (Arabidopsis versus wheat). This is because some large genomes contain a high percentage of repetitive DNA.

The proportions of different sequence components differ greatly among genomes. For example, the Escherichia coli genome consists of 100% non-repetitive sequences while tobacco contains 65% moderately repetitive and 7% highly repetitive sequences. Repetitive DNA is of two types: tandem repeats (those that are found adjacent to one another) and dispersed repeats (those that recur in unlinked genomic locations). For example, two classes of dispersed and highly repetitive DNA are SINEs (short interspersed elements), i.e. shorter than 500 bases and present in 10⁵–10⁶ copies, and LINEs (long interspersed elements), i.e. longer than 5 kb and present in at least 10⁴ copies per genome.

Repeated sequence families can sometimes function as regulators of gene expression. On the other hand, they can be non-functional entities (such as the so-called selfish DNA). Some of these sequences, such as Alu, are found to cause insertional or deletion mutations.

3.2.2 Physical mapping

Physical mapping entails constructing a physical map which consists of continuous overlapping fragments of cloned DNA that have the same linear order as found in the chromosomes from which they are derived. A series of overlapping clones or sequences that collectively span a particular chromosomal region and form a contiguous segment is called a contig. Recommended references for physical mapping include Zhang and Wing (1997), Brown (2002), Meyers et al. (2004) and Lolle et al. (2005).

DNA libraries

Large-insert DNA libraries are one of the key components in genome research. They are especially useful for genome studies in large and complex genomes. These libraries can be used in a variety of research projects such as physical mapping of chromosomes, map-based cloning of important genes, genome organization and evolution, comparative genomics and molecular breeding programmes.

A gene or DNA library is a collection of all the genes for an organism so that there is a high probability of finding any particular segment of the source DNA in the collection. To contain a colony of bacteria for every gene, a library will consist of tens of thousands of colonies or clones. The collection is represented in the form of recombinants between DNA fragments from the organism and the vector. The library has to be ordered so that each clone has been placed in a precise physical location relative to others (such as in wells of microtitre plates).

Various highly efficient cloning vectors have been used to construct DNA libraries. The most frequently used vectors are λ phages, cosmids, P1 phages and artificial chromosomes. There are various types of
artificial chromosomes including yeast artificial chromosome (YAC), bacterial artificial chromosome (BAC), binary BAC (BIBAC), P1-derived artificial chromosome (PAC), transformation-competent artificial chromosome (TAC), mammalian artificial chromosome (MAC), human artificial chromosome (HAC) and plant artificial chromosome. When the DNA is simply ligated to the vector and packaged in the phage particles, the library is said to be unamplified. In an amplified library, the original DNA has been subsequently increased by replication in bacteria.

Which DNA is cloned in libraries depends on the purpose of the research. Genomic libraries are constructed from the total nuclear DNA of an organism. In making these libraries, the DNA must be cut into clonable-size pieces as randomly as possible. Shearing or partial digestion with a frequently cutting restriction endonuclease is often used. Chromosome-specific libraries are made from the DNA of purified isolated chromosomes. A cDNA library contains a collection of cDNA clones reverse transcribed from mRNAs collected from a specific tissue or organ at a specific growth or developmental stage under a specific environment. Therefore, a cDNA library only contains the genes that are expressed in the specific conditions. Furthermore, cDNAs do not contain introns or promoters.

Functionally, gene libraries can be classified into cloning and expression libraries. Cloning libraries are constructed with cloning vectors which contain replicons, multiple cloning sites and selection markers. Clones can be multiplied by bacterial culture. Expression libraries are constructed with expression vectors which contain specific sequences that control gene expression such as promoters, Shine-Dalgarno sequences, ATG and stop codons, etc. in addition to those contained in cloning vectors. The coding products of clones can be expressed in host cells.

cDNA libraries are often expression libraries in which clone construction is such that part or all of the encoded protein is expressed in bacteria harbouring the cloned DNA. Such expression is needed in screening libraries using antibodies or enzyme activities.

In order to be confident that virtually all regions of the genome are represented at least once in a library, considerable redundancy of cloned DNA must be included in the library. The number of DNA clones (n) needed for a certain probability (P) of finding a target clone is calculated by the formula:

$$n = \frac{\ln(1 - P)}{\ln\left(1 - \frac{k}{m}\right)}$$

where k is the DNA insert size in kb and m is the haploid genome size in kb. As a rule of thumb, a library containing DNA inserts which collectively add up to three times the amount of DNA in a single gamete of the organism will provide about 95% confidence that any DNA element in the genome is represented at least once in the library. A library that has five genome-equivalent coverage (rather than three) will provide about 99% confidence of including the target element. For example, the number of BACs of an average size of 150 kb required for about 5× coverage of Arabidopsis (m = 125,000 kb; P = 0.99) is 3835. When DNA fragments are randomly distributed the probability of obtaining any DNA sequence from this library is no lower than 0.99.

Construction of large insert genomic libraries

Construction of large insert genomic libraries includes three steps: (i) development of the cloning vector; (ii) isolation of high molecular weight DNA; and (iii) preparation of insert DNA.

DEVELOPMENT OF LARGE-INSERT CLONING VECTORS. Developing a vector which can accommodate a large DNA fragment has been a difficult task. Ten kb is the maximum insert size of most plasmid vectors. As the insert size increases, the ligation and transformation efficiency decreases significantly.

The first such vector was the bacteriophage λ vector in which the size of the largest DNA insert is about 25 kb. This is
because the fixed capacity of the phage head prevents genomes that are too long being packaged into progeny particles. Cosmids are one type of hybrid vector that replicate like a plasmid but can be packaged in vitro into λ phage coats. The vector can accommodate DNA inserts as large as 45 kb.

The YAC vector was then developed, in which an insert of up to 1000 kb can be maintained. The YAC cloning system includes Tel (yeast telomeres), ARS1 (autonomously replicating sequence), CEN4 (centromere from yeast chromosome 4), URA3 (uracil) and TRP1 (tryptophan) yeast selection marker genes, Amp (ampicillin-resistance gene) and Ori (origin of replication of pBR322). Although the YAC clones played a major role in several genome projects and map-based cloning of many genes in the early 1990s, the following four problems have prevented their further use in genome studies: (i) high percentage of chimaeric clones; (ii) difficulty in DNA preparation and storage; (iii) low transformation efficiency; and (iv) instability of some inserts in yeast. In the rice cultivar Nipponbare, for example, 40% of the clones in the YAC library alone were chimaeric, thus limiting its use for genome sequencing or map-based cloning.

The BAC cloning system is based on the E. coli single copy F factor (Shizuya et al., 1992). It is easy to manipulate, screen and maintain the cloned DNA. It is non-chimaeric, and has high transformation efficiency.

To facilitate gene identification in plant species, second-generation BAC vectors such as BIBAC were constructed (Hamilton et al., 1996). A 150-kb human DNA fragment in the BIBAC vector was transferred into the tobacco genome by Agrobacterium-mediated transformation. A similar vector, called TAC, was developed and used to complement a mutant phenotype in Arabidopsis (Liu et al., 1999). Table 3.1 provides characteristics of several artificial chromosome vectors.

ISOLATION OF HIGH MOLECULAR WEIGHT DNA. Preparing quality high molecular weight (HMW) DNA (most of the DNA > 1 Mb) suitable for large insert library construction can be one of the most difficult steps in constructing a large-insert plant genomic library. There are four predominant problems involved in isolating plant nuclear DNA: (i) plant cell walls must be physically broken or enzymatically digested without damaging nuclei; (ii) chloroplasts must be separated from nuclei and/or preferentially destroyed, an important process since copies of the chloroplast genome may comprise the majority of the DNA within a plant cell; (iii) volatile secondary compounds such as polyphenols must be prevented from interacting with the nuclear DNA; and (iv) carbohydrate matrices that often form after tissue homogenization must be prevented from trapping nuclei.

Several different isolation methods have been developed. The first method was to isolate the protoplast from leaf tissue and then embed the protoplast in low-melting point agarose in the form of a plug or bead. This method is expensive and time consuming. In addition, chloroplast DNA is not separated. The development of methods to isolate nuclei from leaf tissue has dramatically improved the procedure and quality of the HMW DNA for library construction.

Table 3.1. Characteristics of artificial chromosome vectors.

Vector      Host                         Maximum size (kb)  Stability  Chimerism  DNA preparation  Plant transformation
YAC         Yeast                        1000               −          +          Difficult        No
P1          E. coli                      100                +          −          Easy             No
BAC         E. coli                      300                +          −          Easy             No
PAC         E. coli                      300                +          −          Easy             No
BIBAC/TAC   E. coli and A. tumefaciens   300                +          −          Easy             Yes
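
The clone-number formula given earlier in this section, n = ln(1 − P)/ln(1 − k/m), can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the coverage-to-probability relation P ≈ 1 − e^(−c) is the usual large-library approximation of the same formula rather than something stated in the text.

```python
import math


def clones_needed(p: float, insert_kb: float, genome_kb: float) -> int:
    """Clones required so that a given sequence is present with probability p.

    Implements n = ln(1 - P) / ln(1 - k/m), rounded up to a whole clone.
    """
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0 < insert_kb < genome_kb:
        raise ValueError("insert size must be positive and smaller than the genome")
    n = math.log(1 - p) / math.log(1 - insert_kb / genome_kb)
    return math.ceil(n)


def genome_equivalents(n_clones: int, insert_kb: float, genome_kb: float) -> float:
    """Library coverage expressed in haploid genome equivalents."""
    return n_clones * insert_kb / genome_kb


def prob_from_coverage(coverage: float) -> float:
    """Approximate probability of representation at a given coverage (1 - e^-c)."""
    return 1 - math.exp(-coverage)


# Worked example from the text: 150-kb BAC inserts, Arabidopsis (m = 125,000 kb).
n_bacs = clones_needed(0.99, 150, 125_000)
coverage = genome_equivalents(n_bacs, 150, 125_000)
```

With these inputs the unrounded value is about 3835.3, so rounding up gives one clone more than the 3835 quoted in the text, and the resulting library covers about 4.6 genome equivalents. `prob_from_coverage(3)` and `prob_from_coverage(5)` give roughly 0.95 and 0.99, matching the rule of thumb.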
PREPARATION OF INSERT DNA FOR LIGATION. The average size of DNA fragments produced by complete digestion with restriction enzymes with four- or six-base recognition sequences is too small for large insert library construction. To obtain relatively HMW restriction fragments (100–300 kb), the popular method is to partially digest the target DNA with a four-base-cut enzyme. Partial DNA digestion not only yields fragments of the desired size but also fragments the genome randomly without exclusion of any sequence.

To determine the conditions that yield a maximum percentage of fragments between 100 and 300 kb, a series of partial digestions are carried out by using different amounts of restriction enzyme for a specific digestion period. Once the optimal conditions for producing fragments between 100 and 300 kb are determined, a mass digestion using several plugs is carried out to obtain sufficient DNA for size selection. Partially digested HMW DNA is then subjected to pulsed field gel analysis.

If there is no size selection of partially digested DNA, a random library will have a preponderance of small inserts since small fragments ligate more efficiently and clones with small inserts transform with higher efficiency. Contour-clamped homogeneous electrical field (CHEF) is the most common method for separating large DNA molecules. It uses a hexagonal array of fixed electrodes and a homogeneous electrical field is generated for enhancing DNA resolution. After two rounds of size selection using the CHEF Mapper, the HMW restriction fragments must be removed from surrounding agarose before they can be used in ligation reactions. After developing the large-insert library, a number of random clones can be selected to confirm the successful cloning of the inserts and the average insert size. The average insert size will then determine how many clones are needed to achieve the desired amount of genome coverage.

Physical mapping

There are five physical mapping methods: optical mapping; restriction fragment fingerprinting; chromosome walking; sequence-tagged site (STS) mapping; and fluorescent in situ hybridization (FISH). In restriction fragment fingerprinting, individual clones are first digested with different restriction enzymes. The digested DNA is then labelled with radioactive or fluorescent dye and run on a sequencing gel. The fingerprint data are collected and analysed for contig assembly. During the procedure, markers with known map position are used as probes to screen the large insert library. Clones hybridized with the same single copy marker are considered to be overlapping. PCR amplification of DNA pools using primers derived from DNA markers with known position can also be used for physical map construction. The disadvantages of this method are that it is labour intensive and filling the gaps is difficult.

STS mapping uses a sequence-tagged site (STS), which is a short region of DNA about 200–300 bases long whose exact sequence is found nowhere else in the genome. Two or more clones containing the same STS must overlap and the overlap must include the STS. There are two disadvantages to this method: it is still very labour intensive and the primer synthesis is expensive.

FISH uses synthetic polynucleotide strands that bear sequences known to be complementary to specific target sequences at specific chromosomal locations. The polynucleotides are bound via a series of linked molecules to a fluorescent dye that can be detected with a fluorescence microscope.

In addition, physical mapping can be achieved by a combination of fingerprinting, molecular linkage mapping, STS mapping, end sequencing and FISH mapping. A by-product of physical mapping is the integration of genetic, physical and sequence maps as shown in Fig. 3.6.

3.2.3 Genome sequencing

The sequencing of DNA in laboratories first began in 1978. The first genome of a multicellular eukaryote, Caenorhabditis elegans, was published in 1998. The rationale behind genome sequencing includes
[Figure 3.6. Human chromosome 16 at successive levels of resolution. Cytogenetic map: site of hybridization with labelled probe, fragile sites FRA16D and FRA16B. Somatic cell hybridization map (from cultured human–mouse hybrid cells): breakpoints CY2 to CY180, with the region of interest between breakpoints CY8 and CY7. Genetic linkage map: markers including D16S40, D16S48, D16S60, D16S85, 16AC6.5, D16S144, D16S149, D16S150, D16S159 and D16S160, with the region of interest between genetic markers 16AC6.5 and D16S150. The region of interest can be localized either on the physical map (somatic cell hybrid map) or on the genetic map. YAC clone containing the region of interest. BAC and/or PAC contigs: BAC or PAC clones containing the region of interest. Each line on the contig represents a sequence-tagged site (STS), a unique DNA sequence that can be amplified by PCR; presence of an STS in a clone indicates where the insert originated from in the chromosome.]

Fig. 3.6. Example of physical mapping and integration of genetic, cytological and physical maps.
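
The STS rule used in physical mapping, that clones sharing an STS must overlap, can be sketched as a simple grouping procedure. The Python example below merges clones into contigs with a union-find structure; the clone and STS names are hypothetical, and real contig assembly also has to order the clones and resolve conflicting data, which this sketch does not attempt.

```python
from collections import defaultdict


def build_contigs(clone_sts):
    """Group clones into contigs, treating clones that share any STS as overlapping."""
    parent = {clone: clone for clone in clone_sts}

    def find(clone):
        # Follow parent links to the group representative, compressing the path.
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]
            clone = parent[clone]
        return clone

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Index clones by the STS markers they contain.
    sts_to_clones = defaultdict(list)
    for clone, markers in clone_sts.items():
        for sts in markers:
            sts_to_clones[sts].append(clone)

    # Every clone carrying the same STS joins the same contig.
    for clones in sts_to_clones.values():
        for other in clones[1:]:
            union(clones[0], other)

    groups = defaultdict(set)
    for clone in clone_sts:
        groups[find(clone)].add(clone)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical PCR screening results: which STS markers amplify from which clone.
clone_sts = {
    "BAC_A": {"sts1", "sts2"},
    "BAC_B": {"sts2", "sts3"},
    "BAC_C": {"sts3"},
    "BAC_D": {"sts9"},
    "BAC_E": set(),
}
contigs = build_contigs(clone_sts)
```

In this toy data set BAC_A, BAC_B and BAC_C form one contig through the shared markers sts2 and sts3, while BAC_D and BAC_E remain singletons.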

identification of all the genes in the sequenced genome, elucidation of the functions and the interactions of genes in the genome, functional analysis of orthologues in related complex genomes, evolutionary analysis of genes or genomes and product development and commercial application. As the next-generation sequencing technologies have continued to facilitate genome sequencing, new applications and new assay concepts (e.g. Huang et al., 2009) have emerged that are vastly increasing our ability to understand genome function, including sequence census methods for functional genomics (Wold and Myers, 2008; Varshney et al., 2009).

Technical developments in DNA sequencing

There are three major milestones in DNA sequencing: (i) the invention of sequencing reactions; (ii) automated fluorescent DNA sequencers; and (iii) PCR. Until the late 1970s, obtaining the DNA sequences of even five to ten nucleotides was difficult and very laborious. The development of two new methods in 1977, that of Maxam and Gilbert (chemical sequencing method) and the other by Sanger and Coulson (enzymatic sequencing), made it possible to sequence large DNA molecules. Later refinements of Sanger's chain termination method made it the preferred procedure since it has proven to be technically simpler.

The modified Sanger sequencing method or chain terminator procedure capitalizes on two properties of DNA polymerases: (i) their ability to synthesize faithfully a complementary copy of a single-stranded DNA template; and (ii) their ability to use 2',3'-dideoxynucleotides as substrates. Once the analogue is incorporated at the growing
Omics and Arrays 75

point of the DNA chain, the 3' end lacks a hydroxyl group and is no longer a substrate for chain elongation. Thus, the dideoxynucleotides act as chain terminators.

The development of labelling and detection techniques has contributed to an acceleration of sequencing procedures, which include 33P labelled primer (1970s); 33P or 35S labelled primer with sharper image and lower radiation (early 1980s); and fluorescently labelled primers and dyes in four different reactions (1986). DNA sequencing became automated in the late 1980s when the primer used for each reaction was labelled with a differently coloured fluorescent tag. This technology allowed thousands of nucleotides to be sequenced in a few hours and the sequencing of large genomes then became a reality. With ABI PRISM technology, up to four different dyes can be used to label DNA, each of which can be differentiated when run together in the same lane of a gel or injected into a capillary. For DNA sequencing, this means that the four different dyes representing each of the DNA bases (A, C, G and T) can be electrophoresed together.

The improvement of polyacrylamide gel electrophoresis (in the late 1980s and early 1990s) led to high resolution, thinner gels and a sharper image. Capillary electrophoresis (CE) (1998) offers a number of performance advantages such as faster runs, small sample volumes and the ability to eliminate manual gel pouring and sample loading tasks. Walk-away automation reduces instrument-associated labour time by more than 80% over slab-gel systems. The introduction of CE resulted in the availability of automated electrophoresis instruments with much lower cost per sample (Amersham's MegaBACE and Applied Biosystems ABI3700, 3730, etc.). High-throughput sequencing can also incorporate full automation in colony picking, 96-well plasmid isolation and purification, PCR reactions, sample loading and sequence data analysis.

The new generation of high-throughput sequencing technologies promises to transform the scientific enterprise, potentially supplanting array-based technologies and opening up many new possibilities (Kahvejian et al., 2008; Shendure and Ji, 2008). There are three commercial next-generation DNA sequencing systems available (Schuster, 2008) which promise vastly more sequencing capability (> 1 Gb of sequence per run) than standard capillary-based technology can produce. A high-throughput DNA sequencing technique using a novel massively parallel sequencing-by-synthesis approach called pyrosequencing was developed more recently by 454 Life Sciences (Margulies et al., 2005; www.454.com). 454 Sequencing employs clonal DNA fragment amplification on beads in droplets of an aqueous-oil emulsion, followed by loading the beads into nanoscale (44 µm) wells of a PicoTiterPlate which is a fibre optic chip. In each reaction cycle, one of the four deoxynucleotide triphosphates (dNTPs) is delivered to the reactor along with DNA polymerase, ATP sulfurylase and luciferase. Incorporation, which is accompanied by a chemiluminescent signal, is detected by a high-resolution charge-coupled device (CCD) sensor. 454 Sequencing is capable of sequencing roughly 100 Mb of raw DNA sequence per 7-h run with their 2007 sequencing machine, the GS FLX Genome Analyzer.

454 Sequencing allows large amounts of DNA to be sequenced at low cost compared to the Sanger chain-termination methods; G-C rich content is not as much of a problem, and the lack of reliance on cloning means that unclonable segments are not skipped; it is also capable of detecting mutations in an amplicon pool at a low sensitivity level. However, each read of the 2005 sequencing machine GS20 is only 100 bp long, resulting in some problems when dealing with highly repetitive genomes, as repetitive regions of over 100 bp cannot be bridged and thus must be left as separate contigs. Also, the nature of the technology lends itself to problems with long homopolymer runs. As one of the projects using 454 sequencing, Project Jim determined the first sequence of an individual, the complete genome sequence of James Dewey Watson, in May 2007.
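
The flow-by-flow logic of pyrosequencing described above, and the reason long homopolymer runs are troublesome, can be made concrete with a small sketch. It is an idealized model rather than the 454 base-calling software: the nucleotide flow order "TACG", the error-free signal (one unit of light per incorporated base) and the 7% signal attenuation used in the example are all assumptions made for illustration.

```python
def flowgram(read, flow_order="TACG"):
    """Ideal flow signals for a read: (nucleotide flowed, number of bases incorporated)."""
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    signals, pos = [], 0
    while pos < len(read):
        for base in flow_order:
            run = 0
            # A whole homopolymer run is incorporated during a single flow.
            while pos < len(read) and read[pos] == base:
                run += 1
                pos += 1
            signals.append((base, run))
            if pos == len(read):
                break
    return signals


def call_bases(signals):
    """Convert signal intensities back to bases by rounding to the nearest integer."""
    return "".join(base * round(intensity) for base, intensity in signals)


read = "TTAGGGGGGGGC"  # contains a run of eight G residues
ideal = flowgram(read)
print(ideal)
print(call_bases(ideal))

# Assumed 7% loss of signal: short runs still round correctly, the long run does not.
attenuated = [(base, 0.93 * run) for base, run in ideal]
print(call_bases(attenuated))
```

With the ideal signal the read is recovered exactly. With the attenuated signal the run of eight G residues is called as seven, because the absolute error grows with run length while the calling threshold stays at half a unit; single bases and short runs are unaffected.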

The second high-throughput sequencing technique is Solexa (Illumina, Inc.; http://www.illumina.com) which depends on sequencing by synthesis. Diluted DNA templates are attached to a solid planar surface and then amplified clonally. Sequencing is performed by delivering a mixture of four differentially labelled reversible chain terminators along with DNA polymerase. The resulting signal is detected at each cycle and a new cycle can be initiated after terminator removal (Bennet et al., 2005). Current average read lengths are about 30-40 bases with 1 Gb per run.

The third high-throughput sequencing technique is the SOLiD System, which enables massively parallel sequencing of clonally-amplified DNA fragments linked to beads. The SOLiD sequencing methodology is based on sequential ligation with dye-labelled oligonucleotides. Throughput approaches 20 Gb per run. The flexibility of two independent flow cells, each capable of running 1, 4 or 8 samples, allows multiple experiments to be conducted in a single run. With an overall accuracy greater than 99.9%, the SOLiD System enables large-scale sequencing and tag-based experiments to be completed more cost effectively than previously possible.

There are several emerging sequencing methods: sequencing by hybridization; mass spectrometric techniques; direct visualization of single DNA molecules by atomic force microscopy; single-molecule sequencing strategies. The intense drive towards developing technology that can sequence a complete human genome for under US$1000 will ensure that the speed and cost of sequencing will continue to improve rapidly (Schuster, 2008). For example, a nanopore-based device provides single-molecule detection and analytical capabilities that are achieved by electrophoretically driving molecules in solution through a nano-scale pore. Further research and development to overcome current challenges to nanopore identification of each successive nucleotide in a DNA strand offers the prospect of third generation instruments that will sequence a diploid mammalian genome for US$1000 in 24 h (Branton et al., 2008).

Sequencing strategies

There are two general genome sequencing strategies: (i) clone-by-clone or hierarchical sequencing (International Human Genome Sequencing Consortium, 2001); and (ii) whole-genome shotgun sequencing (Venter et al., 2001). After constructing the complete physical map, clone-by-clone sequencing can be started in any specific region. The clone-by-clone or hierarchical sequencing strategy has the following advantages: (i) the ability to fill gaps and re-sequence the uncertain regions; (ii) the ability to distribute the clones to other laboratories; and (iii) the ability to check the produced sequence by restriction enzymes. The main disadvantages are that the construction of a physical map is expensive and time consuming and that experienced personnel are required.

The shotgun sequencing strategy consists of making small insert libraries (1-10 kb) from the genomic DNA of an organism, sequencing a large number of clones (six to eight times redundancy) and assembling contigs using bioinformatics software. It requires no physical map construction and carries less risk of recombinant clones. It is cost effective and fast and ideal for small genome sequencing. However, it is difficult to fill gaps and re-track all the sequenced plasmids and the resulting data are less useful for positional cloning. Figure 3.7 compares the two sequencing methods.

COMBINING CLONE-BY-CLONE AND SHOTGUN SEQUENCING STRATEGIES. In 1997 The Institute for Genomic Research (TIGR) launched the initiative of a whole-genome shotgun approach for the human genome. But BACs, BAC end sequences and STS markers were used extensively in assembling the sequencing data from shotgun clones. The first draft of the human genome was completed within 3 years compared with the 12 years taken by the Human Genome Project, which was funded by government agencies.
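
The "six to eight times redundancy" quoted above for shotgun sequencing can be related to the number of reads required and to the share of the genome expected to be covered. The sketch below assumes the idealized Lander-Waterman (Poisson) model, which is a standard approximation and not something stated in the text: reads are taken to fall uniformly at random, with no repeats or cloning bias, so the expected fraction of bases covered at redundancy c is 1 - exp(-c). The genome size and read length in the example are illustrative values only.

```python
import math


def reads_needed(genome_bp, read_bp, redundancy):
    """Number of reads giving the requested average redundancy (fold coverage)."""
    return math.ceil(redundancy * genome_bp / read_bp)


def expected_fraction_covered(redundancy):
    """Expected fraction of bases hit by at least one read (Poisson assumption)."""
    return 1.0 - math.exp(-redundancy)


genome_bp = 125_000_000  # illustrative genome size
read_bp = 500            # illustrative read length

for c in (1, 6, 8):
    n = reads_needed(genome_bp, read_bp, c)
    covered = expected_fraction_covered(c)
    print(f"{c}x: {n:,} reads, expected coverage {covered:.4%}")
```

Under these assumptions a single-fold redundancy leaves more than a third of the bases unsequenced, whereas sixfold to eightfold redundancy leaves well under 1% uncovered. Real genomes fall short of this ideal because repeats and cloning bias make read placement non-random, which is one reason gaps remain to be closed in the finishing stage.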

[Figure 3.7 (diagram not reproduced). Left panel, hierarchical sequencing: 1. construct large BAC or P1 clones; 2. align; 3. take subset of clones, fragment and sequence. Right panel, shotgun sequencing: chromosomal DNA; fragment and sequence whole genome. Both routes lead to: assemble contigs and bioinformatics analysis. Labels in the assembly diagram: U-unitigs, rock, stones, mates, 50 kb, scaffold, gap, STSs; link mapped scaffold to existing map.]

Fig. 3.7. Comparison of two sequencing strategies: assembly of a mapped scaffold. U-unitigs are
assembled into scaffolds using mate-pair information to bridge gaps between two U-unitigs, and by
linking unitigs to rock, which are less-well supported unitigs that nevertheless fit in place according
to at least two independent large insert mate pairs. Stones are single short contigs whose position is
supported by only a single read. Gaps are filled in the finishing stage by further site-directed sequencing.
Scaffolds are placed against existing genetic and physical maps by sequence tagged site (STS) matches
and against the cytological map by fluorescent in situ hybridization (FISH).
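
Contig assembly, the step shared by both strategies in Fig. 3.7, can be illustrated with a toy greedy merge of overlapping reads. This is a deliberately simplified sketch and not the unitig and scaffold procedure described in the caption: it assumes error-free reads from a single strand, uses exact suffix-prefix overlaps only and ignores mate-pair information. The read sequences and the minimum overlap of 3 bases are made up for the example.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if below min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge; gaps remain between contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads from one region plus one unrelated read (made-up data).
reads = ["ATGGCGTA", "GCGTACGT", "ACGTTAGC", "TTTTCCCC"]
print(greedy_assemble(reads))
```

The three overlapping reads collapse into the single contig ATGGCGTACGTTAGC, while the unrelated read stays separate, which mirrors the gap between contigs that finishing or mate-pair scaffolding would have to bridge. A greedy merge of this kind is easily misled by repeats longer than the reads, the problem noted earlier for short-read data.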

Genome filtering strategies

The extremely large size of many crop genomes makes it difficult to decode them using the standard methods of genome sequencing such as clone-by-clone and whole-genome shotgun. Determining their complete sequences is daunting and costly. In recent years two genome filtration strategies, methylation filtration (MF) (Rabinowicz et al., 1999) and C0t-based cloning and sequencing (CBCS; Peterson et al., 2002) or high C0t (HC; Yuan et al., 2003) have been suggested for selectively sequencing the gene space of large genomes. MF is based on the characteristics of plant genomes in which genes are largely hypomethylated but repeated sequences are highly methylated. Methylated DNA is cleaved when transferred into a Mcr+ E. coli strain and only hypomethylated DNA is recovered. CBCS/HC separates single- and low-copy sequences including most genes from the repeated sequences on the basis of their differential renaturation characteristics. Using the MF strategy, Bedell et al. (2005) sequenced 96% of the genes in sorghum with an average coverage of 65% across their length. This strategy filtered out repetitive elements during the sequencing of the genome of sorghum which reduced the amount of sorghum DNA to be sequenced by two-thirds, from 735 Mb to approximately 250 Mb. Both MF and HC have been used for efficient characterization of maize gene space (Palmer et al., 2003; Whitelaw et al., 2003). Using

high C0t and MF, Martienssen et al. (2004) generated up to twofold coverage of the gene space with less than one million sequencing reads and simulations using sequenced BAC clones predicted that 5× coverage of gene-rich regions, accompanied by less than 1× coverage of subclones from BAC contigs, will generate a high quality mapped sequence that meets the needs of geneticists while accommodating unusually high levels of structural polymorphism. Haberer et al. (2005) selected 100 random regions averaging 144 kb in size, representing about 0.6% of the genome, to define their content of genes and repeats for characterizing the structure and architecture of the maize genome. Combining CBCS with genome filtration can greatly reduce the cost while retaining the high coverage of genic regions. An alternative approach is the identification of gene-rich regions on a detailed physical map and sequencing large-insert clones from these regions.

Plant genomic sequences

The first complete plant genome to be sequenced was that of Arabidopsis. The sequenced regions cover 115.4 Mb of the 125-Mb genome and extend into centromeric regions. The evolution of Arabidopsis involved a whole genome duplication followed by subsequent gene loss and extensive local gene duplications. The genome contains 25,498 genes encoding proteins from 11,000 families (The Arabidopsis Genome Initiative, 2000). Arabidopsis contains many families of new proteins but also lacks several common protein families. The proportion of predicted Arabidopsis genes in different functional categories is provided in Fig. 3.8. The complete genome sequence provides the foundation for more comprehensive comparison of conserved processes in all eukaryotes, identifying a wide range of plant-specific gene functions and establishing rapid systematic methods of identifying genes for crop improvement (Varshney et al., 2009).
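
Several of the figures quoted for genome filtration and for the Arabidopsis sequence can be cross-checked with simple arithmetic. The sketch below uses only numbers that appear in the text (735 Mb reduced to about 250 Mb for sorghum; 100 maize regions of 144 kb representing about 0.6% of the genome; 115.4 Mb of the 125-Mb Arabidopsis genome; 25,498 genes). The variable names are illustrative and the derived quantities are rough averages, not published estimates.

```python
# Sorghum: share of the genome removed by methylation filtration.
sorghum_total_mb = 735
sorghum_filtered_mb = 250
sorghum_reduction = 1 - sorghum_filtered_mb / sorghum_total_mb

# Maize: genome size implied by 100 regions of 144 kb being about 0.6% of the genome.
maize_sampled_mb = 100 * 144 / 1000
maize_implied_genome_mb = maize_sampled_mb / 0.006

# Arabidopsis: fraction of the genome in the sequenced regions and average gene density.
arabidopsis_fraction = 115.4 / 125
genes_per_mb = 25_498 / 115.4
kb_per_gene = 1000 / genes_per_mb

print(f"sorghum DNA filtered out: {sorghum_reduction:.0%}")
print(f"maize genome implied by the sample: {maize_implied_genome_mb:.0f} Mb")
print(f"Arabidopsis genome in sequenced regions: {arabidopsis_fraction:.1%}")
print(f"Arabidopsis gene density: {genes_per_mb:.0f} genes per Mb ({kb_per_gene:.1f} kb per gene)")
```

The sorghum figures agree with the stated reduction of about two-thirds, the maize sample implies a genome of roughly 2400 Mb, and the Arabidopsis numbers correspond to about 92% of the genome sequenced with, on average, one gene every 4.5 kb or so.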

[Figure 3.8 (pie chart not reproduced). Segment values: metabolism 11%; energy 7%; cell growth 2%; transcription 3%; protein synthesis 27%; protein destination 12%; transport facilitators 4%; intracellular traffic 3%; cellular organization 5%; signal transduction 4%; elicitors 4%; cell defence 3%; not yet clear-cut 5%; unclassified 10%.]

Fig. 3.8. Proportion of predicted Arabidopsis genes in different functional categories.


Rice was the first crop to be fully sequenced because of its importance as one of the major cereals and also because of its small genome size, small number of chromosomes (n = 12), well characterized genetic and genomic resources and availability of a large number of DNA markers and a high density genetic linkage map. Two draft sequences were completed in 2002 (Goff et al., 2002; Yu et al., 2002) and a complete sequence was published in 2005 (IRGSP, 2005) which is available in the National Center for Biotechnology Information (NCBI) database.

Many sequencing projects for important crop species are currently ongoing. The US Department of Energy's Joint Genome Institute (JGI) is providing funding and technical assistance to decode the genomes of several major plants, including cassava (Manihot esculenta), cotton (Gossypium), foxtail millet (Setaria italica), sorghum, soybean and sweet orange (Citrus sinensis L.) (http://www.jgi.doe.gov/sequencing/). Other plants for which there are ongoing genome sequencing projects include Medicago truncatula (http://www.medicago.org/genome), Lotus japonicus (http://www.kazusa.or.jp), poplar, tomato (http://www.sgn.cornell.edu) and grapevine.

The International Wheat Genome Sequencing Consortium (IWGSC) has been formed to advance agricultural research for wheat production and utilization by developing DNA-based tools and resources that result from the complete sequencing of the expressed genome of common (hexaploid) bread wheat and to ensure that these tools and the sequences are available for all to use without restriction and without cost (Gill et al., 2004; http://www.wheatgenome.org/). A Global Musa Genomics Consortium (GMGC) is decoding the Musa genome (http://www.newscientist.com/article.ns?id-dn1037). A Global Cassava Partnership, an alliance of the world's leading cassava researchers and developers, has proposed that sequencing the cassava genome should be a priority (Fauquet and Tohme, 2004).

To sequence the maize genome, two consortia in the USA began a pilot study: one with Jo Messing (Rutgers University), Rod Wing (Arizona University), Ed Coe (University of Missouri), Mark Vaudin (Monsanto) and Steve Rousley (Cereon); the other included Jeff Bennetzen (Purdue University), Karel Schubert and Roger Beachy (Danforth Center), Cathy Whitelaw and John Quackenbush (TIGR) and Nathan Lakey (Orion). These two pioneer programmes have been extended by a massive US programme from the National Science Foundation (NSF), USDA and the Department of Energy (DOE) led by Rick Wilson (Washington University). The sequencing strategy is a hybrid between a BAC-by-BAC approach and a whole-genome shotgun.

3.2.4 cDNA sequencing

Why cDNA sequencing

Large-scale DNA sequencing can be carried out on genomic DNA or cDNAs. There are four advantages to performing cDNA sequencing. First is the cost of sequencing a whole genome. Although DNA sequencing costs have fallen more than 50-fold over the past decade, it still costs around US$10 million to sequence three billion base pairs. It will take years to realize the goal of lowering the cost of sequencing a mammalian-sized genome to US$100,000 and ultimately of cutting the cost of whole-genome sequencing to US$1000 or less.

Secondly, the interpretation of the genomic sequence of eukaryotes is not straightforward in contrast to prokaryotes: coding regions are separated by non-coding regions; introns and alternative splicing occur; one gene can lead to multiple mRNAs and gene products; a significant fraction of genomic DNA does not code for proteins (non-coding sequences).

Thirdly, cDNA sequencing helps in annotation and identification of exons and introns. Estimates of the number of human genes vary from 30,000 to 80,000. The accuracy of the Arabidopsis genome annotation varied from 50 to 70% in the first draft. Many Arabidopsis genes are still not accurately annotated.

Fourthly, sequencing cDNAs helps gain information about the transcriptome.

mRNA populations are variable among cells. The transcriptome is dynamic and constantly changing. Cells adapt to environmental, developmental and other signals by modulating their transcriptome. mRNA populations form an important level of regulation between signal perception and response. Genetically identical cells can exhibit distinct phenotypes. cDNA sequencing allows direct insight into mRNA populations and allows the dissection of the transcriptome which genomic sequencing alone does not provide. Sequencing of random cDNA clones prepared from different tissues also allows analyses of mRNA abundance.

cDNA libraries

When constructing a representative cDNA library, the source of the mRNA for the cDNA library is critical and will vary depending on the goal of the study. To estimate the diversity of mRNAs expressed in a given plant, the mRNA should represent most plant tissues and organs. On the other hand, to define the diversity of mRNAs represented in a specific tissue, organ or developmental stage, the library should be prepared from the most highly defined source feasible. As indicated by Nunberg et al. (1996), it is better to invest the time to harvest sufficient quantities of scarce tissue for a library rather than using materials which will contain a significant proportion of extraneous messages.

If large quantities of RNA are available, it is possible to create a plasmid library directly. This is particularly feasible since electroporation transformation efficiencies are so high. Plasmid libraries may or may not be directional and are easily arranged in an ordered array. Constructing plasmid libraries directly avoids any sequence bias, including internal deletion and trans recombination that may occur during the excision process.

The frequency of full-length cDNAs depends on the length of transcript (the longer the transcript the lower the frequency of obtaining full-length cDNAs). Carninci and Hayashizaki (1999) discussed the high efficiency of full-length cDNA cloning using a cap trapper method (biotinylated cap) and thermoactivation of reverse transcriptase (cDNA synthesis at 60°C: RNA secondary structures are melted). Some normalization and subtraction methods also allow enrichment of full-length cDNAs.

For a given mRNA, multiple expressed sequence tags (ESTs) can be obtained. Depending on the extent of sampling, ESTs may or may not overlap. EST processing is needed to remove vector and linker sequences, check the quality using a sequence quality filter, clean up the contaminants and chimaeric sequences and store the results in databases. To construct EST contigs, there are two commonly used programs: Phrap/consed and TIGR assembler. These programs generate a unigene set (contigs or Tentative Consensus): a consensus sequence for all overlapping ESTs that (supposedly) correspond to a single mRNA.

Several factors affect the quality of EST contigs: contaminating sequences, bad quality sequences, non-overlapping ESTs from the same mRNA, alternative splicing resulting in one gene with multiple mRNAs and closely related genes (chimaeric contigs). EST annotation can be carried out using similarity searches against GenBank and other databases, e.g. protein motif databases, to assign a putative function or identify functional categories. This process can be automated or manual (usually a combination of the two).

Non-random (normalized or subtracted) cDNA libraries are needed to overcome some of the problems with redundant ESTs and to saturate EST databases when the budget is limited or when there is a specific interest in a particular stage. Hybridization-based methods are most commonly used to decrease redundancy (reduce representation of abundant cDNAs and increase rare cDNAs). Normalized cDNA libraries are used when gene discovery is the main objective of the EST project.

cDNA sequencing

Strategies for cDNA sequencing include single-pass cDNA sequencing (ESTs),

normalized cDNA libraries, subtracted cDNAs and high-throughput full-length cDNA sequencing. Single-pass cDNA sequencing can be achieved using the following steps: (i) construct cDNA libraries; (ii) randomly pick clones for sequencing (from the 5' or 3' end using vector primers); (iii) process sequences (vector/linker removal, quality control, contaminants, empty, chimaeric); (iv) construct contigs (sequences from same transcript); (v) create a unigene set; and (vi) annotate sequences. The objective of high-throughput cDNA sequencing is to obtain the full finished sequence of as many cDNAs as possible. This is necessary for complex eukaryotic genomes (human, mouse, plants). Full-length cDNA sequencing is discussed in Chapter 11 along with its use in gene cloning.

Major limitations of the cDNA sequencing approach include: (i) high redundancy of some genes in cDNA libraries; (ii) difficulty in isolating rare transcripts or developmentally-regulated genes; and (iii) the fact that some genes are not stable in E. coli.

3.3 Functional Genomics

The use of whole genome information and high-throughput tools has opened up a new field of research called functional genomics. Among its subdisciplines, transcriptomics (the complete set of transcripts produced in a cell) (Zimmerli and Somerville, 2005), proteomics (the complete set of proteins produced in a cell) (Roberts, 2002) and metabolomics (the complete set of metabolites expressed in a cell) (Stitt and Fernie, 2003) have been used by the plant science community. Functional genomics refers to the development and application of global (genome-wide or system-wide) experimental approaches to assess gene function by making use of the information and reagents provided by structural genomics. It is characterized by high-throughput or large-scale experimental methodologies combined with statistical and computational analysis (bioinformatics) of the results. The new information provided by all the omics disciplines will lead the plant science community to in silico simulations of plant growth, development and response to environmental change.

3.3.1 Transcriptomics

The transcriptome is the set of all the mRNA molecules, or transcripts, produced in one cell or a population of cells. The term can be applied to the total set of transcripts in a given organism or to the specific subset of transcripts present in a particular cell type. Unlike the genome, which is roughly fixed for a given cell line (excluding mutations), the transcriptome can vary with external environmental conditions. Because it includes all mRNA transcripts in the cell, the transcriptome reflects the genes that are being actively expressed at any given time with the exception of mRNA degradation phenomena such as transcriptional attenuation. Transcriptomics is based on the idea that a catalogue of all the transcripts associated with a specific treatment or developmental stage provides a reasonable overview of the underlying biological processes at work. As we moved from northern blots to tiling arrays, we have advanced from a gene-by-gene world to a full genome universe. The study of transcriptomics often uses high-throughput techniques based on DNA microarray or chip technology. Suggested references for this section include Bernot (2004), Bourgault et al. (2005) and Busch and Lohmann (2007).

Gene expression profiling technologies provide a tool for analysing global gene expression by viewing activity of all or (more typically) a substantial part of the genome at a specific time of interest. There are open and closed architecture systems for gene expression profiling. In the open architecture, all genes expressed in a tissue have the possibility of being detected (e.g. cDNA-AFLP, differential display (dd) PCR, SAGE, cDNA subtraction). Advantages include the potential discovery of previously unknown genes, comprehensive coverage and the low requirements by way of equipment. Disadvantages include retrieving only a small part of the gene (since it can

be laborious to clone full-length cDNA) and simple gene identification that is limited by sequences that are already in a database (otherwise the corresponding gene must be cloned).

Several alternative technologies have emerged for measuring transcript abundance in a parallel fashion. Essentially, these methods can be divided into three categories according to their underlying principle, namely PCR-, sequencing- or hybridization-based technologies. Therefore, strategies that are currently available for analysis of transcriptomes include RT-PCR (qualitative and quantitative), hybridization methods (northern blots, macroarrays, DNA microarrays, oligonucleotide microarrays), cDNA fingerprinting (differential display, cDNA-AFLP), cDNA sequencing (full-length cDNAs, subtracted cDNAs, normalized cDNA libraries, SAGE, massively parallel signature sequencing (MPSS)) and combinations of the above techniques.

The most straightforward and unbiased method of analysing an RNA population is the sequencing of cDNA libraries and quantitative analysis of the resulting ESTs. Traditionally, ESTs with read lengths of about 200-900 nucleotides have been produced by Sanger sequencing but the associated costs have severely limited the resolution of this approach (Busch and Lohmann, 2007). Deep sequencing has become a viable alternative for unbiased large-scale expression profiling because of the development of new protocols and entirely new sequencing techniques. Non-gel-based sequencing techniques promise to deliver greatly increased throughput and a considerable cost reduction. MPSS combines in vitro cloning of millions of template tags on separate microbeads with ligation-mediated sequence detection. In each reaction cycle, a four-base overhang is produced on every tag to which a fluorescently labelled adaptor of defined sequence is ligated. The position and fluorescence of every microbead are monitored by a high resolution camera in each of the reaction cycles, allowing the sequences of the 17-nucleotide tags to be reconstructed (Brenner et al., 2000). As indicated by Busch and Lohmann (2007), the limited length of the sequenced tags precludes the use of MPSS for de novo sequencing but makes it a very powerful tool for expression profiling of organisms with pre-existing sequence information. By contrast, two other high-throughput sequencing techniques as described previously, 454 and Solexa, are ideally suited for expression-profiling purposes. Short tags are sufficient to identify a transcript unambiguously and therefore problems arising from assembling short tags into larger contigs can be ignored.

PCR product-based arrays were heavily used in the early days of global transcriptome analysis. However, the low level of standardization among laboratories, high levels of noise and experimental variation and cross-hybridization between homologous transcripts have eroded the attractiveness of these arrays. Oligonucleotide-based microarrays are now becoming the most popular technology for large-scale expression profiling because they allow the simultaneous detection of tens of thousands of transcripts at a reasonable cost. The expression level of any gene represented on the array can be deduced from the fluorescence intensity of the corresponding probe. However, microarrays only offer linear expression measurements over a range of three orders of magnitude compared to quantitative RT-PCR which has a dynamic range of five orders of magnitude. Microarrays perform with less precision and sensitivity than other techniques when used for measuring low abundance transcripts in particular and this is manifested in their greater inter-assay variability (Busch and Lohmann, 2007). Another major limitation of microarrays designed for expression analysis is that they rely on current genome annotations, which precludes the identification of novel or very small transcription units.

Microarrays and quantitative RT-PCR have dominated expression profiling to date but deep sequencing and whole-genome tiling arrays will become increasingly important because these techniques are not limited to the detection of known transcripts. Tiling arrays, on which the entire

genome is represented by evenly spaced probes, provide a novel means of transcript
identification. In Arabidopsis, tiling arrays have been used to map transcriptionally
active regions by profiling four different tissues (Yamada et al., 2003).

The interaction transcriptome is the sum of all microbe and host transcripts that
are produced during the interaction. A key challenge in studying interaction
transcriptomes is discriminating pathogen from host ESTs; approaches include
similarity searches to genome/cDNA sequences, GC analyses and determination of
hexamer frequency (windows of 6 bp). Systems genomics/transcriptomics can be used
to analyse complex transcriptomes, for example the mixtures of mRNAs from different
species (e.g. infected tissue, environmental samples such as soil or seawater, etc.).
One challenge is to identify the species of origin in the mixtures.

3.3.2 Proteomics

Proteomics is the study of the identification, function and regulation of complete
sets of proteins in a tissue, cell or subcellular compartment. Such information is
crucial to understanding how complex biological processes occur at a molecular
level and how they differ in various cell types, stages of development or
environmental conditions (Bourgault et al., 2005). Proteomics is important as
proteins are active agents in cells and they execute the biological functions
encoded by genes. Sequences of genes (or genomes) and transcriptome analyses are
not sufficient to elucidate biological functions. Proteomics complements
transcriptomics by providing information about the time and place of protein
synthesis and accumulation, as well as identifying those proteins and their
post-translational modifications. Gene expression does not necessarily indicate
whether a protein is synthesized, how fast it is turned over or which possible
protein isoforms are synthesized (Mathesius et al., 2003). In some cases, the
correlation between gene expression and protein presence is as low as 0.4, for
several reasons. First, the level of transcription of a gene gives only a rough
estimate of its level of expression into a protein. An mRNA produced in abundance
may be degraded rapidly or translated inefficiently, resulting in a small amount of
protein. Secondly, many proteins experience post-translational modifications that
profoundly affect their activities; for example, some proteins are not active until
they become phosphorylated. Methods such as phosphoproteomics and glycoproteomics
are used to study post-translational modifications. Thirdly, many transcripts give
rise to more than one protein through alternative splicing or post-translational
modifications. It is generally supposed that if genomes contain tens of thousands
of gene sequences, the proteome comprises several hundred thousand proteins as a
result of alternative splicing and post-translational modifications. Finally, many
proteins form complexes with other proteins or RNA molecules and only function in
the presence of these molecules.

Proteomics has become an important approach for investigating cellular processes
and network functions. Significant improvements have been made in technologies for
high-throughput proteomics, both at the level of data analysis software and mass
spectrometry (MS) hardware (Baginsky and Gruissem, 2006). In this section,
proteomics will be briefly discussed. For further details, readers are referred to
the following review articles: van Wijk (2001), Molloy and Witzmann (2002), de Hoog
and Mann (2004), Saravanan et al. (2004), Baginsky and Gruissem (2006), Cravatt
et al. (2007) and Zivy et al. (2007).

Protein extraction

Obtaining high quality protein is the first step in proteomic research. Extracting
protein from plant tissue requires tissue disruption by grinding and sonication,
separation of proteins from unwanted cell materials (cell wall, water, salt,
phenolics, nucleic acids) by centrifugation after precipitation of proteins with
acetone–trichloroacetic acid, resolubilizing protein in a solution that dissolves
the maximum number of different proteins and inactivation of protease
by acetone–trichloroacetic acid treatment or specific protease inhibitors.
Pre-fractionation of tissue is optional for the analysis of proteins from different
organelles or microsomal fractions. Solubilization requires urea or, for more
hydrophobic proteins, thiourea, as a chaotrope which solubilizes, denatures and
unfolds most proteins. Non-ionic or zwitterionic detergents, e.g.
3-[(3-cholamidopropyl)dimethylammonio]-1-propanesulfonate (CHAPS), Triton-X or
amidosulfobetaines, are used to solubilize and separate proteins in a mixture.
Sodium dodecyl sulphate (SDS) is also a strong detergent and is used to solubilize
membrane proteins. However, it imparts a negative charge to proteins and therefore
interferes with isoelectric focusing (Mathesius et al., 2003). Reducing agents
(usually dithiothreitol [DTT], 2-mercaptoethanol or tributylphosphine) are needed
to disrupt disulfide bonds.

Protein identification and quantification

N- or C-terminal sequencing has made protein identification possible on a small
scale, although with limitations. Improvements in MS have made it possible to
identify proteins faster, on a larger scale, using smaller amounts of protein. In
addition, post-translational modifications can be determined by MS/MS analysis and
proteins can be identified even when bound to other proteins in complexes. A
standard technique for protein identification with MALDI-TOF MS is peptide mass
fingerprinting. Protein spots in a gel can be visualized using a variety of
chemical stains or fluorescent markers. Proteins can often be quantified by the
intensity with which they stain. Once proteins have been separated and quantified,
they can be identified. Individual spots are cut out of the gel and cleaved into
peptides with proteolytic enzymes. These peptides can then be identified by MS,
specifically MALDI-TOF MS. The MALDI-TOF analysis will measure very precisely
(< 0.1 Da) the mass of peptides formed by this digestion. Since the cleavage sites
are known, the digestion can be simulated by informatics, that is, the masses of
all the peptides produced by this digestion can be calculated for all the proteins
of known sequence of a given organism (Zivy et al., 2007). These masses will depend
on the length of peptides and their composition since most amino acids have
different masses. Thus, masses predicted from sequences stored in databases can
simply be compared with the masses actually measured by the MALDI-TOF equipment.
The greater the number of positive mass matches, the more likely it is that the
peptides originate from the same protein, thus facilitating the rapid
identification of proteins.

Protein profiling

Protein mixtures of considerable complexity can now be routinely characterized in
some detail. One measure of technical progress is the number of proteins identified
in each study. Such numbers can now reach the thousands for suitably complex
samples. Large-scale proteomic studies are needed to solve three types of
biological problem (Aebersold and Mann, 2003): (i) the generation of
protein–protein linkage maps; (ii) the use of protein identification technology to
annotate and, if necessary, correct genomic DNA sequences; and (iii) the use of
quantitative methods to analyse protein expression profiles as a function of the
cellular state as an aid to inferring cellular function.

The sequences of many mature proteins in higher eukaryotes after processing and
splicing are often not directly apparent from their cognate DNA sequences. Peptide
sequence data of sufficient quality provides unambiguous evidence of translation of
a particular gene and can, in principle, differentiate between alternatively
spliced or translated forms of a protein (Aebersold and Mann, 2003). Thus, it might
be tempting to systematically analyse the proteins expressed by a cell or tissue,
that is, to generate comprehensive proteome maps.

The more common and versatile use of large-scale MS-based proteomics has been to
document the expression of proteins as a function of cell or tissue state.
Aebersold and Mann (2003) argued that to be meaningful, such data must be at least
semi-quantitative and that a simple list of proteins detected in the different
states is insufficient. This is because analyses of complex mixtures are often not
comprehensive and therefore the non-appearance of a particular sequence in the list
of identified peptides does not indicate that the peptide or protein was not
originally present in the sample. Additionally, it is often impossible to prepare a
certain cell type, cell fraction or tissue in completely pure form without trace
contamination from other fractions. And because the ion current of a peptide is
dependent on a multitude of variables that are difficult to control, this measure
is not a good indicator of peptide abundance. If stable-isotope dilution has not
been used, a rough relative estimate of the quantity of a protein can be obtained
by integrating the ion current of its peptide-mass peaks over their elution time
and comparing these extracted ion currents between states, provided that highly
accurate and reproducible methods are used. Increasingly, stable-isotope dilution
and LC-MS/MS are used to accurately detect changes in quantitative protein profiles
and to infer biological function from the observed patterns (Aebersold and Mann,
2003).
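
The rough relative estimate discussed in this section, integrating a peptide's ion
current over its elution time and comparing the extracted ion currents between
states, can be sketched in a few lines. This is a minimal illustration only, not a
validated quantification workflow: the function names and the elution profiles are
invented for the example, and trapezoidal integration is simply one convenient
choice of integration rule.

```python
def integrate_xic(times, intensities):
    """Trapezoidal area under an extracted ion chromatogram (XIC)."""
    if len(times) != len(intensities) or len(times) < 2:
        raise ValueError("need at least two matching time/intensity points")
    area = 0.0
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        area += 0.5 * (intensities[i] + intensities[i - 1]) * dt
    return area


def relative_abundance(xic_a, xic_b):
    """Ratio of XIC areas for the same peptide measured in states A and B."""
    return integrate_xic(*xic_a) / integrate_xic(*xic_b)


# Hypothetical elution profiles: time in seconds, ion current in arbitrary units.
control = ([0, 10, 20, 30, 40], [0, 50, 100, 50, 0])
treated = ([0, 10, 20, 30, 40], [0, 100, 200, 100, 0])

ratio = relative_abundance(treated, control)
print(f"treated/control ratio: {ratio:.2f}")
```

As the text cautions, such a ratio is only meaningful when both runs are acquired
with highly reproducible methods; the sketch does nothing to correct for run-to-run
variation.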
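
Peptide mass fingerprinting, as described under protein identification, rests on
two steps: simulating the digestion from known cleavage sites, then counting how
many measured masses match the predicted ones. The sketch below illustrates both
under stated assumptions: trypsin is assumed as the protease (cleavage after K or
R unless followed by P, no missed cleavages), the two database sequences and the
measured masses are invented, and the measured values are taken to be neutral
monoisotopic masses already. It requires Python 3.7 or later for splitting on
zero-width matches.

```python
import re

# Monoisotopic residue masses in Da; one water is added per peptide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def tryptic_digest(sequence):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def count_matches(measured, sequence, tol=0.1):
    """Number of measured masses within tol Da of any predicted peptide mass."""
    predicted = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(any(abs(m - p) <= tol for p in predicted) for m in measured)


def best_match(measured, database, tol=0.1):
    """Score every database entry by its number of mass matches."""
    scores = {name: count_matches(measured, seq, tol)
              for name, seq in database.items()}
    return max(scores, key=scores.get), scores


# Hypothetical two-entry database and hypothetical measured masses.
database = {"protA": "GASKVTLR", "protB": "GGKAAR"}
measured = [361.20, 487.31]
winner, scores = best_match(measured, database)
print(winner, scores)
```

The 0.1 Da default tolerance mirrors the measurement precision quoted in the text.
Real search engines use statistical scoring rather than a raw match count, so the
ranking here should be read as an illustration of the principle only.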

Protein–protein interactions

Protein–protein interactions occur among most proteins and there are six types of
interfaces found in protein–protein interactions: domain–domain, intra-domain,
hetero-oligomer, hetero-complex, homo-oligomer and homo-complex. The analysis of
protein–protein interactions can be either qualitative or quantitative. Traditional
biochemical methods such as co-purification and co-immunoprecipitation have been
used to identify the members of protein complexes. Proteomics-based strategies have
been used to determine the composition of complexes and to establish interaction
networks. The systematic, large-scale, high-throughput approaches now being taken
to build maps of the interactions between proteins predicted by genome sequence
information have become known as interactomics (Causier et al., 2005).

There are many important characteristics of a protein–protein interaction.
Obviously, it is important to know which proteins are interacting. In many
experiments and computational studies, the focus is on interactions between two
different proteins. However, one protein can interact with other copies of itself
(oligomerization) or with three or more different proteins. The stoichiometry of
the interaction is also important, that is, how many of each protein involved are
present in a given reaction. Some protein interactions are stronger than others
because they bind together more tightly. The strength of binding is known as
affinity. Proteins will only bind to each other spontaneously if it is
energetically favourable. Energy changes during binding are another important
aspect of protein interactions. Many of the computational tools that predict
interactions are based on the energy of interactions.

Protein interaction maps represent essential components of the post-genomic tool
kits needed for understanding biological processes at a systems level. Over the
past decade, a wide variety of methods have been developed to detect, analyse and
quantify protein interactions, including surface plasmon resonance spectroscopy,
nuclear magnetic resonance (NMR), Y2H screens, peptide tagging combined with MS and
fluorescence-based technologies. Lalonde et al. (2008) and Miernyk and Thelen
(2008) reviewed the latest techniques and current limitations of biochemical,
molecular and cellular approaches for the detection of protein–protein
interactions. In vitro biochemical strategies for identifying and characterizing
interacting proteins include co-immunoprecipitation, blue native gel
electrophoresis, in vitro binding assays, protein cross-linking and rate-zonal
centrifugation. Fluorescence techniques range from co-localization of tags, which
may be limited by the optical resolution of the microscope, to fluorescence
resonance energy transfer (FRET)-based methods that have molecular resolution and
can also report on the dynamics and localization of the interactions within a cell.
Proteins interact via highly evolved complementary surfaces with affinities that
can vary over many orders of magnitude. Some of the techniques, such as surface
plasmon resonance, provide detailed information regarding the physical properties
of these interactions. To analyse protein complexes systematically at a sub- or
full-genome level, several methods have been adapted for high-throughput screens
using robotics: (i) Y2H systems; (ii) the mating-based split-ubiquitin system
(mbSUS); and (iii) affinity purification of protein complexes followed by
identification of proteins by MS (AP-MS).

One of the first questions usually asked about a new protein, apart from where it
is expressed, is: to what proteins does it bind? To study this question by MS, the
protein itself is used as an affinity reagent to isolate its binding partners.
Compared with two-hybrid and array-based approaches, this strategy has the
advantages that the fully processed and modified protein can serve as the bait,
that the interactions take place in the native environment and cellular location
and that multi-component complexes can be isolated and analysed in a single
operation (Ashman et al., 2001). However, because many biologically relevant
interactions are of low affinity, transient and generally dependent on the specific
cellular environment in which they occur, MS-based methods in a straightforward
affinity experiment will detect only a subset of the protein interactions that
actually occur (Aebersold and Mann, 2003). Bioinformatics methods, correlation of
MS data with those obtained by other methods or iterative MS measurements, possibly
in conjunction with chemical crosslinking (Rappsilber et al., 2000), can often help
to further elucidate direct interactions and the overall topology of multiprotein
complexes.

The ability of quantitative MS to detect specific complex components within a
background of non-specifically associated proteins increases the tolerance for high
background and allows for fewer purification steps and less stringent washing
conditions, thus increasing the chance of finding transient and weak interactions.
The same methods can be used to study the interaction of proteins with nucleic
acids, small molecules and in fact with any other substrate. For example, drugs can
be used as affinity baits in the same way as proteins to define their cellular
targets and small molecules such as cofactors can be used to isolate interesting
sub-proteomes (MacDonald et al., 2002).

The Y2H system has become one of the standard laboratory techniques for the
detection and characterization of protein–protein interactions. It can be used to
map individual amino acid residues involved in a specific protein–protein
interaction. It can also be used to identify novel interactions from complex
libraries of expressed proteins. The Y2H system has been widely used for
determination of protein interaction networks within different organisms. In
plants, the Y2H system has been successfully applied to detect interactions with
phytochromes, cryptochromes, transcription factors, proteins involved in
self-incompatibility mechanisms, the circadian clock and plant disease resistance
(Causier et al., 2005). Given the recent progress made in the development of
large-scale Y2H screening procedures, the time is now ripe for large-scale Y2H
screens to be applied to organisms such as Arabidopsis and rice.

Another potential method to detect protein–protein interactions involves the use of
FRET between fluorescent tags on interacting proteins. FRET is a non-radiative
process whereby energy from an excited donor fluorophore is transferred to an
acceptor fluorophore that is within 60 Å of the excited fluorophore (Wouters
et al., 2001). After excitation of the first fluorophore, FRET is detected either
by emission from the second fluorophore using appropriate filters or by alteration
of the fluorescence lifetime of the donor. Two fluorophores that are commonly used
are variants of GFP: cyan fluorescent protein (CFP) and yellow fluorescent protein
(YFP) (Tsien, 1998). The potential of FRET is considerable, for two reasons
(Phizicky et al., 2003). First, it can be used to make measurements in living
cells, which allows the detection of protein interactions at the location in the
cell where they normally occur in the presence of the normal cellular environment.
Secondly, transient interactions can be followed with high temporal resolution in
single cells. Protein interaction within the proteome might be mapped by performing
FRET screens on cell arrays that are co-transfected with cDNAs bearing CFP and YFP
fusion proteins.
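
The reason FRET only reports on fluorophores that are within a few tens of
ångströms of each other can be made concrete with the standard Förster relation,
E = 1 / (1 + (r/R0)^6), where R0 is the distance at which transfer is 50%
efficient. The relation is not stated in the text and is supplied here as
background; the value R0 = 50 Å is an assumed, illustrative figure rather than a
measured constant for any particular donor-acceptor pair.

```python
def fret_efficiency(r, r0):
    """Förster relation E = 1 / (1 + (r / R0)**6); r and R0 in the same units."""
    if r < 0 or r0 <= 0:
        raise ValueError("distances must be non-negative and R0 positive")
    return 1.0 / (1.0 + (r / r0) ** 6)


R0 = 50.0  # Å, assumed illustrative Förster radius

for distance in (25.0, 50.0, 60.0, 100.0):
    e = fret_efficiency(distance, R0)
    print(f"r = {distance:5.1f} Å  ->  E = {e:.3f}")
```

Because efficiency falls with the sixth power of distance, transfer is essentially
undetectable at twice R0, which is why a FRET signal is taken as evidence of close
molecular proximity rather than mere co-localization.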

In recent years there has been a strong focus on predicting protein interactions
computationally. Predicting the interactions can help scientists predict pathways
in the cell, potential drugs and antibiotics and protein functions. Proteins are
large molecules and binding between them often involves many atoms and a variety of
interaction types including hydrogen bonds, hydrophobic interactions, salt bridges
and more. Proteins are also dynamic, with many of their bonds able to stretch and
rotate. Therefore, predicting protein–protein interactions requires a good
knowledge of the chemistry and physics involved in the interactions.

The principle of using hybrid proteins to analyse interactions has been extended to
examine DNA–protein interactions, RNA–protein interactions, small molecule–protein
interactions and interactions dependent on bridging proteins or post-translational
modifications. In addition, the reconstitution of proteins other than transcription
factors, such as ubiquitin, has been used to establish reporter systems to detect
interactions (Fashena et al., 2000) and these may enable the analysis of proteins
not generally suitable for the traditional two-hybrid arrays, such as membrane
proteins.

In the future, quantitative methods based on stable-isotope labelling are likely to
revolutionize the study of stable or transient interactions and interactions
dependent on post-translational modifications. In such experiments, accurate
quantification by means of stable-isotope labelling is not used for protein
quantification per se; instead the stable-isotope ratios distinguish between the
protein composition of two or more protein complexes (Aebersold and Mann, 2003). In
the case of a sample containing a complex and a control sample containing only
contaminating proteins (for example, immunoprecipitation with an irrelevant
antibody or isolate from a cell devoid of affinity-tagged protein), the method can
distinguish between true complex components and non-specifically associated
proteins.

Post-translational modifications

Proteins are converted to their mature form via a complicated sequence of
post-translational protein processing and decoration events. Detection of
post-translational modifications is necessary, especially for phosphorylation or
ubiquitinylation, because they affect protein function. Phosphorylation can be
detected by the use of anti-phosphotyrosine antibodies on blots of 2DE gels or by
radiolabelling proteins and detecting the labelled proteins. Glycosylation of
proteins can easily be detected on gels using the periodic acid–Schiff reaction. In
addition, specific enzymes can be used for selective cleavage of several common
post-translational modifications (Mathesius et al., 2003).

Many of the post-translational modifications are regulatory and reversible, which
affects biological function through a multitude of mechanisms. MS methods to
determine the type and site of such modifications on single, purified proteins have
been undergoing refinements since the late 1980s. In this case, peptide mapping
with different enzymes is usually used to cover as much of the protein sequence as
possible. Protein modifications are then determined by examining the measured mass
and fragmentation spectra via manual or computer-assisted interpretation. For the
analysis of some types of PTMs, specific MS techniques have been developed that
scan the peptides derived from a protein for the presence of a particular
modification. The analysis of regulatory modifications, in particular protein
phosphorylation, is complicated by the frequently low stoichiometry, the size and
ionizability of peptides bearing the modifications and their fragmentation
behaviour in the mass spectrometer (Aebersold and Mann, 2003). Given the
difficulties of identifying all modifications even in a single protein, it is clear
that at present, scanning for proteome-wide modifications is not comprehensive. One
of the strategies used is essentially an extension of the approach used to analyse
protein mixtures. Instead of searching a database only for non-modified peptides,
the database search algorithm is instructed to also match potentially modified
peptides. To avoid a combinatorial explosion resulting from the need to consider
all possible modifications for all peptides in the database, the experiment is
usually divided into identification of a set of proteins on the basis of
non-modified peptides followed by searching only these proteins for modified
peptides (MacCoss et al., 2002). A more functionally oriented strategy focuses on
the search for one type of modification on all the proteins present in a sample.
Such techniques are usually based on some form of affinity selection that is
specific for the modification of interest and which is used to purify the
sub-proteome bearing this modification.
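
The two-stage strategy for limiting the combinatorial explosion (identify proteins
from non-modified peptides first, then search only those proteins for modified
peptides) can be sketched as follows. This is a toy illustration under stated
assumptions, not the published algorithm: matching is by peptide mass alone,
phosphorylation of S, T or Y is the only modification considered, and the database
entries and observed masses are invented.

```python
PHOSPHO = 79.96633  # Da, monoisotopic mass shift of one phosphate group (HPO3)


def matches(mass, candidates, tol=0.01):
    """True if any candidate mass lies within tol Da of the given mass."""
    return any(abs(mass - c) <= tol for c in candidates)


def two_stage_search(observed, database, tol=0.01):
    """database maps protein name -> {peptide sequence: unmodified mass}."""
    # Stage 1: keep proteins with at least one non-modified peptide match.
    identified = [
        protein for protein, peptides in database.items()
        if any(matches(m, peptides.values(), tol) for m in observed)
    ]
    # Stage 2: consider modified forms only for the identified proteins.
    hits = []
    for protein in identified:
        for peptide, mass in database[protein].items():
            n_sites = sum(peptide.count(aa) for aa in "STY")
            for k in range(1, n_sites + 1):
                if matches(mass + k * PHOSPHO, observed, tol):
                    hits.append((protein, peptide, k))
    return identified, hits


# Hypothetical database and observed neutral peptide masses.
database = {
    "protA": {"GASK": 361.19612, "VTLR": 487.31182},
    "protB": {"GGK": 260.14844, "AAR": 316.18589},
}
observed = [361.196, 567.278]

identified, hits = two_stage_search(observed, database)
print(identified, hits)
```

Only the peptides of proteins that survive stage 1 are expanded into modified
candidates, so the number of mass comparisons grows with the size of the identified
set rather than with the whole database.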

Many challenges remain in the large-scale mapping of post-translational
modifications but it is clear that MS-based proteomics can make a unique
contribution in this area. For example, systematic quantitative measurements of
post-translational modifications by stable-isotope labelling would be of
considerable biological interest. One of the future challenges in proteomics is to
increase sensitivity to visualize low abundance proteins (e.g. regulatory proteins)
as only 10% of proteins can now be visualized by 2DE. This also requires a high
quality database for matching sequence to MS data (or the use of MS/MS). Technical
developments are needed for understanding post-translational modifications, protein
complexes, protein localization and the interface with transcriptomics and
metabolomics.

3.3.3 Metabolomics

Plants contain a wide diversity of low-molecular-weight chemical constituents. More
than 100,000 secondary metabolites have been identified in plants and this probably
represents less than 10% of nature's total (Wink, 1988). Estimates of how many
metabolites occur in an individual species vary within the 5000–25,000 range
(Trethewey, 2005). Metabolites are the products of interrelated biochemical
pathways and changes in metabolic profiles can be regarded as the ultimate response
of biological systems to genetic or environmental changes (Fiehn, 2002). Plant
metabolism research has experienced a second golden age resulting from synergies
between genome-enabled technologies and classical biochemistry. The rapid rate at
which genomics data are being accumulated creates an increased need for robust
metabolomic technologies and rapid and accurate methods for identifying the
activities of enzymes (DellaPenna and Last, 2008).

The metabolome refers to the complete set of small-molecule metabolites (such as
metabolic intermediates, hormones and other signalling molecules and secondary
metabolites) that can be found within a biological sample such as a single
organism. Metabolomics is defined as the systematic survey of all the metabolites
present in a plant tissue, cell or cellular compartment under defined conditions
(Bourgault et al., 2005). The name metabolomics was coined in the 1990s (Oliver
et al., 1998). The foundations of metabolomics lie in the description of biological
pathways and current metabolomic databases, such as KEGG, are frequently based on
well-characterized biochemical pathways. Metabolomics might be considered to be the
key to integrated systems biology because it is frequently a direct gauge of a
desired phenotype (Fiehn, 2002), measuring quantitative and qualitative traits such
as starches in cereal grains or oils in oilseeds. Moreover, metabolomics can be
correlated with genetics through genomes, transcriptomes and proteomes and can
therefore bypass the more traditional quantitative trait loci (QTL) approach
applied to molecular plant breeding. Major recommended references for this section
include Fiehn (2002), Sumner et al. (2003), Weckwerth (2003), Bourgault et al.
(2005), Breitling et al. (2006), Schauer and Fernie (2006) and Krapp et al. (2007).

Targeted metabolomics involves examination of the effects of a genetic alteration or
change in environmental conditions on particular metabolites (Verdonk et al.,
2003). Sample preparation is focused on isolating and concentrating the compound of
interest to minimize detection interference from other components in the original
extract. Metabolite profiling refers to a qualitative and quantitative evaluation
of metabolite collections, for example those found in a particular pathway, tissue
or cellular compartment (Burns et al., 2003). Finally, metabolic fingerprinting
focuses on collecting and analysing data from crude extracts to classify whole
samples rather than separating individual metabolites (Johnson et al., 2003;
Weckwerth, 2003).

In stark contrast to transcriptomics and proteomics, metabolomics is mainly
species-independent, which means that it can be applied to widely diverse species
with relatively little time required for re-optimizing protocols for a new species.
Metabolite profiling can monitor variation in the accumulation of metabolites in
plant cells in culture which are ectopically expressing transcription factors, as a
hypothesis-generating tool to establish the possible pathways regulated by
particular regulatory proteins. The first step consists of generating a transgenic
cell line expressing the regulator from a constitutive or inducible promoter. The
second step is to subject extracts from transformed and control cells to various
metabolic profiling approaches to determine the qualitative and quantitative
differences in metabolite accumulation. A more practical approach than monitoring
and purifying individual metabolites is to profile hundreds or thousands of small
molecules biochemically and to screen for changes in the relative levels of these
compounds. By comparing two conditions, a profile of the differences can be
obtained that is then used as a blueprint to identify the individual compounds
affected (Dias et al., 2003). The immense chemical diversity of small biomolecules
makes comprehensive metabolome screens difficult. The lack of unifying principles
such as genetic codes that would assist molecule identification, comparison and
causal connection is another important challenge (Breitling et al., 2006).

The global study of the structure and dynamics of metabolic networks has been
hindered by a lack of techniques that identify metabolites and their biochemical
relationship in complex mixtures. Recent advances in ultra-high mass accuracy MS
provide two advantages that can enable ab initio determination of metabolic
networks: (i) the ability to identify molecular formulae based on exact masses; and
(ii) the inference of biosynthetic relationships between masses directly from the
mass spectrum. Mass spectrometers with the necessary performance parameters (mass
accuracy around 1 ppm and resolution above 100,000 m/Δm) are now within the reach
of many researchers and will change the way we think about metabolomics (Breitling
et al., 2006). The recent application of Fourier transform ion cyclotron resonance
MS (FTICR-MS) to metabolomic analysis suggests a way to tackle the problem. A
lower-cost alternative to high-field FTICR-MS, the Orbitrap mass analyser, promises
accelerated activity in this area. These two analysers are able to achieve high
resolution and mass accuracy in the 1-ppm range for biomolecular samples. In both
instruments, the ionized metabolite mixture is trapped in an orbital trajectory.
The frequency of the ions' orbit depends on their mass-to-charge ratio and can be
measured precisely, which is the basis of the exceptional accuracy. In FTICR-MS,
trapping is achieved in a strong magnetic field which exerts a force on the charged
particles that is perpendicular to their direction of motion and thus confines them
to a circular path. The Orbitrap traps ions without a magnetic field: ions are
trapped in a radial electrical field between a central and an outer cylindrical
electrode. Theoretically, the resolving power of the FTICR-MS and Orbitrap is
sufficiently high to resolve even the most complex metabolite mixtures using direct
infusion.

Gas chromatography (GC)-MS or LC-MS is the tool of choice for generating
high-throughput data for identification and quantification of
small-molecular-weight metabolites (Weckwerth, 2003). Capillary electrophoresis
(CE) is an alternative method which separates particular types of compound more
efficiently and can be coupled with MS or other types of detectors. NMR, infrared
(IR), ultraviolet (UV) and fluorescence spectroscopy can be used as alternative
means of detection, often in parallel with MS (Weckwerth, 2003). TOF MS technology
has also been used in metabolite analysis and provides a means of high sample
throughput. In the end, a combination of methods enables analysis of a broad range
of metabolites.
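
The first advantage of ultra-high mass accuracy noted above, identifying molecular
formulae from exact masses, can be illustrated by enumerating elemental
compositions that fall within a given ppm window of a measured mass. The sketch
below is a brute-force illustration only: it considers just C, H, N and O with
arbitrary upper bounds, applies no chemical plausibility rules, and uses an
invented measured mass (close to that of a hexose, C6H12O6).

```python
from itertools import product

# Monoisotopic atomic masses in Da.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def ppm_error(measured, theoretical):
    """Signed mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def formula_mass(c, h, n, o):
    """Monoisotopic mass of the composition C_c H_h N_n O_o."""
    return c * MONO["C"] + h * MONO["H"] + n * MONO["N"] + o * MONO["O"]


def candidate_formulae(measured, tol_ppm, max_counts=(12, 30, 5, 10)):
    """All (C, H, N, O) counts whose mass lies within tol_ppm of measured."""
    hits = []
    for c, h, n, o in product(*(range(m + 1) for m in max_counts)):
        mass = formula_mass(c, h, n, o)
        if mass > 0 and abs(ppm_error(measured, mass)) <= tol_ppm:
            hits.append((c, h, n, o))
    return hits


measured = 180.06339  # hypothetical neutral monoisotopic mass

tight = candidate_formulae(measured, tol_ppm=1)
loose = candidate_formulae(measured, tol_ppm=50)
print(f"{len(tight)} candidate(s) at 1 ppm, {len(loose)} at 50 ppm")
```

Every formula accepted at 1 ppm is also accepted at 50 ppm, so tightening the
tolerance can only shrink the candidate list; this is the sense in which mass
accuracy around 1 ppm makes formula assignment tractable.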
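
The statement that the orbital frequency depends on the mass-to-charge ratio can be
written out explicitly. The relations below are standard textbook physics supplied
as background, not taken from the sources cited in this section; the symbols (m
for ion mass, q for charge, v for speed, r for orbit radius, B for magnetic field
strength, k for the Orbitrap field curvature constant) are introduced here for
illustration.

```latex
\begin{align}
  % FTICR: the magnetic force supplies the centripetal force
  q v B &= \frac{m v^{2}}{r}
  \quad\Longrightarrow\quad
  \omega_c = \frac{v}{r} = \frac{qB}{m},
  \qquad
  f_c = \frac{\omega_c}{2\pi} = \frac{qB}{2\pi m} \\
  % Orbitrap: axial oscillation in the electrostatic field
  \omega_z &= \sqrt{k\,\frac{q}{m}}
\end{align}
```

As a rough worked example, a singly charged ion of m/z 500 in an assumed 7 T field
has a cyclotron frequency of about 215 kHz. In both analysers the measured
frequency is independent of the ion's velocity, which is why frequency measurement
translates so directly into mass accuracy.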

NMR is a spectroscopic technique that exploits the magnetic properties of the
atomic nucleus (Macomber, 1998). In NMR, the sample is immersed in a strong
external magnetic field and transitions between the nuclear magnetic energy levels
are induced by a suitably oriented radiofrequency field. In theory, any molecule
containing one atom with a non-zero nuclear spin (I) is potentially visible by NMR.
Considering the isotopes with a non-zero nuclear spin such as ¹H, ¹³C, ¹⁴N, ¹⁵N and
³¹P, all biological molecules have at least one NMR signal.

There is wide variation in the sensitivity of the experiment for different nuclei,
hence ¹H NMR remains the best choice for metabolite profiling by NMR, mainly due to
its natural abundance (99.98%) and sensitivity (Moing et al., 2007). The NMR
spectrum generally consists of a series of discrete lines (resonances) which are
characterized not only by the familiar spectroscopic quantities of frequency
(chemical shift), intensity and line shape, but also by relaxation times. Although
less sensitive than GC or LC-MS, proton NMR spectroscopy is a powerful
complementary technique for the identification and quantitative analysis of plant
metabolites either in vivo or in tissue extracts (Krishnan et al., 2005).
Typically, 20–40 metabolites have been identified in metabolite profiling of plant
extracts and the number of metabolites quantified can be increased with higher
field strength (increasing spectral resolution) and by using microprobes for small
quantity samples together with cryogenic probe heads (increasing sensitivity). One
of the main advantages of ¹H-NMR is that structural and quantitative information
can be obtained on numerous chemical species with a wide range of concentration in
a single NMR experiment with excellent reproducibility.

HPLC and GC are the most widely used analytical techniques for the separation of
small metabolites. GC is used to separate compounds on the basis of their relative
vapour pressure and affinities for the stationary phase in the chromatographic
column. It offers very high chromatographic resolution but requires chemical
derivatization for many biomolecules: only volatile chemicals can be analysed
without derivatization. Some large and polar metabolites cannot be analysed by GC.
GC tends to give much greater chromatographic resolution than HPLC but has the
disadvantage of being limited to compounds that are volatile and heat stable. A big
advantage of GC is that it can be easily combined with MS, which greatly increases
its utility for multi-component profiling because of its inherent high specificity,
high sensitivity and positive peak confirmation (Dias et al., 2003). HPLC is a form
of column chromatography used frequently in biochemistry and analytical chemistry.
It is used to separate components in a mixture by using a variety of chemical
interactions between the substance being analysed (analyte) and the chromatography
column. Compared to GC, HPLC has lower chromatographic resolution but it does have
the advantage that a much wider range of analytes can potentially be measured.

The generation of reproducible and meaningful metabolomic data requires great care
in the acquisition, storage, extraction and preparation of samples (Fiehn, 2002).
The true metabolic state of samples must be maintained and additional metabolic
activity or chemical modification after collection must be prevented. Depending on
the type of sample and the analysis performed, this can be achieved in various
ways. The most common strategies are freezing in liquid nitrogen, freeze-drying and
heat denaturation to halt enzymatic activity (Fiehn, 2002). Metabolomic experiments
are typically conducted by comparing experimental plants possessing an expected
metabolic modification (i.e. because of the introduction of a transgene or exposure
to a particular treatment) to control plants. Statistically
significant changes in metabolite levels attributable to perturbations affecting the
experimental plants are identified. Natural variability in metabolite levels occurs
as part of normal homeostasis in plants; thus a high number of replicates is
typically necessary to establish a statistically significant difference between
experimental and control plants, especially if the differences between metabolite
levels are subtle (Johnson et al., 2003). In order to validate metabolomic studies
and to facilitate data exchange, the Metabolomics Standards Initiative (MSI) has
released documents describing minimum parameters for reporting metabolomic
experiments. The reporting parameters encompassed by MSI include the biological
study design, sample preparation, data acquisition, data processing, data analysis
and interpretation relative to the biological hypotheses being evaluated. Fiehn
et al. (2008) exemplified how such metadata could be reported by using a small case
study: the metabolite profiling by GC-TOF mass spectrometry of Arabidopsis thaliana
leaves from a knock-out allele of the gene At1g08510 in the Wassilewskija ecotype.

The large data sets and multitude of metabolites require computer-based applications
to analyse complex metabolomic experiments. Ideally, such systems compile and
compare data from a variety of separation and detection systems (Sumner et al.,
2003). Ultimately, gene functions can be predicted or global metabolic profiles
associated with particular biological responses can be defined. Multivariate data
analysis techniques that reduce the complexity of data sets and enable more
simplified visualization of metabolomic results are currently available. These
include principal component analysis (PCA), hierarchical clustering analysis (HCA),
K-means clustering and self-organizing maps (Sumner et al., 2003).

Considering the natural variability in transcript, protein and metabolite levels in
plants of the same genotype, correlations within complex fluctuating biochemical
networks can be revealed using PCA and HCA (Weckwerth, 2003; Weckwerth et al.,
2004). Metabolic networks were integrated with gene expression and protein levels
using a novel extraction method whereby RNA, proteins and metabolites were all
extracted from a single sample (Weckwerth et al., 2004).

Parallel to the development of the technologies of metabolite profiling, there has
been a bewildering proliferation in the nomenclature associated with this field. At
the root of the problem is that some groups have chosen to use the term
'metabolomics' while others have opted for 'metabonomics'. Metabolomics will be used
in this book as it is derived from metabolic profiling or fingerprinting and should
be a parallel terminology to transcriptomics and proteomics (Trethewey, 2005). The
Human Metabolome Project led by Dr David Wishart of the University of Alberta,
Canada, completed the first draft of the human metabolome consisting of 2500
metabolites, 1200 drugs and 3500 food components (Wishart et al., 2007).

Schauer and Fernie (2006) assessed the contribution of metabolite profiling to
several fields of plant metabolomics. As a fast-growing technology, metabolite
profiling is useful for phenotyping and diagnostic analyses of plants. It is also
rapidly becoming a key tool in functional annotation of genes and in the
comprehensive understanding of the cellular response to biological conditions such
as various stresses of biotic and abiotic origin. Metabolomics approaches have
recently been used to assess the natural variation in metabolite content between
individual plants, an approach with great potential for the improvement of the
compositional quality of crops.

3.4 Phenomics

Phenomics is a field of study concerned with the characterization of phenotypes,
which are characteristics of organisms that arise via the interaction of the genome
with the environment. Genomics has spawned a plethora of related 'omics' terms that
frequently relate to established fields of research. Of these terms, phenomics, the
high-throughput analysis of phenotypes, has the greatest application in plant
breeding.
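
The multivariate step mentioned in the metabolomics discussion above (PCA applied to
metabolite profiles) can be illustrated with a minimal sketch. It is not taken from
the cited studies: the metabolite values, the sample labels and all function names
are hypothetical, and the leading component is found by plain power iteration on the
covariance matrix rather than by any particular statistics package. The same
reduction applies to any samples-by-traits table, so it carries over to
high-throughput phenotype data.

```python
import math
import random


def center_columns(rows):
    """Subtract each column (metabolite) mean so PCA describes variation around it."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance between the columns of an already centred matrix."""
    n = len(centered)
    p = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]


def first_principal_component(cov, iterations=500, seed=0):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in cov]
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    cov_v = [sum(c * x for c, x in zip(row, v)) for row in cov]
    eigenvalue = sum(a * b for a, b in zip(v, cov_v))
    return v, eigenvalue


# Hypothetical profiles: rows are plants, columns are three metabolite levels.
profiles = [
    [10.1, 5.2, 3.0],  # control plants
    [9.8, 5.0, 3.1],
    [10.3, 5.1, 2.9],
    [14.9, 7.6, 3.0],  # experimental plants
    [15.2, 7.4, 3.1],
    [15.0, 7.5, 2.9],
]
labels = ["control"] * 3 + ["experimental"] * 3

centered = center_columns(profiles)
cov = covariance_matrix(centered)
pc1, variance_pc1 = first_principal_component(cov)

# Score of each plant on the first component, and the share of variance it explains.
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
explained = variance_pc1 / sum(cov[i][i] for i in range(len(cov)))

for label, score in zip(labels, scores):
    print(f"{label:>12}  PC1 score = {score:+.2f}")
print(f"PC1 explains {explained:.1%} of the total variance")
```

In this toy data set the first two metabolites differ between the groups and the
third does not, so a single component separates control from experimental plants;
real profiles with hundreds of metabolites usually need several components and
proper scaling of each variable.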

3.4.1 Importance of phenotypes in genomics

For all sequenced organisms, from the most thoroughly studied and simple bacterial
cells to humans, only about two-thirds of all genes have an assigned biochemical
function and only a fraction of those are associated with a phenotype. Even when
phenotypes are assigned, they might represent only a partial understanding of the
role of the gene. The function of a gene cannot be fully understood until it is
possible to predict, describe and explain all the phenotypes that result from the
wild-type and mutant forms of that gene (Bochner, 2003).

Phenotypes often cannot be predicted on the basis of the biochemical function of a
gene alone because it is not clear how catalytic or regulatory activity will affect
the biology of the cell or the whole organism. However, if a gene has a biological
function then, for every identified gene, it should be possible to define at least
one phenotype. A second layer of genomic annotation could then follow in which every
gene is described biologically by the phenotypes that it produces. The first step is
to construct a so-called 'phenomic map' and in diploid and higher plants this will
be complicated by the fact that several genes can affect gene expression and the
resulting phenotypes of each other, leading to epistasis, complex traits and
multifactorial stress responses (Bochner, 2003).

Advances in genetic and genomic analysis are being hindered by the slow pace at
which biological (that is, phenotypic) information is being obtained, which is not
keeping pace with genomic information. Bochner (1989) predicted that global
phenotypic analysis would soon be needed to complement the massive amounts of
genetic data being obtained and Brown and Peters (1996) called attention to the
'phenotype gap' in mouse research. The Nobel laureate Sydney Brenner, in a keynote
address (at a joint Cold Spring Harbor Laboratory/Wellcome Trust Genome Informatics
Conference held at Hinxton in the UK on 9 September 2002) emphasized that approaches
that relied heavily on genome sequences and bioinformatics extrapolation were
associated with too much noise and were becoming non-productive. Instead, he called
for a renewed focus on cellular studies and the creation of function-based cell maps
in a variety of cell types by the year 2020.

However, generating phenotypic maps will not be easy. Scientists generally test and
measure phenotypes one at a time, which is too slow. Almost every model system in
which the genome has been sequenced has used functional genomics projects to
associate the genome with the biology and this typically includes some efforts that
involve phenomics. Many large-scale projects are being carried out by generally
using and adapting diverse existing phenotypic technologies that range from animal
autopsies to mass spectrometer analysis of cellular metabolites. A phenotype
microarray technology was devised that had several attributes (Bochner, 2003): (i)
it could assay about 2000 distinct culture traits; (ii) it could be used with a wide
range of microbial species and cell types; (iii) it would be amenable to
high-throughput studies and automation; (iv) it would allow phenotypes to be
recorded quantitatively to facilitate comparisons over time; (v) it would give a
comprehensive scan of the physiology of the cell; and (vi) by providing global
cellular analysis, it would provide a complement to genomic and proteomic studies.

3.4.2 Phenomics in plants

The great plasticity of plant genomes in producing various phenotypes from a small
amount of genetic variation has provided both challenges and opportunities for crop
improvement. Detailed and systematic analysis of phenotype requires both a data
repository and a means of structure interrogation. The field of phenomics developed
from the phenotypic characterization of mutant plants, the descriptions of which
have been published in volumes that frequently use structured ontological terms.
The storage of these data in searchable databases together with the application of
phenomics to high-throughput analysis, plant development and natural variation,
creates the final link in the chain from the genetics of crop development to crop
production (Edwards and Batley, 2004).

There is an additional need to make phenotypic data from different organisms
simultaneously searchable, visible and, most importantly, comparable (Lussier and
Li, 2004). As an example of attempts in this field, PhenomicDB has been created as a
multi-species genotype/phenotype database by merging public genotype/phenotype data
from a wide range of model organisms and Homo sapiens (Kahraman et al., 2005). To
provide systematic descriptions of phenotypic characteristics of gene deletion
mutants on a genome-wide scale, a public resource for mining, filtering and
visualizing phenotypic data, the PROPHECY database, was established. PROPHECY is
designed to allow easy and flexible access to physiologically relevant quantitative
data for the growth behaviour of mutant strains in the yeast deletion collection
during conditions of environmental challenges.

In plant biology, comparison of data collected by laboratories in which plants are
grown under slightly different conditions can be problematic. This is especially
true if the data are collected solely with reference to chronological age. Kjemtrup
et al. (2003) described the development of a plant phenotyping platform based on a
growth stage scale that will aid in the generation of coherent data. While their
emphasis is on Arabidopsis, the principles they describe can also be applied to
other plant systems. They adapted a modified version of the BBCH scale, which is
named after the consortium of agricultural companies that developed it (BASF, Bayer,
Ciba-Geigy and Hoechst), for high-throughput phenotyping of Arabidopsis to collect
data for both quantitative and qualitative traits spread over the developmental
timeline of the plant. In the first phase of the method, data are collected enabling
a series of landmark growth stages to be defined. The second phase involves the
collection of detailed data for additional traits that are of particular interest at
any one of these given stages. The growth stages are described as germination and
sprouting; leaf development (main shoot); formation of side shoots or tillering;
stem elongation or rosette growth and shoot development (main shoot); development of
harvestable vegetative plant parts; inflorescence emergence (main shoot) or ear or
panicle emergence; flowering on main shoot; development of fruit; ripening or
maturity of fruit and seed; and senescence and the beginning of dormancy.

Mutant analysis provides an alternative and typically more reliable means to assign
gene function. However, this phenotype-centric process, classically known as
'forward genetics', typically is not suitable for systematic genome-wide gene
analysis, primarily due to the enormous effort required to identify each gene
responsible for a particular phenotype. In spite of improvements in the cloning of
genes on the basis of phenotype (such as availability of whole-genome sequences,
large numbers of mapped polymorphisms and faster and cheaper genotyping
technologies), it can often take over a year for a skilled scientist to move from a
mutant to the affected gene. As indicated by Alonso and Ecker (2006), the
combination of classical forward genetics with recently developed genome-wide,
gene-indexed mutant collections is beginning to revolutionize the way in which gene
functions are studied in plants. High-throughput screens using these mutant
populations should provide a means to analyse plant gene functions, the 'phenome',
on a genomic scale.

3.5 Comparative Genomics

Comparative genomics has been used to address four major research areas (Schranz
et al., 2007). First, all comparative analyses are based on phylogenetic hypotheses.
In turn, genomics data can be used to construct more robust phylogenies. Secondly,
comparative genome sequencing has been crucial in identifying changes in genome
structure that are due to rearrangements, segmental duplications and polyploidy.
The alignment of multiple genomes can also be used to reconstruct an ancestral
genome. Thirdly, comparative genomics data have been used to annotate homologous
genes and subsequently to identify conserved cis-regulatory motifs. Having multiple
genomes of varying phylogenetic depths has proven very useful for detecting
conserved non-coding sequences. Fourthly, comparative genomics is used to understand
the evolution of novel traits.

Comparative genomics provides the potential for trait extrapolation from a species
where the genetic control is well understood and for which there are molecular
markers to a species about which there is a limited amount of information. For
example, rice is regarded as a model for cereal genomics because of its small
genome. The similarity of cereal genomes in general means that the genetic and
physical maps of rice can be used as reference points for exploration of the much
larger and more difficult genomes of the other major and minor cereal crops (Wilson
et al., 1999). Conversely, decades of breeding work and molecular analysis of maize,
wheat and barley can now find direct application in the improvement of rice.
Comparative genomics can also be used to locate desirable alleles in gene pools
close to the target crop so that transfer can be achieved by conventional methods
(Kresovich et al., 2002).

Across plant species, genome size does not correlate with number of genes or
biological complexity. Physical size of genomes across plant species varies greatly,
while genetic size of genomes is roughly equivalent. Large genomes usually have
large physical:genetic distance ratios. Also, the relationship between genes and the
number of gene families is not clear. In this section, comparative maps and
collinearity among related species and their implications will be discussed. Key
references recommended for an overview of comparative genomics include Shimamoto and
Kyozuka (2002), Ware and Stein (2003), Miller et al. (2004), Caicedo and Purugganan
(2005), Filipski and Kumar (2005), Koonin (2005), Xu et al. (2005), Schranz et al.
(2007) and Tang et al. (2008).

3.5.1 Comparative maps

A comparative map aligns two or more species-specific maps using common sets of
markers or sequences. It requires identification of regions of sequence similarity
in the genomes of different species or genera (i.e. typically, genes). Sequence
similarity can be identified due to common evolutionary origins. Gene repertoire and
gene order may be found conserved over larger chromosomal segments between closely
related species. The long-term goals of comparative genomics are to establish
relationships between map, sequence and functional genomic information across all
plant species and to facilitate taxonomic and phylogenetic studies in higher plants.

Importance of comparative maps

The objective of the development of a comparative map is to identify subsets of
genes that have remained relatively stable in both sequence and copy number since
the radiation of flowering plants from their last common ancestor. Why are
comparative maps so important? First, eukaryotic genomes are organized into
chromosomes and maps summarize genetic information using chromosomes as the
organizational principle. Secondly, conservation of gene identity and gene order
along the chromosomes determines potential for sexual reproduction; disruption leads
to speciation and major evolutionary change. Thirdly, species maps provide the
context for the study of inheritance and chart the history of genetic change.
Fourthly, comparative maps are the major tools for ferrying genetic information back
and forth across species and genera in a systematic fashion.

Once chromosomal duplications are identified in a genome and the timing of a
duplication/polyploidization event has been determined relative to angiosperm
divergence nodes, ancestral gene order within the duplicated segments can be
inferred. Map comparisons across divergent genera show greater conservation of
ancestral gene order and gene repertoire once genome-wide duplication/gene loss
within each genome
is accounted for. Map comparisons between closely related species are largely
unaffected because most duplications pre-date them. Comparative maps lay the
groundwork for asking questions about whether specific linkage blocks or gene
arrangements are statistically associated with increased fitness or have a
relationship between polyploidy and plant adaptation. For example, comparative
linkage mapping and chromosome painting in the close relatives of Arabidopsis have
inferred an ancestral karyotype of these species. In addition, comparative mapping
to Brassica has identified genomic blocks that have been maintained since the
divergence of the Arabidopsis and Brassica lineages (Schranz et al., 2007).

An example: Arabidopsis–tomato comparative map

DEVELOPMENT OF ARABIDOPSIS–TOMATO COMPARATIVE MAP TO DETECT MACROSYNTENY. Fulton
et al. (2002) identified over 1000 conserved orthologous sequences (COS) between
tomato and Arabidopsis by comparison of Arabidopsis genomic sequence with 130,000
tomato ESTs (representing 27,000 unigenes or approximately 50% of the tomato gene
content). For 1025 COS markers developed, 927 were screened against tomato DNA using
Southern analysis to classify them as single, low or multiple copy, among which 85%
were considered to be single or low copy (> 95% hybridization signal assigned to
three or fewer restriction fragments) and 50% matched a gene of unknown function
(Gene Ontology classification). A total of 550 COS markers was mapped on to the
tomato genome. The size of conserved segments was generally smaller than 10 cM.
Results indicated that multiple polyploidization events punctuate the evolution of
Arabidopsis and tomato. Distinguishing orthologues from paralogues is difficult due
to reciprocal loss of genes and chromosome segments following polyploidization
events.

PHYLOGENETIC ANALYSIS OF CHROMOSOMAL DUPLICATION EVENTS TO DETECT MICROSYNTENY. The
Arabidopsis genome sequence was used to analyse internal duplication events based on
inferred protein matches between 26,028 genes. A total of 34 non-overlapping
chromosomal segment pairs were identified consisting of 23,177 (89%) Arabidopsis
genes (Bowers et al., 2003b). To relate this alpha duplication to the angiosperm
family tree, all duplicated syntenic Arabidopsis gene pairs were compared to
individual genes from pine, rice, tomato, Medicago, cotton and Brassica. It was
determined whether inferred protein sequences were from duplicated syntenic gene
pairs. Arabidopsis genes were more similar to one another than to the heterologous
protein in another species.

RELATIVE AGE OF CHROMOSOMAL DUPLICATION EVENTS. It was concluded that the alpha
duplication event pre-dated divergence from Brassica about 14.5–20.4 million years
ago but post-dated divergence from cotton about 83–86 million years ago. About 50%
(49–64%) of Brassica sequences were more similar to one duplicated Arabidopsis
sequence than was the other Arabidopsis sequence to its paralogue. Only 6–19% of
cotton, rice, pine, etc. sequences clustered internally to the Arabidopsis syntenic
duplicates (Bowers et al., 2003b).

POLYPLOID ANCESTRY OF MOST PLANT SPECIES. As more data accumulates, the history of
angiosperms emerges as a history of genome-wide duplication followed by massive gene
loss (and return to diploidy). Only 30% of Arabidopsis genes have retained syntenic
copies in less than 86 million years since the alpha duplication. In contrast,
mammals appear to harbour fewer polyploidization events and less cycling of
duplicated genes; 70% of human and mouse proteins show conserved synteny after 100
million years of evolution.

3.5.2 Collinearity

Orthology and paralogy

Figure 3.9 shows the concepts of orthology and paralogy. Orthologues and paralogues
are two types of homologous sequence.
Orthology describes genes in different species that derive from a common ancestor.
Orthologous genes may or may not have the same function. Paralogy describes genes
that have duplicated (tandemly or moved to a new location) within a genome since
they descended from a common ancestral gene. The word synteny (from the Greek syn,
together, and taenie, ribbon) refers to linkage of genes along a chromosome; it is
currently used to indicate conservation of gene order across species. From this
definition, macrosynteny means conservation of gene order across species detected at
low resolution (i.e. genetic maps) while microsynteny means conservation of gene
order across species analysed by high resolution (i.e. physical or sequence-based
maps).

Fig. 3.9. The concepts of orthology and paralogy (from
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html). The diagram shows
an early globin gene giving rise by gene duplication to an α-chain gene and a
β-chain gene; the frog, chick and mouse α genes are orthologues of one another, as
are the mouse, chick and frog β genes, the α and β genes are paralogues, and all are
homologues.

Macrocollinearity

Significant genomic collinearity in plants has been shown by comparative genetic
mapping and genome sequencing, although plant genomes vary greatly in genome size
and chromosome number and morphology. Comparative mapping of cereal genomes using
low copy number, cross-hybridizing genetic markers has provided compelling evidence
for a high level of conservation of gene order across regions spanning many
megabases (i.e. macrocollinearity). Initial studies of the organization of grass
genomes indicated that individual rice chromosomes were highly collinear with those
of several other grass species and extensive work has shown a remarkable
conservation of large segments of linkage groups within rice, maize, sorghum,
barley, wheat, rye, sugarcane and other agriculturally important grasses (e.g. Ahn
and Tanksley, 1993; Kurata et al., 1994; van Deynze et al., 1995a; Wilson et al.,
1999). These studies led to the prediction that grasses could be studied as a single
syntenic genome. The macrocollinearity was summarized by Gale and Devos (1998) for
rice and seven other cereals using what is now known as the 'circle diagram' (Plate
1). Further studies identified QTL controlling important agronomic traits which
showed similarities in locations for the same or similar traits (as reviewed by Xu,
1997). Shattering and plant height are examples that were also mapped to collinear
regions among grass genomes (Paterson et al., 1995; Peng et al., 1999). More
recently, Chen et al. (2003) identified four QTL for quantitative resistance to rice
blast that showed corresponding map positions between rice and barley, two of which
had completely conserved isolate specificity and the other two had partially
conserved isolate specificity. Such corresponding locations and conserved
specificity suggested a common origin and conserved functionality of the genes
underlying the QTL for quantitative resistance, which may be used to discover genes,
understand the function of the genomes and identify the evolutionary forces that
structured the organization of the grass genomes. Such findings reinforce the notion
of collinearity among the cereal genomes.

This unified grass genome model has had a substantial impact upon plant biology but
has not yet lived up to its potential. There are some difficulties in evaluating
synteny between genomes at the macro-level (Xu et al., 2005). First, the genomic
marker data are very incomplete and genomic sequence data are largely lacking for
many grass species. Secondly, the data are sometimes biased because the homologous
DNA probes used in comparative mapping are selected for simple cross-hybridization
patterns. Thirdly, many genes are members of
gene families and, accordingly, it is often difficult to determine if a gene mapped
in the second species is orthologous or paralogous to that in the first species.
Fourthly, the collinearity of gene order and content observed at the recombinational
map level is often not observed at the level of local genome structure (Bennetzen
and Ramakrishna, 2002). Finally, in most early studies, no statistical analysis was
used to evaluate whether the presence of a few markers in the same order on two
chromosomal segments in two species occurs by chance or is truly significant.

The genome collinearity of several Camelineae and Brassicaceae species has recently
been compared to that of A. thaliana by comparative genetic linkage mapping and
comparative chromosome painting (Schranz et al., 2007). A comprehensive study
identified 21 syntenic blocks that are shared by Brassica napus and A. thaliana
genomes, corresponding to 90% of the B. napus genome (Parkin et al., 2005).

Microcollinearity

Using the rice genome sequence as the reference to compare with molecular marker
information of other cereals gave a result which indicated many more rearrangements
than had been expected from Gale and Devos's (1998) 'concentric circles' model. One
such comparison involved more than 2600 mapped sequenced markers in maize among
which only 656 putative orthologous genes could be identified (Salse et al., 2004).
The comparison of the wheat genetic map with the rice sequence also suggests
numerous rearrangements between the two genomes with a high frequency of breakdowns
in collinearity (Sorrells et al., 2003). Extensive comparisons have also been made
between sorghum and rice (Klein et al., 2003; The Rice Chromosome 10 Sequencing
Consortium, 2003). To align the sorghum physical map with the rice map, sorghum BAC
clones were selected from the minimum tiling path of chromosome 3. Unique partial
sequences were obtained from each BAC clone and could be directly compared with the
rice sequence. This approach revealed excellent conservation between the overall
structure and gene order of sorghum chromosome 3 and rice chromosome 1 but also
indicated several rearrangements. Together, these studies indicate a general
conservation of large syntenic blocks within cereals but with many more
rearrangements and synteny breakdowns than originally anticipated.

This trend is even more obvious when synteny is analysed at the sequence level.
Rearrangements may occur that involve regions smaller than a few centimorgans and
would be missed by most recombinational mapping studies. Comparative sequence
analysis involving large genomic segments can detect these rearrangements. Such
analyses reveal the composition, organization and functional components of genomes
and provide insight into regional differences in composition between related
species. Recently, the sequencing of genomic segments in the cereals has enabled
microcollinearity across genes or gene clusters to be investigated. Sequencing of
the domestication locus Q in Triticum monococcum revealed excellent collinearity
with the bread wheat genetic map (Faris et al., 2003). Following the sequencing of
the leaf-rust-resistance locus Rph7 from barley, it was observed that this locus is
flanked by two HGA genes. The orthologous locus in rice chromosome 1 consists of
five HGA genes. In barley, only four of the five HGA genes are present, one is
duplicated as a pseudogene and six additional genes have been inserted in between
the HGA genes. These six genes have homologues on eight different rice chromosomes
(Brunner et al., 2003). The most striking rearrangement was revealed by the
comparison of 100 kb around the Bronze locus of two maize lines. Not only does the
retrotransposon distribution differ between the two lines but the genes themselves
could also be different (Fu and Dooner, 2002). Comparison of the low molecular
weight glutenin locus between T. monococcum and Triticum durum also revealed
dramatic rearrangements: more than 90% of the sequence diverged because of
retro-element insertions and because different genes are present at this locus
(Wicker
et al., 2003). Therefore collinearity can be lost very rapidly within two genomes
from the same species.

With the sequencing of long regions, several studies in cereals have demonstrated
incomplete microcollinearity at the sequence level. Song et al. (2002) identified
orthologous regions from maize, sorghum and two subspecies of rice. It was found
that gross macrocollinearity is maintained but microcollinearity is incomplete among
these cereals. Deviations from gene collinearity are attributable to
micro-rearrangement or small-scale genomic changes such as gene insertions,
deletions, duplications or inversions. In the region under study, the orthologous
region was found to contain six genes in rice, 15 in sorghum and 13 in maize. In
maize and sorghum, gene amplification caused a local expansion of conserved genes
but did not disrupt their order or orientation. As indicated by Bennetzen and Ma
(2003), numerous local rearrangements differentiate the structures of different
cereal genomes. On average, any comparison of a ten-gene segment between rice and a
distant grass relative such as barley, maize, sorghum or wheat shows one or two
rearrangements that involve genes. A simple extrapolation to the rice genome of
about 40,000 genes (Goff et al., 2002) suggests that about 6000 genic rearrangements
occurred which differentiate rice from any of the other cereals. Most of these
rearrangements appear to be tiny and thus would not interfere with the
macrocollinearity observed by recombinational mapping. There are exceptions,
however, which include chromosomal arm translocations and movements of single genes
to different chromosomes (Bennetzen and Ma, 2003).

As expected, there is a high degree of gene conservation between the two
shotgun-sequenced subspecies of rice, japonica and indica, which diverged more than
1 million years ago. On careful inspection, however, narrow regions of divergence
can be found in these genomes (Song et al., 2002). These regions correspond to areas
of increased divergence among rice, sorghum and maize, suggesting that the alignment
of the two rice subspecies might be useful for identifying regions of cereal genomes
that are prone to rapid evolution. Similar comparative analyses of Arabidopsis
accessions have shown that both the relocation of genes and the sequence
polymorphisms between accessions (in both coding and non-coding regions) are common
in the Arabidopsis genome (The Arabidopsis Genome Initiative, 2000). Intraspecific
violation of collinearity has also been identified in maize (Fu and Dooner, 2002).
Han and Xue (2003) also discovered significant numbers of rearrangements and
polymorphisms when comparing indica and japonica genomes in rice. The deviations
from collinearity are frequently due to insertions or deletions. Intraspecific
sequence polymorphisms commonly occur in both coding and non-coding regions. These
variations often affect gene structures and may contribute to intraspecific
phenotypic adaptations.

Implications of genome collinearity

Genomics would be much simpler if the order of genes were common (syntenic) across
the major groups of plants. The usefulness of the collinearity between the genomes
of model plants and important crops can be assessed by the number of failures or
successes in its exploitation. For example, the analysis of the Arabidopsis sequence
provides information that will facilitate the annotation of the rice sequence and
likewise sequencing Medicago provides a resource for research on important crop
legumes. Furthermore, the effort put into sequencing and annotating the rice genome
has also been rewarded, as this annotation will be transferred to related sequences
and used repeatedly in the future. The synteny between the monocots will help
decipher the structure and function of the more complex genomes. A fully assembled
rice sequence allows more accurate assessment of the macro- and microsynteny of rice
with other cereals (Xu et al., 2005).

The advent of technologies for mapping genomes directly at the DNA level has made
comparative genetic mapping among sexually incompatible species possible. Extensive
comparative maps for marker
genes have been constructed for a number of plant taxa, including species in the Poaceae
(rice, maize, sorghum, barley and wheat), Solanaceae (tomato, potato and pepper) and
Brassicaceae (Arabidopsis, cabbages, mustard, turnip and rape). As a result, the concept of
a single genetic or ancestral map for all grasses, with species-specific modifications, is
emerging (Moore et al., 1995). The extensive collinearity of wheat, rye, barley, rice and
maize suggests that it may be possible to reconstruct a map of the ancestral cereal genome.
These conserved gene orders and the possibility of sharing DNA probes and PCR primers
across species will greatly extend the power of mapping analysis by facilitating the
molecular analysis of the corresponding chromosomal regions in different species and
allowing information, and perhaps DNA sequences and genes, to be transferred quickly and
efficiently between different species.

The challenge of finding which map, sequence and eventually functional genomic information
from one species can be accessed, compared and exploited across all plant species will
require the identification of a subset of plant genes that have remained relatively stable
in both sequence and copy number since the radiation of flowering plants from their last
common ancestor. Identification of such a set of genes would also facilitate taxonomic and
phylogenetic studies in higher plants that are presently based on a very small set of
highly conserved sequences, such as those of chloroplast and mitochondrial genes. The
conserved orthologue set of markers, identified computationally and experimentally, may
further studies on comparative genomes and phylogenetics and elucidate the nature of genes
conserved throughout plant evolution.

Completed genome sequences provide templates for the design of genome analysis tools in
orphan species lacking sequence information. For example, Feltus et al. (2006) designed 384
PCR primers to conserved exonic regions flanking introns using sorghum and millet EST
alignments to the rice genome. These conserved-intron scanning primers (CISP) amplified
single-copy loci with 37–80% success rates; i.e. sampling most of the approximately 50
million years of divergence among grass species. When evaluating 124 CISPs across rice,
sorghum, millet, Bermuda grass, teff, maize, wheat and barley, about 18.5% of them seemed
to be subject to rigid intron size constraints that were independent of per-nucleotide DNA
sequence variation. Likewise, about 487 conserved non-coding sequence motifs were
identified in 129 CISP loci. As pointed out by Feltus et al. (2006), CISP provides the
means to effectively explore poorly characterized genomes for both polymorphism and
non-coding sequence conservation on a genome-wide or candidate gene basis and also provides
anchor points for comparative genomics across a diverse range of species. After the whole
genomes of the major food crops have been sequenced, plant breeders will be able to access
new gene tools that will facilitate the selection of outstanding individuals characterized
by resistance to biotic and abiotic stresses and good seed quality, thus enabling breeders
to produce new cultivars in addition to those currently available.

As a fundamental tool in biology, comparative analysis has been extended from being focused
on a specific field to biology as a whole. With the growing availability of phenotypic and
functional genomic data, comparative paradigms are now also being extended to the study of
other functional attributes, most notably gene expression. Microarray techniques present an
alternative method of studying differences between closely related genomes. Advances in
microarray-based approaches (see Section 3.6) have enabled the main forms of genomic
variation (amplifications, deletions, insertions, rearrangements and base-pair changes) to
be detected using techniques that can easily be undertaken in individual laboratories using
simple experimental approaches (Gresham et al., 2008).

Tirosh et al. (2007) reviewed recent studies in which comparative analysis was applied to
large-scale gene expression databases and discussed the central principles and challenges
of such approaches. As different functional properties often co-evolve and complement one
another, their combined analysis reveals additional insights. Unlike sequence-based genetic
map information
however, most functional properties are condition-dependent, a property that needs to be
accounted for during interspecies comparisons. Furthermore, functional properties often
reflect the integrated function of multiple genes, calling for novel methods that allow
network-centred rather than gene-centred comparisons. Finally, one of the main challenges
in comparative analysis is the integration of different data types, which is becoming
particularly important as additional data types are being accumulated. The lack of
appropriate descriptors and metrics that succinctly represent the new information
originating from genomic data is one of the roadblocks on this path. Galperin and Koller
(2006) outlined recent trends in comparative genomic analysis and discussed some new
metrics that have been used. This issue is related to the ontology concept and is discussed
in detail in Chapter 14.

3.6 Array Technologies in Omics

It is widely believed that thousands of genes and their products (i.e. RNA and proteins) in
any given living organism function in a complicated and orchestrated manner. However,
traditional methods in molecular biology generally work on a 'one gene in one experiment'
basis, which means that the throughput is very limited and the whole picture of gene
function is difficult to obtain. In the late 1990s, a new technology known as a biochip or
DNA microarray attracted great interest among biologists. This technology promised to
monitor the whole genome on a single array so that researchers would have a better picture
of the interactions among thousands of genes at the same time.

Various terminologies have been used in the literature to describe this technology; for DNA
microarrays these include, but are not limited to, biochip, DNA chip, DNA microarray and
gene array. Affymetrix, Inc. owns a registered trademark, GeneChip, which refers to its
high density, oligonucleotide-based DNA arrays. However, in some articles appearing in
professional journals, popular magazines and on the Internet, the term gene chip(s) has
been used as a general term that refers to DNA microarray technology. Depending on the type
of molecules that are arrayed, microarrays can also be based on proteins, tissues or
carbohydrates.

An array is an orderly arrangement of samples. It provides a medium for matching known and
unknown molecular samples based on base-pairing (i.e. A-T and G-C for DNA; A-U and G-C for
RNA) or hybridization and automating the process of identifying the unknowns. From its
origin as a new technique for large-scale DNA mapping and sequencing and initial success as
a tool for transcript-level analyses, microarray technology has spread into many areas by
adapting the basic concept and combining it with other techniques. Microarray-based
processes, either mature or under development, include transcriptional profiling,
genotyping, splice-variant analysis, identification of unknown exons, DNA structure
analysis, chromatin immunoprecipitation (ChIP)-on-chip, protein binding, protein–RNA
interaction, chip-based comparative genomic hybridization, epigenetic studies, DNA mapping,
re-sequencing, large-scale sequencing, gene/genome synthesis, RNA/RNAi synthesis,
protein–DNA interaction, on-chip translation and universal microarrays (Hoheisel, 2006).

In this section, the basic procedures of arraying will be introduced and several major
microarray technologies and platforms will be briefly described. The two volumes of DNA
Microarrays (Kimmel and Oliver, 2006a, b) provide a comprehensive coverage of all the
related fields from technologies and platforms to data analysis. The reader is also
referred to Zhao and Bruce (2003), Amaratunga and Cabrera (2004), Mockler and Ecker (2004),
Subramanian et al. (2005), Allison et al. (2006), Hoheisel (2006) and Doumas et al. (2007).

3.6.1 Production of arrays

Complementary strands of DNA and nucleic acids in general can pair in a duplex via
non-covalent binding. This fundamental characteristic is used in all DNA array techniques.
Amaratunga and Cabrera (2004), Arcellana-Panilio (2005) and Doumas et al. (2007) describe
the principles of DNA microarray technology and how arrays are prepared and
used. First, two terms related to microarrays, probe and target, should be introduced. The
gene-specific DNA spotted on to the array is referred to as the probe and the sample to be
tested that will hybridize with the probe is referred to as the target. The same probe
spotted on to the array can be repeatedly hybridized with many different targets (samples).
An experiment using a single DNA chip can provide researchers with information on thousands
of genes simultaneously, resulting in a dramatic increase in throughput. In GeneChips
(http://www.affymetrix.com/) the probe array was designed using an optimal set of
oligonucleotides selected using computer algorithms and manufactured using Affymetrix
light-directed chemical synthesis. Fluorescent labels were used for hybridization and
detection and the Affymetrix software suite was used for data analysis and database
management. Figure 3.10 illustrates a flowchart showing

[Figure 3.10. Flowchart labels: EST database or cDNA library; PCR inserts from EST clones;
multi-well plates; spotting; Treatment 1 and Treatment 2; RNA 1 and RNA 2; Cy5-cDNA 1 and
Cy5-cDNA 2; hybridization; wash; dry; laser scanning. Result panels: scatter plots of
Treatment 1 against Treatment 2 signal on logarithmic axes (1 to 10,000); mean
fold-increase and mean fold-decrease compared to 0 h at 0, 0.5, 1, 4, 8, 24 and 168 h for
Treatment 1 and Treatment 2.]

Fig. 3.10. A flowchart for a general microarray process.
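
The last stage of the flow in Fig. 3.10, where scanned signals for two treatments are turned into fold changes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of any particular scanner or software suite: the function name `fold_changes`, the layout of the `spots` dictionary, the gene names and the intensity floor of 1.0 are all hypothetical choices made for the example.

```python
import math


def fold_changes(spots, floor=1.0):
    """Return {gene: (log2_ratio, signed_fold)} for treatment 2 versus treatment 1.

    spots maps a gene name to four scanner readings:
    (foreground_1, background_1, foreground_2, background_2).
    floor is an assumed minimum signal that keeps the ratio defined
    when background exceeds foreground.
    """
    results = {}
    for gene, (fg1, bg1, fg2, bg2) in spots.items():
        # Background-subtracted spot intensity for each treatment.
        s1 = max(fg1 - bg1, floor)
        s2 = max(fg2 - bg2, floor)
        ratio = s2 / s1
        log_ratio = math.log2(ratio)
        # Signed fold change: +4 is a fourfold increase, -4 a fourfold decrease.
        fold = ratio if ratio >= 1 else -1 / ratio
        results[gene] = (log_ratio, fold)
    return results


# Hypothetical readings for three spots.
example = {
    "gene_a": (4100.0, 100.0, 1100.0, 100.0),  # 4000 versus 1000
    "gene_b": (600.0, 100.0, 2100.0, 100.0),   # 500 versus 2000
    "gene_c": (900.0, 100.0, 900.0, 100.0),    # unchanged
}

for gene, (log_ratio, fold) in fold_changes(example).items():
    print(gene, log_ratio, fold)
```

Working on the log2 scale treats increases and decreases symmetrically, which is why the scatter plots in the figure use logarithmic axes. A real analysis would add normalization before any ratios are interpreted.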

a general microarray process. As DNA microarrays for whole genome expression profiling are
the most mature and widely used technology, they will be used in this chapter as an example
to describe the basic procedures of microarrays.

Types of arrays

An array experiment can be carried out using common assay systems such as microplates or
standard blotting membranes; the arrays can be created by hand or by robotics used to
deposit the sample. In general, arrays are described as macroarrays or microarrays, the
difference being the size of the sample spots. Macroarrays contain sample spot sizes of
about 300 µm or larger and can be easily imaged with existing gel and blot scanners. The
sample spot sizes in microarrays are typically less than 200 µm in diameter and these
arrays usually contain thousands of spots. Microarrays also require specialized robotics
and imaging equipment.

There are two main types of arrays, nylon and glass. Nylon arrays can contain up to about
1000 probes per filter. The target can be labelled using radioactive chemicals and
detection of hybridization can be achieved using a phosphorimager or X-ray film. Glass
arrays can hold up to about 40,000 spots per slide or 10,000 per 2 cm² area (limited by the
capabilities of the arrayer). The target sample is labelled with fluorescent dyes and
detection of hybridization requires specialized scanners.

There are two variants of the DNA microarray technology in terms of the properties of the
arrayed DNA sequence of known identity.

Format I: the probe cDNA (500–5000 bases long) is immobilized on a solid surface such as
glass using robot spotting and exposed to a set of targets either separately or in a
mixture. The development of this method, known traditionally as a DNA microarray, is widely
attributed to Stanford University.

Format II: an array of oligonucleotide (20–80-mer oligos) or peptide nucleic acid (PNA)
probes is synthesized either in situ (on-chip) or by conventional synthesis followed by
on-chip immobilization. The array is exposed to the labelled DNA sample and hybridized; the
identity or abundance of the complementary sequences is then determined. This technology,
historically known as DNA chips, was developed at Affymetrix, Inc. which sells its
photolithographically fabricated products under the GeneChip trademark. Many companies are
now manufacturing oligonucleotide-based microarrays using alternative in situ synthesis or
deposition technologies.

Source of arrays

A collection of purified single-stranded DNA is the initial requirement. A drop of each
type of DNA in solution is placed on to a specially prepared glass microscope slide by a
robotic machine known as an arrayer. This process is called arraying or spotting and
consists of binding a library of synthetic DNA on to a minimum surface area in a dense and
homogeneous fashion. The major difference between various types of DNA arrays lies in the
density of the bound probes and the manner in which these probes have been synthesized. The
arrayer can quickly produce a regular grid of thousands of spots in an area the size of a
dime (about 1 cm²), small enough to fit under the coverslip of a standard slide. The DNA in
the spot is bound to the glass to prevent it from being dislodged during the hybridization
reaction and subsequent wash.

The DNA spotted on to the microarray may be either cDNA (in which case the microarray is
called a cDNA microarray), oligonucleotides (in which case it is called an oligonucleotide
array), subgenomic regions of specific chromosomes or even the entire set of genes. The DNA
spotted on to cDNA microarrays consists of cloned copies of cDNA that have been amplified
by PCR and which correspond to the whole or part of a fully sequenced gene or putative ORF;
ESTs are commonly arrayed. The selection of DNA probes to be spotted on to the microarray
depends on which genes are to be studied. For plants whose genomes have been completely
sequenced, it is possible to array genomic DNA from every known gene or putative ORF. To
obtain sufficient DNA for arraying, each gene or putative ORF from the total genomic DNA
can be amplified by PCR or each cDNA can be cloned and large
numbers of identical DNA copies can be generated by growing them in bacteria.

The DNA spots on a microarray are produced either by synthesis in situ or by deposition of
the pre-synthesized product. DNA synthesis in situ methods have largely been within the
purview of commercial companies. In this method, 20–25-bp long gene-specific
oligonucleotides are generated in situ on a silicon surface by combining a standard DNA
synthesis protocol with phosphoramidite reagents modified with photolabile 5'-protecting
groups (Doumas et al., 2007). The activation for oligonucleotide elongation is achieved
using a mask (Affymetrix; http://www.affymetrix.com) or maskless (NimbleGen;
www.nimblegen.com) method. Alternatively, the reagents can be delivered to each spot using
ink-jet technology (Agilent; http://www.agilent.com). Ongoing research and development
efforts ensure the optimum design of the DNA content and continued technological
advancements enable the production of increasingly higher-density arrays.

Array content

The choice of DNA type to print is fundamental. The sequence of the cDNA could be several
hundred to a few thousand base pairs long. The DNA spotted on oligonucleotide arrays
consists of synthesized chains of oligonucleotides corresponding to part of a known gene or
putative ORF; each oligonucleotide is usually about 25–70 bp long. In an oligonucleotide
array, a gene is generally represented by several different oligonucleotides and they are
carefully chosen for maximal specificity. Longer stretches of DNA such as those obtained
from PCR of cDNA clones produce robust hybridization signals but lower specificity. Short
oligonucleotides (24–30 nt) have greater discrimination and are also suitable for assessing
single-nucleotide changes. Long oligonucleotides (50–70 nt) afford an excellent compromise
between signal strength and specificity and their use has increased among academic core
facilities (Arcellana-Panilio, 2005). Choosing oligos corresponding to the 3' untranslated
regions (3'UTR) increases the likelihood of their being specific and designing oligos close
to the 3' end might also boost signal intensity.

Slide substrates

Glass microscope slides are the solid support of choice and they should be coated with a
substrate that favours binding of the DNA. Development of substrates on atomically flat
slide surfaces and minimum background for higher signal-to-noise ratios has contributed to
the improvement of data quality (Arcellana-Panilio, 2005). Different versions of silane,
amine, epoxy and aldehyde substrates which attach DNA by either ionic interaction or
covalent bond formation are commercially available.

Arrays and spotting pins

The physical process of delivering the DNA to pre-determined coordinates on the array
involves spotting pens or pins carried on a print head that is controlled in three
dimensions by gantry robots with sub-micron precision. A total of 30,000 features of 90-µm
diameter can easily be spotted on to a 25 × 75-mm slide with a maximum spotting density of
over 100,000 features per slide. There are several DNA arraying technologies, including
high speed robotic printing of DNA fragments on glass (usually PCR amplified cDNAs), high
speed robotic printing of long oligonucleotides (70-mers; Agilent technology and many
academic facilities), synthesis of oligonucleotides (25-mers) on microchips using
photolithographic masks (Affymetrix GeneChips) and synthesis of oligonucleotides
(25–70-mers) on microchips using maskless aluminium mirrors (NimbleGen arrays).
Improvements in arraying systems have included shorter printing times and longer periods of
walk-away operation. Arrayers are invariably installed within controlled-humidity cabinets
to maintain an optimum environment for printing.

3.6.2 Experimental design

Careful experimental design is required to determine the type of array to run; how
many replicates to use; and which samples will be hybridized to obtain meaningful data
amenable to statistical analysis, upon which sound conclusions can be drawn. A biological
question must first be framed and a microarray platform then chosen, followed by a decision
on biological and technical replicates and the design of a series of hybridizations.

Microarray experimental design is usually governed by the aim of the experiment. An
important aspect of experimental design is deciding how to minimize variation, which can be
thought of as occurring in three layers: biological variation, technical variation and
measurement error. Replication is the easy answer to dealing with variation. To make the
best use of available resources, it is important to know what to replicate and how many
replicates to apply. Hybridization of two samples to the same slide is made possible by
labelling each sample with chemically distinct fluorescent tags. This also provides the
opportunity to make direct comparisons between samples of primary interest
(Arcellana-Panilio, 2005). Using a common reference becomes more efficient when a large
number of samples need to be compared. When an experiment is testing the effect(s) of
multiple factors, a well-thought-out design is extremely critical so that resources are not
wasted on eventually useless comparisons.

3.6.3 Sample preparation

Preparation of DNA samples for hybridization can follow general DNA extraction protocols,
so here we will focus on RNA sample preparation as described by Arcellana-Panilio (2005).
The RNA for the samples that will be hybridized to a microarray may be obtained from
different types of cells or tissues. Obtaining pure, intact RNA, free from DNA or protein
contamination, is important, while the homogeneity of the RNA source itself as defined by
the biological question being asked must be considered. The amount of RNA required for
hybridization ranges from as little as 2–5 µg total RNA for short oligonucleotide arrays to
10–25 µg total RNA for cDNA spots and long oligonucleotide arrays. In some circumstances it
becomes necessary to amplify the RNA in the sample to obtain adequate amounts for labelling
and hybridization to an array.

To prepare the labelled sample, the first step is to purify mRNA from total cellular
contents. There are several challenges involved: (i) mRNA accounts for only a small
fraction (less than 3% of all RNA in a cell) so isolating mRNA in sufficient quantity for
an experiment (1–2 µg) can be a challenge. Common mRNA isolation methods take advantage of
the fact that most mRNAs have a poly-adenine, poly(A), tail. These poly(A) mRNAs can be
purified by capturing them using complementary oligodeoxythymidine (oligo(dT)) molecules
bound to a solid support such as a chromatographic column or a collection of magnetic
beads. (ii) The more heterogeneous the cells, the more difficult it is to isolate mRNA
specific to the study. (iii) Captured mRNA degrades very quickly and the mRNA has to be
immediately reverse-transcribed into more stable cDNA (for cDNA microarrays). The reverse
transcription reaction usually starts from the poly(A) tail of the mRNA and moves toward
its head; such a reaction is described as oligo(dT)-primed.

3.6.4 Labelling

Before hybridization to DNA arrays or chips, the target (sample) has to be labelled to
allow its subsequent detection. There are several methods that have been developed
historically to detect or identify hybrid DNA molecules including the use of
hydroxylapatite, radioactive labelling, enzyme-linked detection and fluorescent labelling
depending on the nature of the chip, whether it is glass or nylon. In order to be able to
detect which cDNAs are bound to the microarray, the sample is labelled with a reporter
molecule that flags their presence. The reporters currently used in microarray experiments
are fluorescent dyes known as fluors or fluorophores, chemicals that fluoresce when exposed
to a specific wavelength of light. A differently
coloured fluor is used for each sample so that the two samples can be differentiated on the
array.

The cDNA or mRNA can be labelled either directly or indirectly. In the direct labelling
procedure, a fluorescently labelled nucleotide is incorporated into the cDNA product as it
is being synthesized. With this method, a difference in the steric hindrance conferred by
different label moieties causes some labelled nucleotides to be more efficiently used than
others, producing a dye bias in which one sample is labelled at a higher level overall than
the other. Cyanine 3 (Cy3) and cyanine 5 (Cy5) are large molecules that reduce reverse
transcriptase efficiency of long transcripts and certain sequences. Cy3-nucleotide tends to
be incorporated at a higher frequency than Cy5 although this does not necessarily translate
into a better labelled target. To prevent the dye bias, the indirect labelling approach was
developed where RNA is reverse transcribed in the presence of an amino allyl-modified
nucleotide that enables the chemical coupling of fluorescent labels after the cDNA is
synthesized. If the coupling reaction goes to completion, the frequency of labelling
becomes independent of the fluorophore (Arcellana-Panilio, 2005).

The labelled sample is the target for the experiment. The number of fluor molecules that
label each cDNA depends on its length and also possibly its sequence composition. For an
RNA sample, either total RNA or mRNA is typically isolated and labelled using a first
strand cDNA synthesis step either by direct incorporation of a fluorescent dye or by
coupling the dyes to a modified nucleotide. For non-expression-based experiments, DNA
rather than RNA can be labelled and hybridized to the array.

3.6.5 Hybridization and post-hybridization washes

The array holds hundreds or thousands of spots, each of which contains a different DNA
sequence. If a sample contains a cDNA whose sequence is complementary to the DNA on a given
spot, that cDNA will hybridize to the spot where it will be detectable by its fluorescence.
In this way, every spot on an array is an independent assay for the presence of a different
cDNA. Hybridization is achieved by pouring the labelled sample on to the array and allowing
it to diffuse uniformly. It is then sealed in a hybridization chamber and incubated at a
specific temperature for a period of time sufficient to allow hybridization reactions to
complete. The experimental conditions should ensure that all areas of the array are exposed
to a uniform amount of labelled sample throughout the hybridization.

Hybridizations are processed directly on the slides after target synthesis. The
hybridization step is literally where everything comes together, i.e. the labelled
molecules find their complementary sequences on the array and form double-stranded hybrids
which are strong enough to withstand stringent washes. As in the hybridization of classical
Southern and northern blots, the objective is to favour the formation of hybrids and the
retention of those which are specific. Hybridization conditions depend on the length of
probes arrayed on the slide and need to be extensively tested before analysis. As an
example, probe melting temperatures range from 42 to 70°C depending on the nature of the
buffer: the presence of formamide exerts a positive effect on buffer stringency in
Denhardt-type buffers which are used at 42°C, whereas Sarkosyl-based buffers are commonly
used around 70°C. Exogenous DNA (e.g. salmon sperm and Cot-1 DNA) reduces background by
blocking areas of the slide with a general affinity for nucleic acid or by titrating out
labelled sequences that are non-specific. Denhardt's reagent (containing equal parts of
Ficoll, polyvinylpyrrolidone and bovine serum albumin) is also used as a blocking agent.
Detergents such as SDS reduce surface tension and improve mixing while helping to lower
background at the same time. Temperature is an important factor that can be manipulated
during the hybridization and post-hybridization washes of microarrays and here much can be
learned from
what has already been established for northern or Southern blots. For microarrays to be
useful as a means of quantifying expression the target has to be present in limiting
concentrations and the probe must be present in sufficient excess so as to remain virtually
unchanged even after hybridization (Arcellana-Panilio, 2005). One important feature of
fluorescence detection is that it allows the simultaneous hybridization of two to several
targets that have been differently labelled.

The quality of the hybridization can be assessed by spotting the array with a set of
hybridization control genes, spiking the labelled sample with a known amount of these
controls prior to exposure to the array and verifying that these control genes are indeed
showing up as having been hybridized.

3.6.6 Data acquisition and quantification

Once the wet phase (e.g. slide hybridization and washing off any excess labelled sample) is
completed, the signal from each of the hybridized targets can be captured, that is, the
array must be scanned to determine how much of each labelled sample is bound to each spot.
The signal is acquired using array scanners, either a charge-coupled device (CCD) or a
confocal microscope, typically equipped with lasers to excite the fluorophores at a
specific wavelength and photo-multiplier tubes to detect the emitted light. Spots with more
bound sample will have more reporters and will therefore fluoresce more intensely. Whatever
the scanner resolution, the microarray spot diameter needs to be five to ten times larger
than the scanner resolution, which can be as little as 5 µm for the most recent models. The
end-product of a microarray experiment is a scanned grey-scale image whose intensity
measurements range from 0 to 2^16. The image is usually stored in a 16-bit tagged image
file format (TIFF, for short). The most basic scanner models offer excitation and detection
of the two most commonly used fluorophores (Cy3 and Cy5) whereas higher-end models enable
excitation at several wavelengths and offer dynamic focus, linear dynamic range over
several orders of magnitude and options for high-throughput scanning. The objective of the
scanning procedure is to obtain the best image, where the best is not necessarily the
brightest (to avoid over-saturation beyond the signal range) but is the most faithful
representation of the data on the slide.

Although it is only supposed to pick up the light emitted by the target cDNAs bound to
their complementary spots, the scanner will inevitably pick up light from various other
sources, including the labelled sample hybridizing non-specifically to the glass slide,
residual (unwanted) labelled probe adhering to the slide, various chemicals used in
processing the slide and even the slide itself. This extra light creates background
signals. Once signal and background values are clearly defined, which is specific to each
experiment, data can be extracted from the image by counting the pixels within each probe
and background area and recording this in a computer readable format.

Data extraction from the image involves several steps (Arcellana-Panilio, 2005):
(i) gridding or locating the spots on the array; (ii) segmentation or assignment of pixels
either to foreground (true signal) or background; and (iii) intensity extraction to obtain
a new value for foreground and background associated with each spot. Subtracting the
background intensity from the foreground yields the true spot intensity which can be used
as an approximation of relative gene expression.

3.6.7 Statistical analysis and data mining

Huge data sets are generated by microarray experiments. For example, 20 hybridization
experiments with the Arabidopsis GeneChip generate a set of 2,624,000 data points (8200
genes × 16 oligonucleotides × 20 hybridizations). Such a massive amount of data prohibits
any manual treatment. Also experimental variability is generally significant and has to be
managed in order
to exploit the data properly. Allison et al. (2006) examined five key components of microarray analysis: (i) design (the development of an experimental plan to maximize the quality and quantity of information obtained); (ii) pre-processing (processing of the microarray image and normalization of the data to remove systematic variation; other potential pre-processing steps include transformation of data, data filtering and, in the case of two-colour arrays, background subtraction); (iii) inference (testing statistical hypotheses, e.g. which genes are differentially expressed); (iv) classification (analytical approaches that attempt to divide data into classes with no prior information or into predefined classes); and (v) validation (the process of confirming the veracity of the inferences and conclusions drawn in the study).

Reproducible and reliable microarray results can only be achieved through quality control starting with data generation. Good laboratory proficiency and appropriate data analysis practices are essential (Shi et al., 2008). Numerous software packages, both free and commercial, are available for quantifying microarray data. Typically, the interpreted array data will highlight a relatively small number of spots that deserve further investigation. Alternatively, the overall pattern of profiling can be used as a fingerprint to characterize specific phenotypes.

The quantified data from the images are typically obtained as tab-delimited text files. First, dust artefacts, comet tails and other spot anomalies should be identified and flagged so that they will not enter the analysis. Pre-processing the quantified data before formal analysis includes the flagging of ambiguous spots with intensities lower than a threshold defined by the mean intensity plus two standard deviations of supposedly negative spots (no DNA, buffer and/or non-homologous DNA controls).

Interpreting the data from a microarray experiment can be challenging. Quantification of the intensities of each spot is subject to noise from irregular spots, dust on the slide and non-specific hybridization. Deciding the intensity threshold between spots and background can be difficult especially when the spots fade gradually around their edges. Detection efficiency might not be uniform across the slide, leading to excessive red intensity on one side of the array and excessive green on the other.

Data normalization addresses systematic errors that can skew the search for biological effects. One of the most common sources of systematic error is the dye bias introduced by the use of different fluorophores to label the target. Print-tip differences can also lead to sub-grid biases within the same array while scanner anomalies can cause one side of an array to seem brighter than the other. Normalization across multiple slides to remove bias can be accomplished by scaling the within-slide normalized data. In practice, examining the box plots of the normalized data of individual arrays for consistency of width can usually indicate whether normalization across arrays is required.

Spatial plots can locate background problems and extreme values. The shape and spread of scatter plots and the height and width of box plots give an overall view of data quality that can give clues about the effects of filtering and different normalization strategies. Gene expression profiling will be taken as an example for the rest of this section. Clustering algorithms are means of organizing microarray data according to similarities in expression patterns. In this case, co-expressed genes are assumed to be co-regulated, and a logical follow-up to this analysis is the search for regulatory motifs and the common upstream or downstream factors that may tie these co-expressed genes together. Treatments can be clustered based on similarity in gene expression profiles. Genes can be clustered based on similarity in expression patterns across profiles. Two mathematical approaches are often used, hierarchical or k-means clustering (Stanford) and self-organizing maps (SOMs) (Whitehead Institute).

A strategy for identifying differentially expressed genes is to compute the t-statistic and correct for multiple testing using adjusted P-values. The B-statistic, derived using an empirical Bayes approach, has
been shown in simulations to be far superior to either log ratios or the t-statistic for ranking differentially regulated genes (Lonnstedt and Speed, 2002). The twofold change continues to be a benchmark for researchers perusing lists of microarray data in order to validate the data by PCR, which can provide independent confirmation of the expression patterns of specific genes. However, fold change has become more of a secondary criterion for the selection of candidates for follow-up from a list of genes ranked according to more reliable measures of differential expression (Arcellana-Panilio, 2005). After preliminary data mining and statistical analysis, validation and follow-up experiments can be designed.

There are many examples of the array technologies described in this section. In yeast, 260,000 oligonucleotides corresponding to all the genes in yeast have been synthesized on to a 1.28 cm² chip. These chips have allowed the identification of genes expressed in various mutants under different culture conditions or at different stages of growth. Numerous genes of unknown function have thus been recognized, regulated in a manner similar to or opposite to that of genes of known function; transcription of the genome is thus incorporated into a vast combinatorial network. In plants, Affymetrix has commercialized microchips to evaluate the expression of Arabidopsis genes, allowing the identification of genes that are active during pathogen infection or during treatment with herbicides, fungicides or insecticides. This also facilitates the determination of which genes are transcribed in which tissues under which conditions or during which stages of development. Commercial microarrays are also available from Affymetrix for several other crop plants such as maize and tomato.

3.6.8 Protein microarrays and others

A protein chip or microarray is a piece of glass on which different molecules of protein have been affixed at separate locations in an ordered manner to form a microscopic array. Compared with DNA microarrays, the development of protein-based approaches poses technical problems for several reasons (Bernot, 2004): (i) proteins consist of 20 distinct amino acids while there are only four bases in DNA; (ii) depending on their amino-acid composition, proteins may be hydrophilic, hydrophobic, acidic or basic (while DNA is always hydrophilic and negatively charged); and (iii) proteins are often post-translationally modified (by glycosylation, phosphorylation, etc.).

Although detection of protein microarrays can be carried out using general detection methods as described above, the problem is that protein concentrations in a biological sample may differ by many orders of magnitude from those of mRNAs. Therefore, protein array detection methods must have a much larger range of detection. The preferred method of detection is currently fluorescence detection. Fluorescent detection is safe, sensitive and can produce high resolution. The fluorescent detection method is compatible with standard microarray scanners but some minor alterations to software may need to be made.

Protein microarrays have been made in the following manner (Macbeath and Schreiber, 2000; Bernot, 2004). Proteins are deposited on to a support and subsequently fixed to it. Thus 1600 distinct proteins may be arranged per cm². These arrays are ordered so that it is known which protein is represented by any given spot. The microarrays are then incubated with other ligands (fluorescently labelled) and the result of the hybridization is analysed by confocal microscopy (it is also possible to employ radioactively labelled ligands). The protein recognized may be identified using the signal localization data obtained. The intensity of the signal obtained is proportional to the level of ligand-protein interaction.

In addition to the most frequently used DNA and protein microarrays discussed above, other microarrays include those built using tissues (cells) and carbohydrates. Similar to other microarrays, a tissue chip or microarray is a piece of glass on which different tissues have been affixed, while
sugar or carbohydrate microarrays include oligosaccharides, polysaccharides/glycans and glycoconjugates fixed on an array. Carbohydrates are very different from proteins in the following aspects: (i) carbohydrates are highly heterogeneous as they have a large number of different molecules determined by over 500,000 different oligosaccharide units; (ii) their synthesis is very complicated and involves a larger number of enzymes; and (iii) biological information that is stored in the various types of carbohydrates is less well understood. For these reasons, carbohydrate microarrays will be a useful tool for glycomics.

A new technology that is related to microarrays and should be mentioned is microfluidics. Microfluidics is the science and technology of systems that process or manipulate small (10^-9 to 10^-18 l) amounts of fluids using channels with dimensions of tens to hundreds of micrometres (Whitesides, 2006). The first applications have a number of useful characteristics: (i) the ability to use very small quantities of samples and reagents and to carry out separations and detections with high resolution and sensitivity; (ii) low cost; (iii) short analysis times; and (iv) small footprints for the analytical devices. Microfluidics offers fundamentally new capabilities in the control of concentrations of molecules in space and time. In the areas of microanalysis, microfluidics offers approaches for biological analyses that require much greater throughput and higher sensitivity and resolution than were previously required. It has great potential to improve the analytical processes involved in proteomics, DNA isolation, PCR and DNA sequencing.

3.6.9 Universal chip or microarray

Most microarray platforms are designed to address a specific set of questions in a specific organism. This means that a specific microarray platform needs to be established and produced for each application. Moreover, many assays that are carried out on microarrays would work even better in a homogeneous solution rather than on a solid support (Hoheisel, 2006). The establishment of zip-code arrays can address these problems by separating the actual assay from the microarray hybridization (Gerry et al., 1999). Such microarrays contain a set of unique and distinct oligonucleotides that are immobilized at known locations. Because they should not be complementary to any sequence in any organism and are made solely to identify the address of a particular location on the microarray, they are called zip-code sequences (Fig. 3.11). The oligonucleotides are designed to have similar thermodynamic properties and thus hybridization can be carried out at one temperature and under defined stringency conditions. Instead of having to produce many different microarrays, a single design can be used for various assays.

For example, Hoheisel (2006) described a universal microarray option that involves using the L-DNA enantiomer, the mirror image form of normal D-form DNA, for the zip-code oligomer (Fig. 3.11). Because L-DNA forms a left-helical duplex, there is no cross-hybridization between L-DNA and D-DNA. However, chimaeric molecules that consist of L-form and D-form stretches can be produced by standard chemistry. Therefore, D-DNA primers are produced with an L-DNA zip-code tag that binds to the L-DNA complementary oligonucleotide on the microarray. L-DNA microarrays are stable because L-DNA is resistant to nuclease activities. At the same time, only the zip-code part of the molecules used in homogeneous solution is able to hybridize to the array. Neither the D-form primer portion nor the analyte (for example, genomic DNA or RNA preparations) will cross-hybridize with the array.

3.6.10 Whole-genome analysis using tiling microarrays

The recent explosion in available genome sequence data has made it realistic to undertake microarray analysis at the whole-genome level. Interestingly, these sequence
[Figure 3.11: schematic not reproduced. Recoverable labels: a D-form gene-specific primer carrying an L-form zip-code; genomic DNA; base discrimination by primer extension in solution (ddATP, ddCTP, ddGTP, ddTTP); molecule separation on zip-code array; primer extension and labelling of Sample 1 and Sample 2; hybridization to the L-form zip-code array. Applications shown: genotyping, protein-dsDNA interactions, protein-ssDNA interactions, epigenetic studies, protein selection or attachment by aptamers, CGH, splice variant studies and transcriptional profiling.]

Fig. 3.11. The concept of universal microarray. dsDNA, double-stranded DNA; ssDNA, single-stranded DNA; CGH, comparative genomic hybridization.
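
The zip-code oligonucleotides of the universal array in Fig. 3.11 are described as having similar thermodynamic properties and as not cross-hybridizing. The sketch below illustrates one simple way such a filter could be expressed. It is only an illustration under stated assumptions: the function names, the candidate sequences and the Tm window are hypothetical, the Wallace rule (2 °C per A/T, 4 °C per G/C) is a rough approximation suited to very short oligonucleotides, and the check against the sequences of any organism required for real zip-codes is not attempted.

```python
# Illustrative only: helper names, sequences and thresholds are assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def wallace_tm(seq):
    """Approximate melting temperature in deg C: 2*(A+T) + 4*(G+C)."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def select_zip_codes(candidates, target_tm, tolerance):
    """Keep candidates whose approximate Tm lies within `tolerance` of
    `target_tm` and that are neither self-complementary nor the reverse
    complement of a zip-code that has already been accepted."""
    chosen = []
    for seq in candidates:
        if abs(wallace_tm(seq) - target_tm) > tolerance:
            continue  # would need a different hybridization temperature
        rc = reverse_complement(seq)
        if rc == seq or rc in chosen:
            continue  # would hybridize to itself or to another zip-code
        chosen.append(seq)
    return chosen


# Made-up 8-mers; real zip-codes are considerably longer.
candidates = ["ACGTTGCA", "GGCCAAGT", "ACTTGGCC", "ATATATAT", "GACTGACT"]
zip_codes = select_zip_codes(candidates, target_tm=24, tolerance=2)
print(zip_codes)
```

In this toy set, "ACTTGGCC" is rejected because it is the reverse complement of the already accepted "GGCCAAGT", and "ATATATAT" is rejected because its approximate Tm falls outside the window. A production design would more likely use nearest-neighbour thermodynamics than the Wallace rule.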

data have led to the advent of high-density DNA oligonucleotide-based whole-genome tiling microarrays (WGAs) which can be employed to interrogate a full genome's worth of sequence data in a single experiment. This technology allows a more complete understanding of an organism's genomic content and should provide a dramatic improvement in the understanding of numerous biological processes. WGAs comprise relatively short (< 100-mer) oligonucleotide features. Furthermore, they can be created with > 6,000,000 discrete features, each comprising millions of copies
of a distinct DNA sequence. For instance, the Affymetrix GeneChip Arabidopsis tiling 1.0R array (http://www.affymetrix.com) is a single array comprising over 3.2 million perfect match and mismatch probe pairs (approximately 6.4 million probes in total) tiled with 35-bp spacing throughout the complete non-repetitive A. thaliana genome (Zhang, X. et al., 2006).

WGAs can be employed for a myriad of purposes in plants including empirical annotation of the transcriptome, mapping of regulatory DNA motifs using ChIP-on-chip, novel gene discovery, analysis of alternative RNA splicing, characterization of the methylation state of cytosine bases throughout a genome (methylome) and the identification of sequence polymorphism (Gregory et al., 2008). Overall, implementing standardized protocols for RNA labelling, hybridization, microarray processing, data acquisition and data normalization within the plant community will minimize sources of error and data variability between laboratories and across microarray platforms. In this way WGA analysis among a diverse set of groups will result in high-quality, easily reproducible data that will aid the research of the entire plant community.

3.6.11 Array-based genotyping

Array techniques have become increasingly popular as a tool for genome-wide genotyping since they offer an assay that is highly multiplexed at a low cost per data point. One of the earliest reports of microarray-based genotyping employed high-density WGAs produced by photolithographic synthesis (Affymetrix) for the simultaneous discovery and assay of DNA polymorphisms in yeast. In genotyping assays based on microarrays, allelic variation is detected as the differential hybridization of labelled genomic DNA to individual probes or sets of probes covering identifiable genomic locations. The polymorphism of the two sequences, originating from two different cultivars or genotypes, results in differential hybridization intensity and this property, associated with sequence characteristics, functions as a molecular marker popularly known as single feature polymorphism (SFP). Using this approach a large number of SFPs were identified between two laboratory strains of yeast (Winzeler et al., 1998).

For the larger and more complex Arabidopsis genome, tiling arrays were not available and hence the first experiments involved hybridization of labelled genomic DNA using Affymetrix AtGenome1 GeneChips based on available, expression-based annotation for ORFs. Despite this ORF-based focus, nearly 4000 SFPs were identified between the Columbia and Landsberg erecta accessions (Borevitz et al., 2003). In order to determine genome-wide patterns of SFP, hybridization to the ATH1 gene expression array was used to interrogate genomic DNA diversity in 23 wild strains (accessions), in comparison with the reference strain Columbia. At < 1% false discovery rate, 77,420 SFPs with distinct patterns of variation were detected across the genome. Total and pair-wise diversity was higher near the centromeres and the heterochromatic knob region (Borevitz et al., 2007). By high-density array re-sequencing of 20 diverse strains (accessions), more than 1 million non-redundant SNPs were identified (Clark et al., 2007). Salathia et al. (2007) developed a microarray-based method that assesses 240 unique indel markers in a single hybridization experiment at a cost of less than US$50 in materials per line. The genotyping array was built with 70-mer oligonucleotide elements representing indel polymorphisms between Columbia and Landsberg erecta. Multi-well chips allow groups of 16 lines to be genotyped in a single experiment.

Microarray-based genotyping has recently been further developed in several crop plants. Using a high-density microarray technology pioneered at Perlegen Sciences (http://www.perlegen.com), the International Rice Functional Genomics Consortium initiated a project to identify a large fraction of the SNPs present in cultivated rice through whole-genome comparisons of 21 rice genomes, including cultivars, germplasm lines and landraces (McNally et al.,
2006). Perlegen designed SNP-discovery arrays to include all possible SNP variations with multiple levels of redundancy.

Edwards et al. (2008) developed a microarray platform for rapid and cost-effective genetic mapping using rice as a model. In contrast to methods employing genome tiling microarrays for genotyping, the method is based on low-cost spotted microarray production, focusing only on known polymorphic features. A genotyping microarray was produced comprising 880 SFP elements derived from indels identified by aligning genomic sequences of the japonica cultivar Nipponbare and the indica cultivar 93-11. The SFPs were experimentally verified by hybridization with labelled genomic DNA prepared from the two cultivars. Using the genotyping microarrays, high levels of polymorphism were found across diverse rice accessions.

In soybean, the GoldenGate assay, which is capable of multiplexing from 96 to 1536 SNPs in a single reaction, has been tested to determine the success rate of converting verified SNPs into working arrays (Hyten et al., 2008). Allelic data were successfully generated for 89% of the 384 SNP loci when it was used in three recombinant inbred line (RIL) mapping populations. Using the same system, two panels of 1536 SNP markers have been developed in maize through collaboration between Cornell, CIMMYT and Illumina, one with SNPs developed from candidate genes relevant to drought tolerance and the other with SNPs randomly distributed on the maize genome (Yan et al., 2009).
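
The principle behind single feature polymorphisms in Section 3.6.11, allelic variation detected as differential hybridization of genomic DNA from two genotypes to the same probe, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function name, the probe identifiers, the intensity values and the twofold (1 log2 unit) cut-off are hypothetical. The studies cited above relied on replicated hybridizations and formal statistical testing with false discovery rate control rather than a fixed threshold.

```python
import math
from statistics import mean


def call_sfps(intensities_a, intensities_b, min_log2_diff=1.0):
    """Return {probe_id: log2 difference} for probes whose mean log2
    hybridization intensity differs between genotype A and genotype B
    by at least `min_log2_diff` (1.0 corresponds to a twofold difference)."""
    calls = {}
    for probe_id, reps_a in intensities_a.items():
        reps_b = intensities_b[probe_id]
        log_a = mean(math.log2(x) for x in reps_a)
        log_b = mean(math.log2(x) for x in reps_b)
        diff = log_a - log_b
        if abs(diff) >= min_log2_diff:
            calls[probe_id] = diff
    return calls


# Made-up background-corrected intensities, two replicate arrays per genotype.
genotype_a = {"probe_1": [1024, 1024], "probe_2": [512, 512], "probe_3": [256, 256]}
genotype_b = {"probe_1": [256, 256], "probe_2": [512, 512], "probe_3": [1024, 1024]}

sfps = call_sfps(genotype_a, genotype_b)
print(sfps)
```

A positive value marks a probe that hybridizes more strongly to genotype A (the sequence of genotype B presumably carries a polymorphism under the probe), and a negative value marks the reverse. Probes with similar intensities in both genotypes, such as probe_2 here, are not called.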
4 Populations in Genetics and Breeding

Many types of populations are currently being used in genetic studies and plant breeding. The properties of a population depend on how it is developed and which parents are involved. Doubled haploids (DHs), recombinant inbred lines (RILs) and near-isogenic lines (NILs) are three important types of populations that have a long history of application in plant breeding and have been widely used in genetic mapping, gene discovery and genomics-assisted breeding since the discovery of DNA-based markers. This chapter describes in general the structure, development and utilization of these important genetic populations, based on a comprehensive discussion by Xu and Zhu (1994). More details on applications of these populations will be covered in other chapters.

4.1 Properties and Classification of Populations

Populations that are currently used in genetics and plant breeding can be classified and their properties can be described based on their genetic constitution, maintenance, genetic background and origin.

4.1.1 Genetic constitution-based classification

For two alleles, A1 and A2, at a specific genetic locus, A, there are three different possible genotypes, A1A1, A2A2 and A1A2. If a population consists of individuals that have an identical genotype (no matter whether they are homozygous, A1A1 or A2A2 or heterozygous, A1A2, for locus A) it is said to be homogeneous. However, if a population consists of individuals that have different genotypes (for example, some with A1A1 or A2A2, and others with A1A2) it is said to be heterogeneous.

Based on the above definitions, there are four types of populations:
1. Homogeneous populations with homozygous individuals: such as individuals from a cultivar of a self-pollinated species or from an inbred derived from an open-pollinated species.
2. Homogeneous populations with heterozygous individuals: such as F1 plants derived from two homogeneous and homozygous cultivars of self-pollinated species or between two inbreds derived from an open-pollinated species.
3. Heterogeneous populations with homozygous individuals: such as pure-breeding

Yunbi Xu 2010. Molecular Plant Breeding (Yunbi Xu) 113

individuals derived from continuous selfing of a hybrid of two inbred lines or cultivars, such as RILs, where each individual is homozygous, either A1A1 or A2A2, while different individuals have different genotypes.
4. Heterogeneous populations with heterozygous individuals: such as individuals in early generations such as F2 and F3 derived from two inbred lines or homozygous cultivars. A set of open-pollinated cultivars of an open-pollinated species is a heterogeneous population containing heterozygous individuals.

4.1.2 Genetic maintenance-based classification

Based on whether a population can maintain its genetic constitution through selfing from one generation to another, populations can be classified into two types:

1. Tentative or temporary populations: individuals in a population, such as F2, F3, BC1, BC2, etc., have different genotypes and their genetic constitution will change with recombination resulting from selfing or inbreeding. These types of population are difficult to maintain and in most cases, the same genetic constitution can only be used once.
2. Permanent or immortalized populations: this type of population consists of a set of pure-breeding lines derived from two parents or a common set of parents. Individuals within a line have identical genotypes, while individuals from different lines have different genotypes. Each line can serve as a segregation unit from the parental population and population structure and genetic constitution can be maintained consistently generation after generation through selfing or inbreeding processes.

4.1.3 Genetic background-based classification

Populations can be classified into those within which individuals have a nearly isogenic background and those that have a heterogeneous background. A population that consists of lines with nearly identical genetic backgrounds can be derived from genetic processes such as continuous backcrossing of a hybrid to one of its parental lines so that lines only differ for a specific target trait or locus. All other types of populations, including F2, backcross (BC), RILs and DHs have heterogeneous backgrounds, i.e. individuals within these populations differ not only in the target traits but also in the remainder of the traits.

4.1.4 Origin-based classification

Populations can be classified into two basic categories based on the origin of the individuals they contain: populations of natural cultivars and populations formed by planned matings among selected parents, or genetic mating populations.

Populations of natural cultivars

These populations consist of a group or subset of cultivars which are selected from a large number of cultivars for specific target traits or are based on specific pedigree relationships. The variation for the target trait among groups of cultivars can be investigated and the relationship between the target trait and other traits or molecular markers can be established. For example, the genetic effect of plant height can be studied by comparing tall cultivars with short ones.

Populations formed by planned matings

Mating populations are specifically designed for genetic studies and derived from a specific genetic mating design using selected genetic stocks. There are several genetic mating designs that are widely used in genetics and breeding.

DIALLEL CROSSES. A total of n cultivars or inbred lines are selected as male and female parents to produce crosses of all possible combinations. The F1s or F2s derived from
these crosses are then genetically analysed. The mating design is as follows:

Parent    P1    P2    P3    ...    Pn
P1
P2
P3
...
Pn

A full diallel analysis will include all one-way hybrids and parents while a partial or incomplete diallel analysis may contain just half the diallel without reciprocals or parents. Diallel crosses are usually used to estimate general combining ability for the parents and specific combining ability for specific crosses, providing information for producing hybrids.

NORTH CAROLINA DESIGNS. There are three North Carolina designs, denoted by NCI, NCII and NCIII. These designs are most often used in cross-pollinated crops and to study broad-based populations. Their use in self-pollinated crops usually involves many inbred lines that can reasonably be considered to represent a large, reference population, e.g. late maturing soybean adapted to a geographical belt of the USA. To simplify the description, however, inbreds are taken as an example.

NCI: two inbred lines are crossed to produce F2, and then some individuals are randomly selected from the F2 population as males to intermate with other randomly selected females. The offspring derived from this intermating will be used in genetic analysis. The design can be described as below:

Males        1            2            3
Females      4  5  6      7  8  9      10  11  12
Offspring

NCII: n parental lines are divided into two groups, one as males and the other as females, to produce crosses of all possible combinations.

Cultivar    1    2    3    ...    n1
n1+1
n1+2
n1+3
...
n1+n2

NCIII: n individuals are selected from an F2 population to backcross with two parents, P1 and P2:

F2 individual    1    2    3    ...    n
P1
P2

TRIPLE TESTCROSS (TTC) AND SIMPLIFIED TTC (sTTC). TTC is an extension of NCIII, where n individuals (n > 20) are selected from an F2 population to backcross with both parental lines, P1 and P2, and the F1 (P1 × P2):

F2 individual    1    2    3    ...    n
P1
P2
F1

In sTTC: n cultivars or strains (n > 20) are selected from the germplasm pool to cross with two cultivars or strains, PH and PL, which show extreme phenotypes (with the highest and lowest phenotypic values, respectively).

Strain    1    2    3    ...    n
PH
PL

The populations derived from the above genetic mating designs have been widely used in conventional quantitative genetics to study and subsequently exploit modes of gene action determining the inheritance and expression of the target traits. The reader is referred to Hallauer and Miranda (1988) and
Mather and Jinks (1982) or sections discussing quantitative genetics in plant breeding texts for details regarding the genetic information that can be derived from the study of hybrids or families formed using each of these mating designs. Some of these designs have also been used in genetic mapping of quantitative traits.

Inbreeding populations

This type of population includes segregating populations such as F2 and F3 populations which are derived from selfing or sibmating an F1 hybrid, BC populations that are derived from backcrossing the F1 to one of the parents or advanced BC populations derived by multiple backcrossings of the F1 to one of the parents.

Populations used in genetic studies and plant breeding can be derived from any of the mating designs discussed above. For breeding purposes, the sizes of populations that will be maintained can be much smaller than those used in genetic studies because breeders only need to retain the populations with desirable traits. For genetic studies, however, geneticists need to maintain as large a population as possible and all types of segregates including those with undesirable traits.

4.2 Doubled Haploids (DHs)

Cells or plants that contain a single complete set of chromosomes are called haploid. Haploids derived from diploids are called monoploid, while haploids derived from polyploids are called poly-haploid. Diploids produced from chromosome doubling of haploids are called doubled or double haploid (DH). The DH approach has several advantages that make it useful in genetics and plant breeding. DHs can be produced via in vivo and in vitro systems. Haploid embryos are produced in vivo by parthenogenesis, pseudogamy or chromosome elimination after wide crossing. The haploid embryo is rescued and cultured, and chromosome doubling produces DHs. The in vitro methods include gynogenesis (ovary and flower culture) and androgenesis (anther and microspore culture). Forster et al. (2007) reviewed various approaches for haploid production in plants. Forster and Thomas (2004) and Szarejko and Forster (2007) reviewed the use of DHs in genetic studies and plant breeding. Recent reviews on specific crop species are available for tomato (Bal and Abak, 2007) and nutraceutical species (Ferrie, 2007).

4.2.1 Haploid production

There are several approaches to haploid production. Naturally occurring haploids have been reported in a number of species including tobacco, rice and maize. In barley, the hap initiator gene was reported to control haploidy and spontaneous haploids were recovered at high frequency (Hagberg and Hagberg, 1980), with up to 8% haploid offspring being recovered when a cultivar that was homozygous for the hap allele was used as the female parent to cross with other cultivars, but none were produced from the reciprocal cross. In maize, the indeterminate gametophyte gene (ig) results in a monoploid embryo either from the sperm cell or the egg cell (Kermicle, 1969). Although DHs can be recovered from such spontaneous haploids, their frequencies are usually too low for genetics and breeding purposes.

With the recognition of the importance of DHs in plant breeding, extensive efforts have been made to induce haploid embryogenesis and increase the frequency at which DHs can be recovered. The benefits of DHs have already been demonstrated in many research and breeding programmes. This progress has led to DH cultivars for commercial production and DH populations for genetics and breeding studies. In barley, over 100 cultivars have been released and similar numbers of rice and rapeseed DH cultivars have been listed (Forster and Thomas, 2004). DHs have also been used successfully in recalcitrant species such as oat (Kiviharju et al., 2005) and rye (Tenhola-Roininen et al., 2006).

Maluszynski et al. (2003) edited a manual presenting a set of protocols for the production of DHs in 22 major crop plant species including four tree species. The manual contains various protocols and approaches to DH production that have been success-
fully used for different germplasm resources in each species. The protocols describe in detail all the steps in DH production, from donor plant growth conditions, through in vitro procedures, media composition and preparation to regeneration of haploid plants and methods for chromosome doubling. The manual enables the researcher to choose the most suitable method for production of DHs for their particular laboratory conditions and plant materials, e.g. microspore versus anther culture, wide hybridization or gynogenesis. The manual also contains information on the organization of a DH laboratory, basic DH media and associated simple cytogenetic methods for ploidy level analysis. An excellent overview of haploid induction and the application of doubled haploids is provided for Brassicaceae, Poaceae and Solanaceae in Haploids in Crop Improvement II (Biotechnology in Agriculture and Forestry) edited by Palmer et al. (2005).

There are now five methods generally applicable to the production of haploids in plants with frequencies that are useful for

endosperm. Preferential or uniparental elimination of chromosomes or a whole genome arises as a result of certain crosses; fertilization occurs but soon afterwards the genome of one parent is preferentially eliminated. Haploids can be produced by interspecific hybridization followed by chromosome elimination. In barley, this wide hybridization method consists of crossing cultivated barley, Hordeum vulgare (2n = 2x = 14) with the wild, diploid cross-pollinated perennial Hordeum bulbosum (2n = 2x = 14). Most progeny (95%) are barley haploids, while the remainder is made up of diploid hybrids. This technique, called the bulbosum method, has been extensively utilized for the production of haploids in barley. Haploids can also be produced in hexaploid wheat (var. Chinese Spring) by chromosome elimination following hybridization of wheat with H. bulbosum (both 2x and 4x). Frequencies of 13.7% grain set with 2x bulbosum and 43.7% grain set with 4x bulbosum were obtained (Barclay, 1975). During formation of the embryo the chromosomes of H. bulbo-
genetics and breeding programmes (Palmer sum are eliminated. The immature embryos
and Keller, 2005): are cultured in vitro and plantlets from these
monoploid embryos can be induced via an
Extensive hybridization crosses fol-
efficient chromosome doubling technique to
lowed by chromosome elimination
produce fertile flowers bearing homozygous
from one parent of a cross, usually the
hexaploid seeds.
pollination parent.
The production of embryos as a result
Gynogenesis: cultured unfertilized
of wheat maize crosses was first reported
isolated ovules and ovaries of flower
by Zenkteler and Nitzsche (1984). Laurie
buds develop embryos from cells of the
and Bennett (1986) cytologically exam-
embryo sac.
ined embryos produced via this system and
Androgenesis: cultured anthers or iso-
found maize chromosomes to be preferen-
lated microspores undergo embryogen-
tially eliminated during the first three cell
esis or organogensis directly or through
divisions, leaving a haploid complement of
intermediate callus.
wheat chromosomes. This method was used
Parthenogenesis: development of an
in wheat haploid production and applied
embryo by pseudogamy, semigamy or
with some success in generating genetic
apogamy.
and mapping populations (Laurie and
Inducer-based approach: haploid-induc-
Reymondie, 1991). Mean frequencies of fer-
ing lines are used to produce haploids.
tilization, embryo formation, embryo germi-
nation and haploid regeneration of 83, 20,
Chromosome or genome elimination 45 and 8%, respectively have been reported
(Chen et al., 1999). Significant differences
Haploid embryos can be produced in plants in the percentage of embryo germination
after pollination by distantly related spe- and haploid regeneration were observed
cies. In most cases, normal double fertiliza- among crosses suggesting that the efficacy
tion takes place to form a hybrid zygote and of haploid production could be improved by
selection of more responsive parents. Eighty per cent of haploid plants were doubled and had a normal seed set; however, only 6% produced viable progeny. Ultimately two DH green plants per pollinated head were obtained on average. The frequency of haploid regeneration was increased from 35 to 50% in the winter 2000 study using a pre-cold treatment of embryos.

Factors that have been reported to affect the production of haploids by the chromosome elimination approach include genotype, temperature during growth (higher temperature resulting in a higher rate of elimination), genome ratio of parental lines and others. Factors affecting the efficacy of DH production in the wheat × maize system include: (i) expertise and consistency in protocol implementation; (ii) control of temperature and light regimes for optimal plant growth and reproduction in both wheat and maize; (iii) wheat F1 genotype differences; (iv) timing of 2,4-dichlorophenoxyacetic acid (2,4-D) treatment; and (v) growth stage at which colchicine is applied.

Compared to anther culture, the wheat × maize system (sometimes called the maize pollen method) has three advantages: less genotype-dependent response, greater efficacy and a shorter time requirement. Based on Kisana et al. (1993), the maize pollen method is about two to three times more efficient than anther culture. In the study by Chen et al. (1999), twice as many green plants were regenerated (mean = 7.54%) using the maize pollen method than anther culture. Kisana et al. (1993) reported that aneuploids or gross chromosomal abnormalities were not observed and confirmed that chromosome variations were not common in wheat × maize-derived plants. They also concluded that this technique could save 4–6 weeks in obtaining the same age haploid green plants.

The cross incompatibility barrier in wheat has been successfully overcome by using maize pollen. The wheat × maize technique is currently being used as an alternative to the bulbosum technique and anther culture for wheat haploid production. In order to use the wheat × maize system in practical breeding programmes, further enhancement of embryo formation, germination and green-plant regeneration and doubling is needed. Some green plants will die during colchicine-induced chromosome doubling and during transplantation of the colchicine-treated seedlings to the field; therefore, the final population size may be too small to represent a sufficient number of possible genotypes to make selection effective. In addition, application of 2,4-D is crucial and without it there may be no seed set or embryo formation. Of the various methods tested, the use of spikelet culture offers a practical and versatile alternative for the production of wheat polyhaploid using wheat × maize sexual crossing (Kaushik et al., 2004).

SOMATIC REDUCTION AND CHROMOSOME ELIMINATION. Cases are known where, either spontaneously or due to specific treatments, the chromosome number was reduced to half in the somatic tissues, a phenomenon described as somatic reduction or reductional mitosis. Early studies include that of Swaminathan and Singh (1958), who induced a haploid branch on a watermelon by irradiation of the seed used. This must have occurred by the reduction of chromosome number in the somatic tissue through an unknown mechanism (perhaps due to spindle organizer abnormalities). Similarly, in Sorghum vulgare, somatic tetraploid (2n = 4x) cells responded to colchicine treatment and gave rise to diploid cells which took over the growing point completely, thus giving rise to diploid individuals. There are also a number of other chemicals such as chloramphenicol and para-fluorophenylalanine (an amino acid analogue) which have sometimes been successfully used for production of haploids in a number of materials. Elimination of parental chromosomes has also been observed in somatically-produced wide hybrids. In these cases, the elimination tends to be irregular and incomplete, leading to asymmetric hybrids or cybrids (Liu, J.H. et al., 2005).

MECHANISM OF CHROMOSOME ELIMINATION. Several hypotheses have been presented to explain uniparental chromosome elimination during
hybrid embryo development in plants: for example, differences in timing of essential mitotic processes due to asynchronous cell cycles or asynchrony in nucleoprotein synthesis leading to a loss of the most retarded chromosomes. Other hypotheses propose the formation of multipolar spindles, spatial separation of genomes during interphase and metaphase, parent-specific inactivation of centromeres and, by analogy with the host-restriction and modification systems of bacteria, degradation of alien chromosomes by host-specific nuclease activity. Gernand et al. (2005) provide evidence for a novel chromosome elimination pathway in wheat × pearl millet hybrids that involves the formation of nuclear extrusions during interphase in addition to post-mitotically formed micronuclei. They found that the chromatin structure of nuclei and micronuclei was different and that heterochromatinization and DNA fragmentation of micronucleated pearl millet chromatin was the final step during haploidization.

The mechanism of chromosome elimination in Hordeum hybrids was studied by Subrahmanyam and Kasha (1975) and Bennett et al. (1976) and the following conclusions were drawn: (i) normal double fertilization occurs in interspecific crosses as confirmed by cytological study; and (ii) after fertilization there is a gradual and selective elimination of H. bulbosum chromosomes from nuclei of endosperm as well as embryo cells so that eventually haploid embryos are produced. A sudden shortage of proteins in the developing embryo and endosperm and the better ability of vulgare chromosomes to form spindle attachments relative to bulbosum chromosomes may be responsible for elimination of the bulbosum chromosomes. Other possible causes such as differences in mitotic cycle, congression during mitosis, etc. were ruled out by the authors.

It has also been demonstrated that the elimination of bulbosum chromosomes is under genetic control (Subrahmanyam and Kasha, 1975). The above-mentioned authors used primary trisomics and monotelotrisomics in crosses with tetraploid H. bulbosum and concluded that both arms of chromosome 2 and the short arm of chromosome 3 of H. vulgare are responsible for chromosome elimination, although their effect may be neutralized or offset if a sufficient dose of bulbosum chromosomes is available.

Ovary culture or gynogenesis

Ovary culture involves production of a haploid individual by culture of unfertilized ovaries to obtain haploid plants from egg cells or other haploid cells of the embryo; the process is known as gynogenesis. Under the appropriate culture conditions the unfertilized cell of the embryo sac develops into an embryo by as yet unknown mechanisms. Haploid plants generally originate from egg cells in most species (in vitro parthenogenesis) but in some species, e.g. rice, they arise chiefly from the synergids; in at least Allium tuberosum even antipodal cells produce haploid plants (in vitro apogamy) (Mukhambetzhanov, 1997).

Gynogenesis may occur either via embryogenesis or plantlet regeneration from callus. In rice 2-methyl-4-chlorophenoxyacetic acid (MCPA) generally leads to a small amount of protocorm-like callus formation from which shoots and roots regenerate, while picloram promotes embryo regeneration. In contrast, sugarbeet usually shows embryo development while in sunflower embryos regenerate following a callus phase. In general, regeneration from a callus phase appears, at least for the present, to be easier than direct embryogenesis.

Generally, gynogenesis has two or more stages and each stage may have distinct requirements. In rice, two stages, i.e. induction and regeneration, are recognized. During induction, ovaries are floated on a liquid medium containing low auxin levels and kept in the dark, while for regeneration they are transferred on to an agar medium containing a higher auxin concentration and incubated in the light.

Depending on the species, unfertilized ovules, ovaries or flower buds can be cultured. In some members of the Chenopodiaceae, Liliaceae and Cucurbitaceae, gynogenesis is the main route to DH production (Palmer and Keller, 2005). Even where anther or microspore culture is successful, gynogenetic
haploids have been produced, e.g. in barley, maize, rice and wheat.

San Noeum (1976) was the first to demonstrate that gynogenesis can be induced under in vitro conditions. She obtained gynogenic haploids using an ovary culture of H. vulgare. Subsequently, success has been obtained with many species, e.g. wheat, rice, maize, tobacco, petunia, gerbera, sunflower, sugarbeets, onions, rubber, etc. About 0.2–6% of the cultured ovaries show gynogenesis and one or two plantlets, rarely up to eight, originate from each ovary.

Embryogenic frequency is low in many cases, but relatively high frequencies have been reported in some cases (Alan et al., 2003; Martinez, 2003). The rate of success varies considerably with species and is markedly influenced by explant genotype so that some cultivars do not respond at all. In rice, japonica genotypes are far more responsive than indica cultivars. In most cases, the optimum stage for ovule culture is the nearly mature embryo sac, but in rice ovaries at the free nuclear embryo sac stage are the most responsive.

The culture response is still genotype dependent (Alan et al., 2003; Bohanec et al., 2003). Generally, whole flowers, ovaries and ovules attached to the placenta respond better in culture, but in gerbera and sunflower isolated ovules give a better response. Cold pretreatment (24–48 h at 4°C in sunflower and 24 h at 7°C in rice) of the inflorescence before ovary culture enhances gynogenesis.

The composition of the culture medium and stage of embryo sac development are important considerations for successful culture (Keller and Korzun, 1996). Growth regulators are crucial in gynogenesis and at higher levels they may induce callusing of somatic tissues and even suppress gynogenesis. Growth regulator requirements seem to depend on species. For example, in sunflower growth regulator-free medium is the best and even a low level of MCPA induces somatic calli and somatic embryos. But in rice, 0.125–0.5 mg l⁻¹ MCPA is optimal for gynogenesis. The sucrose level also appears to be critical; in sunflower 12% sucrose leads to gynogenic embryo production while at lower levels somatic calli and somatic embryos were also produced. Ovaries are generally cultured in the light but in some species at least, e.g. sunflower and rice, incubation in the dark favours gynogenesis and minimizes somatic callusing; in rice light may lead to the degeneration of gynogenic pro-embryos.

Ovary culture has two main limitations: (i) it is not successful in all species; and (ii) the frequency of responding ovaries and the number of plantlets per ovary is usually low. Therefore, anther culture is preferred over ovary culture; only in those cases where anther culture fails, e.g. sugarbeet and for male sterile lines, does ovary culture assume significance.

Anther culture or androgenesis

Anther culture or androgenesis is a process by which a haploid individual develops from a pollen grain. Anther culture is often the method of choice for DH production in crop plants (Sopory and Munshi, 1996). Good aseptic techniques are required but the methods are generally simple and applicable to a wide range of crops (Maluszynski et al., 2003). In general, haploid plants are generated in vitro from the microspores contained in the anther and require chromosome doubling treatments. The number of chromosomes in haploid plants can be doubled either naturally or by colchicine treatment.

The process involved in anther culture is poorly understood. Investigations have been hampered by the presence of the sporophytic anther wall that prevents direct access to the microspores contained within. This has become an important issue because although many species respond to anther culture, responsive genotypes can be a limiting factor, thus making it necessary to study, understand and manipulate microspore embryogenesis in order to develop genotype-independent methods (Forster et al., 2007). Many factors influence the production of anther-culture-derived plants including the physiological status of the donor plants, pre-treatment of anthers, developmental stage of the pollen,
components in the medium and culture conditions such as light, temperature and humidity. The constraints associated with this approach are the selective response of genotypes to the anther culture process or medium, the high rate of albino formation and somaclonal variation. These factors have been discussed by Taji et al. (2002) and are summarized in the following discussion.

The genotype of the donor plant plays a significant role in determining the frequency of pollen plant production. There are genotypes extremely recalcitrant to anther culture. In rice, for example, japonica cultivars are much easier to culture from anthers than indica cultivars. Genotype-dependency is a major constraint that limits the wide application of anther culture.

The culture medium plays a vital role since the requirements vary with the genotype and probably the age of the anther as well as the conditions under which donor plants are grown. The medium should contain the correct amounts and proportions of inorganic nutrients to satisfy the nutritional as well as physiological needs of the many plant cells in culture. Sucrose is considered to be the most effective carbohydrate source which cannot be substituted by other disaccharides. The concentration of sucrose also plays an important role in the induction of pollen plants. Activated charcoal is also added to the culture medium.

In addition to basal salts and vitamins, hormones in the medium are critical factors for embryo or callus formation. Cytokinins (e.g. kinetin) are necessary for induction of pollen embryos in many species of Solanaceae, except tobacco. Auxins, in particular 2,4-D, greatly promote the formation of pollen callus in cereals. For regeneration of plants from pollen calli, a cytokinin and lower concentration of auxin are often necessary.

Certain organic supplements added to the culture medium often enhance the growth of anther culture. Some of these include the hydrolysed products of proteins such as casein (found in milk), nucleic acids and others. Coconut milk obtained from tender coconuts is often added to tissue culture media as it contains a complex mixture of nucleic acids, sugars, growth hormones and some vitamins.

The physiological state of the parent plant plays a role in haploid production. In various plant species it has been shown that the frequency of androgenesis is higher in anthers harvested at the beginning of the flowering period and declines with plant age. This may be due to deterioration in the general condition of the plants, especially during seed set. The lower frequency of induction of haploids in anthers taken from older plants may also be associated with a decline in pollen viability. Seasonal variations, physical treatment and application of hormones and salts to the plant also alter its physiological status, which is reflected in changes in the anther response to culture.

Temperature and light are two physical factors which play an important role in the culture of anthers. Higher temperatures (30°C) yield better results. Temperature shocks also enhance the induction frequency of microspore androgenesis. Frequency of haploid formation and growth of plantlets are generally better in the light. Certain physical and chemical treatments given to flower buds or anthers prior to culture can be highly conducive to the development of pollen into plants. The most significant is cold treatment.

The developmental stage of pollen greatly influences the fate of the microspore. Androgenesis occurs when a microspore or pollen is induced to shift from a gametophytic pathway to a sporophytic pathway of embryo formation. Anthers of some species (Datura, tobacco) give the best response if pollen is cultured at first mitosis or later stages (postmitotic), whereas in most others (barley, wheat, rice) anthers are most productive when cultured at the uninucleate microspore stage (premitotic). Anthers at a very young stage (containing microspore mother cells or tetrads) or a late stage (containing binucleate, starch-filled pollen) of development are generally ineffective, although some exceptions are known.

Barley and rice are considered to be model cereal crops for androgenesis. The application of barley anther culture protocols
to other cereals such as wheat yielded a low frequency of green plants. Although a high frequency of green plants is produced for most barley crosses, androgenesis still poses some problems that need to be addressed. There are barley genotypes which are extremely recalcitrant to microspore division and/or with a high rate of albinism. The rate of embryogenesis is still low and poorly-developed embryos are formed very frequently. New methods are needed that reduce the cost of DH production and are effective for all genotypes.

Future objectives in plant androgenesis include the development of efficient androgenesis protocols for a wide range of genotypes, a better understanding of the biological processes involved in the stress pre-treatment, the study of the influence of different micronutrients on the induction of gametic embryogenesis and possible gametophytic selection. Identification of genetic loci associated with the anther culture response process will facilitate the understanding of the mechanisms underlying androgenesis. Identification and localization of molecular markers linked to the yield of green plants per anther and the evaluation of their potential use for the prediction of the anther culture response of genotypes will also help to optimize the production of DHs.

Semigamy

Semigamy is a form of parthenogenesis and occurs when the nucleus of the egg cell and the generative nucleus of the germinated pollen grain divide independently, resulting in a haploid chimera (a plant whose tissues are of two different genotypes). Semigamy is a type of facultative apomixis in which the male sperm nucleus does not fuse with the egg nucleus after penetrating the egg in the embryo sac. Subsequent development can give rise to an embryo containing haploid chimaeral tissues of paternal and maternal origins. In cotton, the semigametic phenomenon was first reported by Turcotte and Feaster (1963), who developed the Pima line 57-4 that produced haploid seeds at a high frequency. Currently semigamy is the only feasible means for production of haploids in cotton (Zhang and Stewart, 2004).

There are many examples of DH lines developed from cultivars and intra- and interspecific hybrids between upland cotton (Gossypium hirsutum L.) and American Pima cotton (Gossypium barbadense L.) using semigamy. The semigametic trait has also been transferred into different cotton cytoplasms to facilitate rapid replacement of nuclei. Stelly et al. (1988) proposed a scheme called hybrid elimination and haploid production system using a cotton strain with semigamy (Se), lethal gene (Le2dav), virescent (v7) and male sterility or glandless (gl2gl3).

Semigametic lines can produce 30–60% haploids when self-pollinated and about 0.7–1.0% androgenic haploids when used as female parents in crosses with normal non-semigametic cottons (Turcotte and Feaster, 1967). A unique feature of semigamy is that the inheritance of the gene is conveyed by both male and female gametes but expression of the trait in terms of haploid production occurs only in the female parent. As a consequence, for example, in reciprocal crosses between SeSe and sese parents, haploids will be produced only when SeSe or Sese is the female parent.

The results reported by Zhang and Stewart (2004) verified that semigamy in cotton is controlled by one gene, previously designated Se. The gene functions sporophytically and gametophytically resulting in an incomplete dominance mode of action. Consistent with the difference between the two parental isogenic lines, semigametic F2.3 lines had significantly lower chlorophyll content than non-semigametic F2.3 lines, an observation that was confirmed by a significant association between haploid production and chlorophyll content. The Se gene and the gene for reduced chlorophyll content could be either the same or closely linked.

Inducer-based approach

Haploid inducing lines have been used in maize to produce haploids by development of the unfertilized egg cells (Eder and
Chalyk, 2002). A haploid induction rate of up to 2.3% was detected by Coe (1959) in crosses with the inbred line Stock 6. A higher rate (about 6%) was later obtained by Sarker et al. (1994) and Shatskaya et al. (1994) in progenies of crosses between Stock 6 and Indian and Russian germplasm, respectively. Inducer lines are now available with haploid seed induction rates of 8–12% in temperate maize germplasm (Melchinger et al., 2005; Röber et al., 2005).

Segregation studies (Lashermes and Beckert, 1988; Deimling et al., 1997) and quantitative trait loci (QTL) analysis (Röber, 1999) demonstrated that in vivo haploid induction in maize is a quantitative trait under the control of a large but unknown number of loci. Individual QTL explained only small parts of the genetic variation.

Compared with other methods of DH production such as anther culture, the inducer-based approach is rather efficient, less dependent on the genotype and can be practised in almost every maize breeding programme without access to expensive laboratory facilities (Röber et al., 2005; http://www.uni-hohenheim.de/ipspwww/350a/linien/indexl.html).

Requirements for in vivo DH production in practical breeding include: (i) availability of inducer genetic stocks; (ii) high induction rate; (iii) the inducer is a good pollinator; (iv) reproducibility with reasonable seed quantities; (v) availability of a marker system that is independent of the genetic background of the female and of environmental effects and can be used for effective and unambiguous identification of haploid kernels; and (vi) availability of an artificial chromosome doubling system with high doubling rates that is safe, simple and cost-effective.

Since the late 1990s, these requirements have been partially met in maize with: (i) inducer lines (e.g. RWS and UH400 developed at the University of Hohenheim) with improved induction rates of 10% or higher; (ii) a combination of two dominant markers (anthocyanin colour of endosperm and embryo for identification of haploids and anthocyanin coloration of stalk for identification of false positives in the field); and (iii) improved chromosome doubling systems using colchicine that gave a doubling rate of greater than 10%.

A scheme to show in vivo haploid induction includes the following steps:

• Creating new variation by intercrossing with selected lines.
• In vivo haploid induction in generation F1.
• Chromosome doubling of haploid seedlings:
  – selection of haploid kernels;
  – germination of kernels;
  – cutting of coleoptile;
  – doubling procedure: treatment of seedlings with colchicine;
  – planting of treated seedlings in greenhouse;
  – transplanting DH plants at the three-leaf stage to the field and selfing (generation D0); and
  – formation of testcross hybrids.
• Evaluation of testcrosses in multi-environment yield trials (two stages).

4.2.2 Diploidization of haploid plants

As described above, haploids can be produced through various approaches. Haploid plants may grow normally under in vitro or greenhouse conditions up to the flowering stage, but viable gametes are not formed due to the absence of one set of homologous chromosomes and consequently, there is no seed set.

The only mechanism for perpetuating the haploids is by duplicating the chromosome complement in order to obtain homozygous diploids. In pollen-derived plants duplication of chromosomes may occur spontaneously in cultures. However, the spontaneous chromosome doubling rate of haploids is usually low. In maize, for example, the rate ranges from 0 to 10% (Chase, 1969; Beckert, 1994; Deimling et al., 1997; Kato, 2002). Therefore, it is necessary to diploidize the haploids by chemical means. Thus, artificial chromosome doubling (diploidization) is necessary for the efficient large-scale use of haploid plants.
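
The induction and doubling rates quoted for maize can be combined into a rough estimate of how many DH lines an in vivo programme might deliver. The Python sketch below is only an illustration and is not taken from any of the cited protocols: the function names are hypothetical, and it assumes that the expected yield is simply kernels × induction rate × doubling rate, ignoring losses during kernel selection, germination and transplanting.

```python
from fractions import Fraction
import math


def _combined_rate(induction_pct, doubling_pct):
    """Combined success rate as an exact fraction (both inputs are percentages)."""
    # Fraction(str(x)) keeps values such as 2.3 exact instead of using binary floats.
    induction = Fraction(str(induction_pct)) / 100
    doubling = Fraction(str(doubling_pct)) / 100
    return induction * doubling


def expected_dh_lines(kernels, induction_pct, doubling_pct):
    """Expected number of fertile DH plants from `kernels` induction-cross kernels."""
    return float(kernels * _combined_rate(induction_pct, doubling_pct))


def kernels_needed(target_dh_lines, induction_pct, doubling_pct):
    """Smallest number of kernels whose expected DH yield reaches the target."""
    rate = _combined_rate(induction_pct, doubling_pct)
    return math.ceil(Fraction(target_dh_lines) / rate)


if __name__ == "__main__":
    # Assumed scenario: 10% haploid induction and 10% successful doubling.
    print(expected_dh_lines(10_000, 10, 10))
    print(kernels_needed(100, 10, 10))
    # Same doubling rate, but the 2.3% induction rate reported for Stock 6.
    print(kernels_needed(100, 2.3, 10))
```

Under these assumptions only about one kernel in a hundred ends up as a fertile DH plant at 10% induction and 10% doubling, which suggests why both rates, and an unambiguous haploid marker system, weigh heavily on the size of the induction crosses a programme has to make.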
Chromosome doubling is thought to occur by one or more of four mechanisms, namely endomitosis, endoreduplication, C-mitosis or nuclear fusion (Jensen, 1974; Kasha, 2005). Endomitosis is described as chromosome multiplication and separation but failure of the spindle leads to one restitution nucleus with the chromosome number doubled. It has also been called nuclear restitution. Endoreduplication is duplication of the chromatids without their separation and leads to diplo-chromosomes or to polytene chromosomes if many replications occur. Endoreduplication is a common feature in specialized plant cells where cells become differentiated or enlarged, or in cells that are very active in metabolite production. C-mitosis is a specific form of endomitosis where, under the influence of colchicine, the centromeres do not initially separate during metaphase while chromosome arms or chromatids do separate. Nuclear fusion occurs when two or more nuclei divide synchronously and develop a common spindle. Thus, two or more nuclei could result with doubled, polyploid or aneuploid chromosome numbers.

A simple procedure designed to achieve diploidization involves immersion of very young haploids in a filter-sterilized solution of colchicine (0.4%) for 2–4 days, followed by their transfer to the culture medium for further growth. In maize, the highest doubling rates are achievable by immersing 2–3-day-old seedlings in a colchicine solution as suggested by Gayen et al. (1994). Using an improved version of this method, Deimling et al. (1997) obtained doubling rates of up to 63%. The studies of Eder and Chalyk (2002) using genetically broader materials yielded an average doubling rate of 27%. Optimized methods for colchicine treatment of haploid seedlings yield average success rates of about 10% fertile diploid plants with satisfactory seed set (Mannschreck, 2004). In this procedure chromosome or gene instabilities are minimal compared to other methods of colchicine or chemical treatment.

4.2.3 Evaluation of DH lines

Randomness

Systems used to produce DH lines should not have preference to specific gametes, which means each gamete should have the same probability of developing into a haploid. Chromosome elimination using the bulbosum approach in barley is usually a random process and there is no significant segregation distortion associated with it. Park et al. (1976) and Choo et al. (1982) did not find any gamete preference associated with this approach by comparing the DH and single seed descent (SSD) populations. In rice, however, especially for the DH populations derived from anther culture of distant crosses, distorted segregations were found for isozymes, restriction fragment length polymorphisms (RFLPs) and morphological traits. As a result, the segregation of two types of homozygotes deviated from the 1:1 ratio for many single gene loci (Chen, Y. et al., 1997).

Stability

Theoretically, DH lines have two properties: complete homogeneity and homozygosity within lines. Except for the variation that might be produced during anther culture or other generation processes, DH lines should be genetically stable and the mutation rate that can occur in DH lines should be in the same range as that of other true-breeding cultivars.

There are some reports that identified somaclonal variation associated with anther culture-derived DH lines (Chen, Y. et al., 1997) and theories that account for the origin of somaclonal variation (Taji et al., 2002). The variation within a DH line can be divided into two categories: (i) variation originating from the genetic heterogeneity of somatic cells of the source (haploid) plant; (ii) variation due to structural alterations of DNA and chromosomes caused by tissue culture.

Somaclonal variation is an important cause of instability of DH lines and is not restricted to, but is particularly common in, plants regenerated from callus. The variations can be genotypic or phenotypic which
Populations in Genetics and Breeding 125

in the latter case can be either genetic or epigenetic in origin. Typical genetic alterations are: changes in chromosome numbers (polyploidy and aneuploidy), chromosome structure (translocations, deletions and duplications) and DNA sequence (base mutations).

GENETIC VARIATION ARISING FROM SOURCE PLANTS. The source plants used to initiate cultures are likely to be heterogeneous with respect to the state of differentiation, ploidy level and age. These explant-related factors will affect the genetic make-up of the cells produced in the culture and thus the callus arising from such a group of cells with diverse genetic make-up will inevitably lead to a mixed population of cells. Depending on the cell types from which the plants originate, those regenerated from such a genetically mosaic callus will undoubtedly be of different genetic make-up. Taji et al. (2002) indicated that such genetic mosaicness seems to occur commonly in polyploid plants rather than in diploids or haploids.

GENETIC VARIATION ARISING DURING CULTURE. Although a significant degree of genetic variability can be traced to the genetically heterogeneous cell types of the explant, at least in polyploid species, there is substantial evidence to indicate that much of the variability observed in regenerated plants stems from the culture process itself. Aneuploids, polyploids or cells with structurally altered chromosomes may arise in culture. Many differentiated cells, when induced to divide in culture, undergo endoduplication of chromosomes resulting in the production of tetraploid or octaploid cells with distinct phenotypes.
Various phenomena have been observed in tissue culture of various plant species which explain the production of cells with unusual ploidy levels (Bhojwani, 1990). Occurrence of multi-polar spindles due to failure of spindle formation during cell division is one of the contributing factors. Absence of spindle formation during mitosis results in the appearance of cells with doubled chromosome number while the formation of multi-polar spindles or chromosomes lagging at anaphase cause the development of cell lines with haploid, triploid or other uneven ploidy status.
Many studies have indicated that cryptic structural modification of individual chromosomes is more likely to cause somaclonal variation than modification induced by ploidy changes in many tissue-cultured plants. Chromosomal changes occurring during tissue culture include transposition of mobile genetic elements (transposons), chromosome breakage and repositioning of chromosome segments.
As summarized by Taji et al. (2002), several mechanisms have been proposed to explain the genetic variability that occurs in tissue culture. The most likely causes are:
1. Reduced regulatory control of mitotic events in culture: the ploidy status of plants generated from callus, cell suspension or protoplast cultures of certain species differs significantly despite the fact that the cultures originate from a highly homogeneous genetic background. This indicates a lack of tight regulation of cell-cycle-related controls during cell proliferation in culture.
2. Use of growth regulators: plant growth regulators, particularly synthetic auxins such as 2,4-D, are considered to be the major cause of genetic variability in culture. For example, cytokinins at low concentrations have been shown to reduce the range of ploidy in culture while low levels of both auxins and cytokinins appear to preferentially activate the division of cytologically stable meristematic cells enabling the regeneration of genetically uniform plantlets.
3. Medium components: some of the mineral nutrients influence the establishment of genetic variability in culture. For instance, by altering the levels of phosphate and nitrogen as well as the form of nitrogen in the medium, the genetic composition (ploidy level) of the cultured cells can be controlled to a considerable extent. A marked increase in chromosome breakage has been observed in plant cell cultures grown with different levels of magnesium or manganese.
126 Chapter 4

4. Culture conditions: some culture conditions, such as incubation temperatures above 35°C and long duration of culture, have been implicated in inducing genetic variability in regenerated plants.
5. Inherited genomic instability: molecular studies indicate the existence of certain regions of the genome that are more susceptible to tissue-culture-induced structural alterations, although the reason for the increased susceptibility of these genomic loci, known as hot spots, is not fully understood.

CAUSES OF EPIGENETIC VARIATION IN TISSUE CULTURE. Any culture-induced changes which are stable but not heritable have frequently been considered as epigenetic variation. However, a greater understanding of genetic and epigenetic alterations in tissue culture in the recent past has led to a clear distinction between these two types of variation. For instance, genetic mutations occur randomly and at a much lower rate than epigenetic variations. Genetic changes are usually stable and heritable. Epigenetic variation may also lead to stable traits; however, reversal can occur at high rates under non-selective conditions. Epigenetic traits are often transmitted through mitosis in a stable manner but rarely through meiosis and the level of induction of epigenetic traits is directly related to the selection pressure experienced by the cells. Epigenetic changes are generally assumed to reflect alteration in expression rather than in the information content of genes.
As Taji et al. (2002) summarized, the epigenetic variation observed in cultured cells or regenerated plants is mainly due to three cellular events: (i) gene amplification; (ii) DNA methylation; and (iii) increased activity of transposable elements. In plants, nearly 25% of the genome can be methylated at cytosine residues but the significance of this cytosine methylation is not apparent. It has been suggested that methylation (and demethylation) of DNA is one of the ways of controlling transcriptional activity and that this process can be affected by the tissue culture process. The non-heritable genetic variability observed in many tissue culture systems could thus be attributed to tissue-culture-induced methylation or demethylation of DNA. The activity of transposons and retrotransposons induced by tissue culture could also be responsible for some of the genetic and epigenetic variability observed in culture.

4.2.4 Quantitative genetics of DHs

DH lines that are derived randomly from an array of gametes produced by F1 plants are very useful in quantitative genetics. Compared with diploid genetic models for populations such as F2, F3 or BC, there are no dominance or dominance-related epistasis effects involved in the genetic model of DH populations. As a result, additive, additive-related epistasis and linkage effects can be investigated properly. As a permanent population, DH lines can be replicated as many times as desired across different environments, seasons and laboratories, providing endless genetic material for phenotyping and genotyping, particularly for understanding the genotype-by-environment interaction. In DH populations, the additive component of genetic variance is larger than that of diploid populations such as F2 and BC. Choo et al. (1985) discussed in detail the quantitative genetics associated with DH populations, including detection of epistasis, estimation of genetic variance components, linkage test, estimation of gene numbers, genetic mapping of polygenes and tests of genetic models and hypotheses. Röber et al. (2005) compared the expected gain from selection for DH lines and other populations and the implications of epistatic effects, which are briefly described here.

Expected gain from selection

As is well known from quantitative genetics (see e.g. Falconer and Mackay (1996) and also Chapter 1), the expected gain from selection can be described by $G = i\,h_x\,r_G\,\sigma_y$, where $i$ is the selection intensity, $h_x$ the square root of the heritability of the selection
criterion, $r_G$ the genetic correlation between selection criterion and gain criterion and $\sigma_y$ the standard deviation of the gain criterion. In long-term breeding programmes, the decisive gain criterion for evaluating selection progress in hybrid breeding is the general combining ability (GCA) of the improved lines. At the beginning of a breeding cycle the test units are the DH lines per se and later on in the cycle their testcrosses.
Strong selection (large $i$) leads to a small effective population size and consequently to a loss of genetic variance due to random drift. To keep this loss within certain limits, a minimum number of lines should be recombined after each breeding cycle. This number depends on the inbreeding coefficient (F) of the candidate lines. The number should be (2F) times larger for inbred lines than for non-inbred genotypes. Assuming that S2 lines (F = 0.75) are recombined in conventional breeding, the number of DH lines (F = 1) would have to be increased 1:0.75 = 1.33-fold to preserve an equivalent level of genetic variation. This means that the selection intensity must be reduced accordingly when using DH lines.
In contrast to the selection intensity, $h_x$ and $r_G$ increase when using DH lines. This increase is particularly large in the first testcross stage. Neglecting epistasis, the GCA variance of inbred lines is equal to $\tfrac{1}{2} F \sigma_A^2$ (Falconer and Mackay, 1996), where $\sigma_A^2$ is the additive variance of the base population. Thus the GCA variance of DH lines is 1:0.75 = 1.33 times larger than that of S2 lines. This leads to better differentiation among the testcrosses and consequently to higher heritability. Seitz (2005) compared three sets of S2 and S3 lines each with DH lines derived from the same crosses and evaluated with the same testers in the same environments. On average, the estimated genetic testcross variances for grain yield (bu acre$^{-1}$) amounted to 50, 94 and 124 for S2, S3 and DH lines, respectively.
The genetic correlation between selection and gain criterion ($r_G$) also increases with the degree to which the tested lines have been inbred. For example, the correlation between $S_t$ lines and their homozygous progenies for GCA is equal to $\sqrt{F_t}$ whereas for DH lines this correlation is 1. Thus compared with S2 lines, the correlation of DH lines is $1:\sqrt{0.75} = 1.15$ times stronger.

Implications of epistatic effects

Epistatic gene action may positively or negatively affect hybrid performance (Lamkey and Edwards, 1999). In most cases, epistatic effects have been reported to cause a decrease in the testcross performance of segregating generations (Lamkey et al., 1995) or to penalize three-way and double crosses compared to their non-parental single crosses (Sprague et al., 1962; Melchinger et al., 1986). These effects are commonly referred to as recombinational loss and may be explained by a disruption during meiosis of co-adapted gene arrangements assorted by prior natural and artificial selection. Marker-based analyses of QTL partially corroborate this hypothesis (Stuber, 1999). To avoid recombinational loss and still offer a chance to select for new positive interactions, a balance between recombination and fixation of gene arrangements is needed. The DH-line approach might offer the method for achieving this goal as homozygosity can be reached in one cycle of recombination when F1 is used for DH development or in different cycles when segregating populations of different generations are used.

4.2.5 Applications of DH populations in genomics

In genetics, DHs may serve to recover recessives. Using DHs, linkage data can be obtained directly by sampling gametes as monoploids. DHs are ideal for the study of mutation frequencies and spectra. As DHs represent homozygous, immortal and true breeding lines, they can be repeatedly phenotyped and genotyped so phenotypic and genotypic information can be accumulated over years and across laboratories. In genomics, DHs are therefore ideal for studying complex traits that are quantitatively inherited, which may require replicated trials over many years and locations for accurate phenotyping.
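The S2 versus DH comparison made under "Expected gain from selection" in Section 4.2.4 can be checked numerically. The short Python sketch below is illustrative only: the function names are hypothetical, and it assumes, as that section does, no epistasis, a GCA variance of one-half F times the additive variance, and a correlation equal to the square root of F between the GCA of lines and that of their homozygous progenies.

```python
import math


def gca_variance(F, sigma2_A=1.0):
    """GCA variance of lines with inbreeding coefficient F (epistasis neglected)."""
    return 0.5 * F * sigma2_A


def gca_correlation(F):
    """Correlation between the GCA of lines with coefficient F and that of
    their homozygous progenies (assumed to equal sqrt(F))."""
    return math.sqrt(F)


F_S2, F_DH = 0.75, 1.0  # inbreeding coefficients of S2 and DH lines

# Ratios are independent of the (arbitrary) additive variance used here.
variance_ratio = gca_variance(F_DH) / gca_variance(F_S2)
correlation_ratio = gca_correlation(F_DH) / gca_correlation(F_S2)

print(f"GCA variance, DH relative to S2: {variance_ratio:.2f}")
print(f"Correlation with final GCA, DH relative to S2: {correlation_ratio:.2f}")
```

Because both quantities depend only on F, the same two functions can be reused for S3 or later selfing generations by substituting the appropriate inbreeding coefficient.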
DH populations are desirable genetic materials for genetic mapping including the construction of genetic linkage maps and gene tagging using genetic markers. They can be produced relatively rapidly, requiring 1–1.5 years to become established after the initial cross and they provide an ongoing population that can be used indefinitely for mapping. QTL analysis is facilitated by using DH mapping populations and the homozygosity of DHs enables accurate phenotyping by replicate trials at multiple sites (Forster and Thomas, 2004). In addition, in DH populations, dominant markers are as efficient as co-dominant markers because linkage statistics are estimated with equal efficiency (Knapp et al., 1995). DHs can also be used to increase the expression level of a transgene (Beaujean et al., 1998).
Only the application of DHs for construction of genetic linkage maps will be discussed here. Assuming that the two parental lines used for production of DH populations have the genotypes P1 (AABB) and P2 (aabb), their F1 will produce four types of gametes, AB, Ab, aB and ab. As a result of single-sex production, these gametes produce four types of haploids and by chromosome doubling will produce four types of DHs: AABB, AAbb, aaBB and aabb. When A-a and B-b are independent (not linked), the four types of DH lines are present in identical proportions: 25%. The segregation of two loci in a DH population is shown in Fig. 4.1, among which AB and ab are parental gametes and Ab and aB are recombinant gametes while AABB and aabb are genotypes for parental lines and AAbb and aaBB are genotypes for recombinants. It is expected that for each molecular marker there are two parental genotypes in DH populations and in any DH line only one of the parental bands revealed by markers will show up.
A general step before map construction and gene mapping is to evaluate the DH population. In rice, 66 DH lines were derived from the F1 between indica Apura and upland japonica Irat 177 by anther culture. Heterozygosity was found for some loci with two parental bands while non-parental alleles (or new alleles) were found for other loci. The limitations of using this DH population in genetic linkage mapping do not result from the partial heterozygosity or new alleles but from the low RFLP polymorphism identified between the parents (S.R. McCouch, Cornell University, personal communication). Only 40% of the 100 tested RFLP markers detected polymorphism. Of the markers that had been mapped on to an F2 population, IR34583/Bulu, only 55% were polymorphic between Apura and Irat 177. However, a relatively saturated molecular map can be established if other types of molecular markers such as simple sequence repeats (SSRs) or single nucleotide polymorphisms (SNPs) are used.
In barley, a DH population consisting of 113 lines was derived by anther culture from the F1 hybrid between two spring barley cultivars, Proctor and Nudinka (Heun et al.,

                      P1 (AB/AB)  ×  P2 (ab/ab)
                                ↓
                           F1 (AB/ab)
                                ↓
Gamete                  AB         Ab         aB         ab
Haploid                 AB         Ab         aB         ab
Doubled haploid         AB/AB      Ab/Ab      aB/aB      ab/ab
Ratio (independent)     25%        25%        25%        25%
Ratio (linkage)         (1 − r)/2  r/2        r/2        (1 − r)/2

Fig. 4.1. Segregation of two genetic loci in a DH population.
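
The ratios in Fig. 4.1 can be expressed as a small calculation. The Python sketch below is a hedged illustration rather than a method taken from the text: the function names and the example counts are invented, and the estimator shown (the share of recombinant lines among all DH lines) is the usual simple estimator for such a population, assuming no segregation distortion.

```python
def dh_genotype_frequencies(r):
    """Expected frequencies of the four DH genotypes from an AB/ab F1,
    where r is the recombination frequency between the two loci."""
    if not 0.0 <= r <= 0.5:
        raise ValueError("r must lie between 0 and 0.5")
    parental = (1.0 - r) / 2.0
    recombinant = r / 2.0
    return {"AABB": parental, "AAbb": recombinant,
            "aaBB": recombinant, "aabb": parental}


def estimate_r(counts):
    """Estimate r as the proportion of recombinant (AAbb, aaBB) DH lines."""
    total = sum(counts.values())
    return (counts["AAbb"] + counts["aaBB"]) / total


independent = dh_genotype_frequencies(0.5)   # unlinked loci: 25% each
linked = dh_genotype_frequencies(0.1)        # linked loci, r = 10%

# Invented genotype counts for 100 hypothetical DH lines.
example_counts = {"AABB": 45, "AAbb": 5, "aaBB": 5, "aabb": 45}
r_hat = estimate_r(example_counts)

print(independent)
print(linked)
print(f"estimated r = {r_hat:.2f}")
```

With real data, distorted segregation of the kind reported for anther-culture-derived rice DH populations would violate the assumption behind this estimator.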

1991). A genetic map was constructed using 55 RFLP markers and two known genes and is the first complete molecular map to be constructed using DH populations in crops. Since then, many DH populations have been developed using the different approaches described above and have been used for map construction and genetic mapping.

4.2.6 Application of DHs in plant breeding

The benefits of DHs in plant breeding have been widely reviewed; readers should refer to Forster and Thomas (2004), Forster et al. (2007), and the five volumes on In Vitro Haploid Production in Higher Plants edited by Jain et al. (1996–1997).
Application of DHs in plant breeding can be described by comparison of the time required to obtain fixed inbreds relative to inbreeding, starting from a heterozygote:

            Selfing of a heterozygote     Haploids of a heterozygote
Gametes:    1/2 A + 1/2 a                 1/2 A + 1/2 a
F2          1/4 AA, 1/2 Aa, 1/4 aa        chromosome doubling
F3          1/4 Aa                        1/2 AA + 1/2 aa
F4          1/8 Aa
F5          1/16 Aa
F6          1/32 Aa
            1/2 AA + 1/2 aa

Apparently, the DH approach has a time reduction of three to four generations compared to inbreeding-based breeding. The DH approach features many logistical advantages simplifying breeding to a large extent and enabling evaluation of genetically fixed hybrid components from the very beginning of the selection process. Depending on the material, the costs and the breeding scheme adopted, the DH approach can reduce the time for development and commercialization of new inbred lines and lead to a higher expected genetic gain per unit of time.
As outlined above, DH lines extracted from a heterozygote or a segregating population represent immortalized, reproducible gametes that can be immediately evaluated to assess their true breeding potential for target traits. They have the following advantages and clear beneficial applications (Melchinger et al., 2005; Röber et al., 2005; Longin et al., 2006; W. Schipprack, University of Hohenheim, personal communication):
• providing the quickest possible route to complete homozygosity;
• giving an immediate product of stable recombinants from species crosses;
• no masking effects because of the high homogeneity attained in the first generation of DH populations;
• increased performance per se due to selection pressure in the haploid phase and/or during the first generation of DHs;
• complete genetic variance accessible from the very beginning of the selection process;
• easy integration of line/hybrid development with recurrent selection;
• reduced efforts in the nursery after the first multiplication of DH lines compared to a conventional breeding nursery;
• maximum genetic variance in line per se and testcross trials;
• high reproducibility of early-selection results;
• high efficiency in stacking specific targeted genes in homozygous lines; and
• simplified logistics for seed exchange between main and off-season programmes since each line is fixed and can be represented by a single plant.
DHs have been used in plant breeding programmes to produce homozygous genotypes in a number of important species, e.g. tobacco (Nicotiana tabacum L.), wheat, barley, canola (Brassica napus L.), rice and maize (Maluszynski et al., 2003), but only rarely in triticale, oat, rye and others. Research in crops such as rice, wheat and maize has shown that significant progress in haploid technology is attainable given an intensive research effort. Well-established methods in these crops have allowed major parts or whole breeding programmes to be based on DH production. Oat, triticale, wild barley, potato and cabbage are examples of crops where DH technologies are less advanced but in which hundreds of
DHs may still be obtained (Tuvesson et al., 2007). In other crops, including some vegetable species and forage and turf grass species, DH methods are being developed, but applications in crop improvement are rare. The DH approach has yet to be exploited in leguminous species, predominantly due to their cultivation in developing countries and consequent paucity of research funding. Difficulties have also been posed by the small anther size and relatively low number of microspores per anther in legume crops (Croser et al., 2006).
The DH technique offers an efficient tool for extracting individual gametes from heterozygous materials and transforming them into homozygous lines that can be reproduced ad libitum by selfing. DHs extracted from a heterogeneous population, e.g. landraces, represent immortalized, reproducible gametes that can be immediately evaluated to assess their true breeding potential for target traits. They can also serve as source material for breeding programmes of hybrids and synthetics. Furthermore, DH lines may be used for long-term conservation of heterogeneous germplasm resources such as landraces without the risk of genetic drift and other changes in gene frequencies, as well as for in-depth characterization of the breeding potential of each heterogeneous germplasm collection because each of the extracted DH lines can be evaluated in replicated trials in diverse environments.
With some DH methods, only a tiny fraction of the haploid seedlings will germinate and survive to the adult stage due to the uncovered genetic load and the stress in plant development exerted by colchicine treatment for chromosome doubling. Nevertheless, because the DH technique is rather simple, it is feasible to generate and identify large numbers of haploid seeds, treat them with colchicine and transplant them to the field. Hence, by starting with a sufficiently large number of haploid seeds it is possible to generate hundreds of viable DH lines with acceptable agronomic performance.
DHs are especially important in the evaluation of diversity, because they fix rare alleles and aid the efficient selection for quantitative traits in breeding. In outcrossing species, DHs enable undesirable recessive genes to be eliminated from lines at any breeding stage (Forster and Thomas, 2004).
Development of DHs through anther culture has been very successful with many cultivars released in barley breeding worldwide and in rice breeding in China since the 1970s. The production of DHs has become the preferred tool in many advanced plant breeding institutes and commercial companies for breeding many crop species. Due to the obvious advantages of DH lines and the enhancements made in in vivo haploid induction in recent years, many commercial breeding companies such as AgReliant, Monsanto and Pioneer are presently adopting or are already routinely using this technology in their maize breeding programmes (Seitz, 2005). Recurrent selection for testcross performance using DHs has reduced the cycle length and improved genetic advance (Gallais and Bordes, 2007). In some companies in vivo haploid induction has more or less replaced conventional line development with up to 15,000 DH lines per year per breeding programme and over 100,000 DH lines per year across all programmes at costs of US$10 or less per DH line. The first maize hybrids produced using DH lines have been commercialized in the USA and Europe (W. Schipprack, University of Hohenheim, personal communication). However, the development of new, more efficient and cheaper large-scale production protocols has meant that DHs have also recently been applied in less advanced breeding programmes.

4.2.7 Limitations and future prospects

Genetics and breeding in DHs have not given the desired and expected dividends, despite the substantial investments made in haploid research since the late 1980s. Some of the widely recognized limitations of DH breeding are as follows: (i) haploids cannot be obtained in the high frequency
required for selection in most important crop species; (ii) the cost–benefit ratio in DH breeding is often not favourable, thus discouraging its use despite the obvious advantages; (iii) haploids and DHs will express recessive deleterious traits and deleterious mutations may arise during the DH development process including anther culture, particularly for open-pollinated species; (iv) different ploidy levels may be available so that haploid status may need to be confirmed cytologically; alternatively, pollen culture may be necessary, which is expensive and has a relatively low success rate and is also genotype-dependent in many species; (v) doubled haploidy may also decrease genetic diversity, which is better maintained in heterozygous lines; (vi) the success of the DH method is highly genotype dependent, so is not yet suitable for all breeding programmes; (vii) some techniques, e.g. inducers in maize (especially the good ones), are proprietary and not available to all interested breeders; and (viii) health and legal concerns related to handling the chemical doubling agents.
The Third International Conference on Haploids in Higher Plants (12–15 February 2006, Vienna, Austria) highlighted the following issues that are important to future studies on DHs: (i) new methods of haploid and DH plant formation; (ii) mechanism of initiation of haploids; (iii) application of haploid cells, gametes, haploid and DH plants in fundamental and applied science; (iv) genes controlling haploid formation from female and male gametes; and (v) methods of diploidization of haploids.

4.3 Recombinant Inbred Lines (RILs)

Recombinant inbred lines or random inbred lines (RILs) are usually a part of the ultimate products of many breeding programmes and are also used as genetic materials. They can be produced by various inbreeding procedures. To help understand the whole process of development and applications of RILs, the inbreeding procedure and its effects will be discussed first.

4.3.1 Inbreeding and its genetic effects

RILs result from continuous inbreeding such as selfing or sibmating starting from an F2 population until homozygosity is reached. There are two genetic responses to inbreeding, gene recombination and genotype homogenization. Starting from a heterozygote at a locus A-a, for example, selfing will produce three genotypes, AA, Aa and aa. With continuous selfing, two homozygotes, AA and aa, will not segregate, while the heterozygote Aa will continue to segregate producing the three genotypes. However, the proportion of heterozygotes in the population will decrease with continuous selfing and will approach zero. This process can be described as below.
Consider one locus with two alleles, A and a, undergoing continuous selfing. With each generation of selfing the proportion of heterozygotes is halved, and the half that is lost is added to the homozygotes. At generation $t$, the proportion of heterozygotes in the population will be $(1/2)^t$, while the proportion of homozygotes will be $1 - (1/2)^t$; the homozygotes AA and aa each account for $[1 - (1/2)^t]/2 = (2^t - 1)/2^{t+1}$ (Table 4.1).
When two or more loci, for example $k$ loci, are involved, successive selfing from F1 hybrids will produce a proportion $(1/2)^{tk}$ of individuals heterozygous at all $k$ loci and a proportion $[1 - (1/2)^t]^k = [(2^t - 1)/2^t]^k$ of individuals homozygous at all $k$ loci at generation $t$. The more loci are involved, the longer it takes to reach homozygosity (Fig. 4.2). In the seventh generation of selfing starting from a heterozygous hybrid, for example, the proportion of homozygotes will be 99% for the population with one heterozygous locus involved, 96% for the population with five heterozygous loci involved, 89% for 15 loci, 79% for 30 loci and 46% for 100 loci.
If heterozygous loci are linked, successive inbreeding can still produce a homozygous population. However, the rate of approach to homozygosity depends on the recombination frequencies between the linked loci. The lower the recombination frequency, the higher the proportion of homozygotes in the population and the more rapidly the population becomes homogenized. If the recombination frequency, $r$,
Table 4.1. Genotypes derived from a single-locus heterozygote and their frequencies in selfing generations.

             Genotype                                                        Frequency of     Frequency of
Generation   AA                    Aa                      aa                    heterozygotes    homozygotes (%)

0            -                     1                       -                     1                0
1            1/4                   2/4                     1/4                   1/2              50.0
2            3/8                   2/8                     3/8                   1/4              75.0
3            7/16                  2/16                    7/16                  1/8              87.5
4            15/32                 2/32                    15/32                 1/16             93.8
5            31/64                 2/64                    31/64                 1/32             96.9
10           1023/2048             2/2048                  1023/2048             1/1024           99.9

t            $(2^t - 1)/2^{t+1}$   $2/2^{t+1} = 1/2^t$     $(2^t - 1)/2^{t+1}$   $1/2^t$          $100 \times (1 - 1/2^t)$
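
The general row of Table 4.1 can be turned into a few lines of code. The following Python sketch is illustrative (the function name is hypothetical); it uses exact fractions so that the output can be compared directly with the tabulated values, bearing in mind that Python prints fractions in lowest terms (for example 1/2 rather than 2/4).

```python
from fractions import Fraction


def selfing_frequencies(t):
    """Genotype frequencies after t generations of selfing, starting from
    a population consisting entirely of Aa heterozygotes (generation 0)."""
    het = Fraction(1, 2) ** t          # heterozygotes are halved each generation
    hom_each = (1 - het) / 2           # the remainder is split between AA and aa
    return {"AA": hom_each, "Aa": het, "aa": hom_each}


for t in (0, 1, 2, 3, 4, 5, 10):
    freq = selfing_frequencies(t)
    homozygotes_pct = float(1 - freq["Aa"]) * 100
    print(t, freq["AA"], freq["Aa"], freq["aa"], f"{homozygotes_pct:.1f}%")
```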

[Figure 4.2: homozygotes (%, y-axis from 0 to 100) plotted against generations of selfing (x-axis from 1 to 12), with one curve for each of 1, 5, 10, 20, 40 and 100 loci.]

Fig. 4.2. Effects of generations and genetic loci on the proportion of homozygotes in self-pollinated populations (numbers of loci are 1, 5, 10, 20, 40, 100).
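
The curves summarized in Fig. 4.2 follow from the multi-locus expression for unlinked loci, and the linked case discussed in the accompanying text can be sketched in the same way. The Python code below is a hedged illustration with hypothetical function names. The first function implements the unlinked expression directly. The second is not given in the text: it is derived here on the assumption that an AB/ab double heterozygote produces gametes AB and ab at frequency (1 − r)/2 each and Ab and aB at frequency r/2 each, so that one round of selfing yields double homozygotes at frequency [(1 − r)² + r²]/2.

```python
def homozygous_at_all_loci(t, k):
    """Proportion of individuals homozygous at all k unlinked loci after
    t generations of selfing from a fully heterozygous hybrid."""
    return (1.0 - 0.5 ** t) ** k


def two_locus_homozygotes_after_one_selfing(r):
    """Proportion homozygous at both of two linked loci after selfing an
    AB/ab double heterozygote once (assumption: derived from gamete
    frequencies, not quoted from the text)."""
    return ((1.0 - r) ** 2 + r ** 2) / 2.0


# Seventh generation of selfing for different numbers of loci.
for k in (1, 5, 15, 30, 100):
    print(f"k = {k:3d}: {homozygous_at_all_loci(7, k):.0%}")

# One generation of selfing for two linked loci.
for r in (0.10, 0.20, 0.40, 0.45, 0.50):
    print(f"r = {r:.2f}: {two_locus_homozygotes_after_one_selfing(r):.4f}")
```

At r = 0 the second function returns 0.5, the single-locus value after one generation, and at r = 0.5 it returns 0.25, the value for two independent loci, which matches the limiting behaviour described in the text.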

is close to zero or two loci are completely linked, the rate of homogenization will be close to or equal to the rate for the population with one heterozygous locus. If $r$ is about 50%, the rate of homogenization will be about the same as that for the population with two heterozygous loci. It can be estimated that for two linked loci and after one generation of selfing, the proportion of homozygotes will be 41% for r = 10%, 34% for r = 20%, 26% for r = 40% and 25.25% for r = 45%.
Continuous inbreeding (e.g. selfing) results in the fixation of segregation so that the genetic combinations of two parental genomes represented in individual F2 plants are each represented by an RIL (Fig. 4.3). The genetic combinations of two parental genomes are fixed in a group of RILs.
For quantitative traits that are controlled by polygenes or multiple QTL, the mean value of the population will return to the average value of the parental lines because dominance and dominance-related epistasis will dissipate with increasing homogenization. The variance will also change with increasing homogenization but the direction of change will depend
[Figure 4.3: crossing scheme, P1 × P2 → F1 → F2 → successive selfing → RIL.]

Fig. 4.3. Production of RILs by successive selfings. Two parental lines, P1 and P2, are crossed to produce F1. The F1 is then selfed to produce F2. The selfing process continues until a certain level of homozygosity is reached. The end product consists of a set of RILs, each of which is a fixed recombinant of the parental lines.

on the effect of related genes and their interactions. Figure 4.4 shows the changes in mean value and variance in RIL populations derived by SSD under different genetic models.
In animals, RILs are usually stable for up to 20 generations of sibmating. Such long-term continuous sibmating results in such a low viability that it is very hard to maintain the population. The mouse was the first animal used for genetic mapping with RILs and its RIL population is relatively small. However, the problems associated with small population size can be ameliorated by using combined information from multiple sets of RILs. In plant species, by contrast, it takes about half of the time required in animals to obtain stable RILs through inbreeding. Also, maintaining RIL populations in plants can be achieved at much lower costs, making it possible to manipulate large-sized populations. For some plant species, such as tobacco and Brassica, however, self-incompatibility prohibits the production of RILs through inbreeding.

4.3.2 Development of RILs

RILs are the products of successive inbreeding. Based on reproduction systems and the degree of inbreeding, there are several types of procedures for developing RILs.
Full-sib mating: for outbreeding organisms, the most severe inbreeding is full-sib mating, i.e. mating between the offspring of the same parents. Because outbreeding organisms are highly heterozygous, they have to be inbred for several generations to approach homozygosity. The inbreeding parents can then be used to produce progenies that will be intermated to produce the next generation of progenies. This process will continue until the progenies are highly homozygous.
Selfing: for self-pollinated plants, cultivars are genetically homozygous so they can be used to produce hybrids directly, followed by successive selfing. There are two different procedures for the management of the progenies, bulking and SSD. In the bulking method, hybrids are bulk planted and harvested until F5 to F8 before they are planted by families.

Single seed descent (SSD) method

The SSD method was proposed by the Canadian scientist Goulden in 1941. Starting from F2, one or several seeds are harvested from each plant and planted to produce the next generation until F5 to F8. When most plants are homozygous, all the SSD seeds from each plant are harvested to produce RILs. Plant breeders use three procedures to implement the concept of SSD (Fehr, 1987).

SINGLE-SEED PROCEDURE. When the single-seed procedure is used, the size of the populations will decrease in each generation

because of lack of seed germination or failure of plant establishment to produce seed. It is necessary to decide on the number of inbred plants that are desired in the last generation and begin with an appropriate population size in the F2 generation. The single-seed procedure ensures that each individual in the final population traces to a different F2 individual. However, the procedure cannot ensure that a particular F2 will be represented in the final population because failure of any seed to germinate or generate a productive plant automatically eliminates that seed's F2 family.

Fig. 4.4. Change of mean (A) and variance (B) in RIL populations derived by SSD, plotted against generation (P to F6 for the mean, F2 to F6 for the variance). (I) Additive increasing alleles are completely dominant. (II) Additive without dominance effect. (III) Additive increasing alleles are completely dominant with complementary interaction. (IV) Additive increasing alleles are completely dominant with duplicate interaction.

SINGLE-HILL PROCEDURE. The single-hill procedure can be used to ensure that each F2 plant will have progeny represented in each generation of inbreeding. Progeny from individual plants are maintained as separate lines during each generation of inbreeding by planting a few seeds in a hill or row, harvesting self-pollinated seeds from the hill and planting them in another hill the following generation. An individual plant is harvested from each line when the population has reached the desired level of homozygosity.

With the single-hill procedure the identity of each F2 plant and its progeny can be maintained during self-pollination. When the identity of an F2 is maintained, the seed packet and hill must be properly identified with a line designation for planting and harvest.

MULTIPLE-SEED PROCEDURE. Use of the single-seed procedure requires that the size of the populations in F2 be larger than in later generations, due to lack of seed germination or plant establishment for seed set. Usually, two samples are harvested, one for planting in the next generation and one for a reserve. Researchers sometimes bulk two or three seeds from each plant during harvest. Part of the sample is planted and the remainder is reserved. The procedure is referred to as modified SSD. The number of seeds planted

and harvested each season depends on the number of lines desired from the population and the anticipated percentages of seed germination, seedling establishment and seed set.

Advantages and disadvantages of SSD procedures

Fehr (1987) summarized the merits of the SSD procedures and indicated the following advantages:

• They are an easy way to maintain populations during inbreeding.
• Natural selection does not influence the population unless genotypes differ in their ability to produce at least one viable seed each generation.
• The procedures are well suited to greenhouse and off-season nurseries where the performance of genotypes may not be representative of their performance in the area in which they are normally grown.

The disadvantages are: (i) artificial selection is based on the phenotype of individual plants, not on progeny performance, when SSD is used for cultivar development rather than genetic population development; and (ii) natural selection cannot influence the populations in a positive manner unless undesirable genotypes do not germinate or set any seed.

4.3.3 Map distance and recombinant fraction in RIL populations

Theoretically and in practice, no matter how many cycles of inbreeding are completed, some degree of heterozygosity will always exist in the RIL population. From the above discussion, we can estimate the remaining heterozygosity for each generation of inbreeding. In genetic mapping, nearly completely homozygous RILs are used. RILs have undergone several cycles of meiosis before fixation, which differs from F2 or BC populations where only one cycle of meiosis occurs. As a result, linked genes have more opportunities to recombine in RIL populations. This property was discovered by Haldane and Waddington (1931) by studying inbreeding populations. For tightly linked loci, the number of recombinants observed in RILs is twice that observed in populations with only one cycle of meiosis. At the beginning stage of genetic mapping, this multiple recombination in RILs makes it difficult to detect linkage. Once linkage relationships are roughly established among loci, the greater frequency of recombination makes it easy to detect non-allelism among loci. It also makes the estimation of genetic distances more accurate because the confidence interval for an estimated genetic distance is a function of recombination frequency. With the increased number of meiotic events, there are more opportunities to find recombinants between two tightly linked loci (Fig. 4.5).

Fig. 4.5. Relationship between map distance R (cM, horizontal axis) and recombinant frequency r (%, vertical axis) for RIL populations derived by continuous selfing (solid line, r = 2R/(1 + 2R)) and for populations that have undergone only one cycle of meiosis, e.g. F2 or BC (dashed line).

In populations that have undergone only one cycle of meiosis, recombinant frequency r (%) is linearly related to map distance R (cM), as indicated by the dashed line in Fig. 4.5. In RIL populations derived from selfing, r is almost equal to 2R when the map distance is small, which is indicated by the solid line and the formula R = r/(2 − 2r) (Fig. 4.5). For the RIL populations derived

from sibmating, the skew becomes more significant with r nearly equal to 4R when the map distance is small.

4.3.4 Construction of genetic maps using RILs

As each RIL, like a DH line, is inbred and thus can be propagated indefinitely, a panel of RILs has a number of advantages for genomic studies: (i) each line needs to be genotyped only once; (ii) multiple individuals can be phenotyped from each line to reduce individual, environmental and measurement variability; (iii) multiple invasive (destructive) phenotypes can be obtained on the same set of genomes; and (iv) as recombinations are more frequent in RILs than in populations with only one meiosis, greater resolution can be achieved in genetic mapping.

In genetic mapping with RIL populations, recombinant frequency should be converted into map distance using the formula R = r/(2 − 2r) proposed by Haldane and Waddington (1931). There are no mapping functions available for RIL populations to adjust for double crossover events as there are for populations with one cycle of meiosis as discussed in Chapter 2. When the map distance is within the range that allows confidence about linkage detection, recombinant frequency has a linear relationship with map distance (Fig. 4.5; Silver, 1985).

Non-linked loci may appear to be linked simply due to chance. Such false linkages can often be identified by checking whether a locus linked to one marker is also judged to be linked to other markers in the same linkage group and whether the suspected linkage found in one population can also be detected in other RIL populations. Mouse geneticists discussed the case where a linkage could not be established with certainty because of small population sizes, and Silver (1985) provided a table of the 95 and 99% confidence intervals for estimated map distances based on RILs derived from sibmating. At low rates of recombination, these intervals are relatively small when compared with those obtained from the binomial distribution for F2 and BC1 populations of comparable population size. According to Taylor (1978), RILs derived from sibmating were more powerful in the estimation of map distances than populations undergoing single meiosis when R < 12.5 cM. Based on Taylor's method, it can be inferred that RILs derived from self-pollination are more powerful in the estimation of map distance when R < 23 cM.

Because of the advantages of RILs, they have been receiving great attention in genomics research. Numerous RIL populations have been developed in plant species, especially in maize and rice. Burr et al. (1988) reported RFLP maps constructed using two maize RIL populations, T232 × CM37 and CO159 × Tx303. Among 334 mapped genetic loci, 220 were polymorphic in both populations. By comparing the map distances obtained from these two populations with each other and with publicly accepted map distances, they found that the distances could differ by as much as twofold in some cases. Although these differences were still within the range of confidence intervals, they might be due to the genetic difference in recombinant frequencies at specific chromosomal regions. In maize there is no significant polymorphism caused by chromosome rearrangement, except for chromosome 10. Therefore, it is not surprising that there was no significant difference in map distance between the two maize RIL populations. Table 4.2 provides some examples of RIL populations developed in maize (Burr et al., 1988) and in rice (Xu, Y., 2002) that have been widely used for linkage mapping and gene tagging.

4.3.5 Intermated RILs and nested RIL populations

Intermated RILs

The production of RILs allows for the accumulation of recombination breakpoints during the inbreeding phase. However, the accumulation in RILs is limited by the fact that each generation of inbreeding makes the recombining chromosomes more similar to one another so that meiosis ceases to

generate new recombinant haplotypes. As an alternative to RILs, Darvasi and Soller (1995) proposed randomly mating the F2 progeny of a cross between inbred founders and using successive generations of random mating to promote the accumulation of recombination breakpoints in the resulting advanced intercross lines or intermated recombinant inbred lines (IRILs). IRIL design has obvious appeal in its union of the advantages of both advanced intercross lines and RILs and has been employed in the production of mapping populations in several species. As a result, IRILs have become increasingly popular for QTL mapping.

Breeding designs for IRILs have been investigated by Rockman and Kruglyak (2008). Their results indicated that the simplest design, random pair mating with each pair contributing exactly two offspring to the next generation, performed as well as the most extreme inbreeding avoidance schemes in expanding the genetic map, increasing fine-mapping resolution and controlling genetic drift. Circular mating designs offered negligible advantages for controlling drift and gave greatly reduced map expansion. Random-mating designs with variance in offspring number are also poor at increasing mapping resolution. It is suggested that the most effective designs for IRIL construction are inbreeding avoidance and random mating with equal contributions from each parent to the next generation.

Table 4.2. Some examples of RIL populations developed in maize (Burr et al., 1988) and rice (Xu, 2002).

Species   Population                   Population size   No. markers (a)
Maize     T232 × CM37                  48
          CO159 × Tx303                160
          Mo17 × B73                   44
          PA326 × ND300                74
          CK52 × A671                  162
          CG16 × A671                  172
          Ch593-9 × CH606-11           101
          CO220 × N28                  173
Rice      9024 × LH422                 194               141
          CO39 × Moroberekan           281               127
          Lemont × Teqing              315               217
          IR58821 × IR52561            166               399
          IR74 × Jalmagna              165               144
          Zhenshan 97 × Minghui 63     238               171
          Asominori × IR24             65                289
          Acc8558 × H359               131               225
          IR1552 × Azucena             150               207
          IR74 × FR13A                 74                202
          IR20 × IR55178-3B-9-3        84                217

(a) Numbers of markers are those shown for the first generation of genetic maps; more markers have been added to many of these maps since then.

Multi-way or nested RIL populations

Using two or more RIL populations in genetic mapping provides several advantages: (i) polymorphisms that are not detected in one population may be detected in another; (ii) weak linkage identified in one population can be confirmed or excluded by using other populations; (iii) multiple populations with shared genetic data can be combined and considered as a single population to provide more reliable results; and (iv) multiple populations provide a wide spectrum of target loci across the genome since for quantitative traits the related loci with genetic differences between the parents are almost always different from population to population.
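
As a rough illustration of point (iii), recombinant counts scored for the same pair of markers in several RIL populations could be pooled before the recombinant frequency is converted into map distance. The Python sketch below is an illustration only, not a procedure given in this chapter. It uses the selfing relationship r = 2R/(1 + 2R) shown in Fig. 4.5 and its inverse R = r/(2 − 2r). The sibmating relationship r = 4R/(1 + 6R) is the standard Haldane and Waddington result and is an assumption here, since the text states only its small-distance limit (r nearly equal to 4R). The function names and the example counts are hypothetical, r and R are handled as fractions rather than percentages, and pooling assumes that all populations share the same underlying recombination fraction, which the maize comparison in Section 4.3.4 suggests may not always hold.

```python
"""Sketch: converting RIL recombinant frequency to map distance (fractions, not %)."""


def ril_recombinant_frequency(R, mating="selfing"):
    """Expected recombinant frequency r in RILs for a per-meiosis fraction R."""
    if not 0.0 <= R <= 0.5:
        raise ValueError("R must lie between 0 and 0.5")
    if mating == "selfing":
        return 2.0 * R / (1.0 + 2.0 * R)
    if mating == "sibmating":
        # Assumed standard result; the text only gives r close to 4R for small R.
        return 4.0 * R / (1.0 + 6.0 * R)
    raise ValueError("mating must be 'selfing' or 'sibmating'")


def map_distance_from_ril(r, mating="selfing"):
    """Invert the relationships above: recombinant frequency r -> fraction R."""
    if not 0.0 <= r <= 0.5:
        raise ValueError("r must lie between 0 and 0.5")
    if mating == "selfing":
        return r / (2.0 - 2.0 * r)
    if mating == "sibmating":
        return r / (4.0 - 6.0 * r)
    raise ValueError("mating must be 'selfing' or 'sibmating'")


def pooled_map_distance_cm(counts, mating="selfing"):
    """Pool (recombinant lines, total lines) pairs and return map distance in cM.

    Assumes every population shares the same true recombination fraction.
    """
    recombinants = sum(k for k, _ in counts)
    lines = sum(n for _, n in counts)
    if lines == 0:
        raise ValueError("at least one line must be scored")
    r = recombinants / lines
    return 100.0 * map_distance_from_ril(r, mating)


if __name__ == "__main__":
    # Hypothetical counts for one marker pair in two selfing-derived populations.
    example_counts = [(6, 100), (9, 150)]
    print(round(pooled_map_distance_cm(example_counts), 2))
```

With these invented counts the pooled recombinant frequency is 15/250 = 0.06, which corresponds to a map distance of about 3.2 cM under selfing, whereas reading r directly as a single-meiosis recombination fraction would give 6 cM.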

The Complex Trait Consortium for mouse proposed the development of a large panel of eight-way RILs (Complex Trait Consortium, 2004). An eight-way RIL, also known as the Collaborative Cross, is formed by intermating eight parental inbred strains followed by repeated sibling mating to produce a new set of inbred lines whose genome is a mosaic of the eight parental strains (Broman, 2005). Such a panel of RILs would serve as a valuable resource for mapping the loci that contribute to complex phenotypes in mouse and would support studies that incorporate multiple genetic, environmental and developmental variables into comprehensive statistical models of complex traits (Complex Trait Consortium, 2004). The genomes of eight founder strains are rapidly combined and are then inbred to produce finished RIL strains. Eight-way RIL strains achieve 99% inbreeding by generation 23. Each strain captures 135 unique recombinant events. With genetic contributions from multiple parental strains including several wild derivatives, the eight-way RILs will capture an abundance of genetic diversity and will retain segregating polymorphisms every 100-200 bp. This level of genetic diversity will be sufficient to drive phenotypic diversity in almost any trait of interest. An estimated 1000 strains will be required to guarantee high mapping resolution and detect extended networks of epistatic and gene-environment interactions. This estimate is based on the statisti-

ing variation in maize, 25 RIL mapping populations were created. Twenty-five diverse lines were selected to capture 80% of the nucleotide polymorphism in maize. In order to provide a uniform evaluation background, each line was crossed to a common parent, B73 (the standard US inbred), to form 25 RIL populations. Each of these RIL populations has at least 200 RILs, each descended from a unique F2 plant, resulting in a total of 5000 RILs. Using SSD and low density planting, 88% success in advancing lines per generation was achieved. This has developed as an integrated mapping approach, called Nested Association Mapping (NAM), which exploits simultaneously the advantages of linkage analysis and association (or linkage disequilibrium, LD) mapping as discussed in Chapter 6. The power of NAM for genome-wide QTL mapping has been demonstrated by computer simulation with varied numbers of QTL and trait heritabilities (Yu et al., 2008). With a dense coverage (2.6 cM) of common-parent-specific (CPS) markers, the genome information for 5000 RILs can be inferred based on the parental genome information. Essentially, the linkage information captured by the CPS markers and the LD information among loci residing between CPS markers was then projected to RIL based on parental information, ultimately allowing for genome-wide high-resolution mapping. The power of NAM with 5000