Plant Breeding
In Memoriam
Norman Ernest Borlaug
(25 March 1914 – 12 September 2009)
Norman Borlaug was one of the greatest men of our times, a steadfast champion and
spokesman against hunger and poverty. He dedicated his 95 richly lived years to filling the
bellies of others, and is credited by the United Nations World Food Programme with saving
more lives than any other man in history.
Dr Borlaug was an American plant pathologist who spent most of his years in Mexico. His
high-yielding dwarf wheat varieties prevented widespread famine in South Asia,
specifically India and Pakistan, and also in Turkey. Known as the Green Revolution,
this feat earned him the Nobel Peace Prize in 1970. He was instrumental in establishing
the International Maize and Wheat Improvement Center, known by its Spanish acronym
CIMMYT, and later the Consultative Group on International Agricultural Research (CGIAR),
a network of 15 agricultural research centres.
Dr Borlaug spent time as a microbiologist with DuPont before moving to Mexico in
1944 as a geneticist and plant pathologist to develop stem rust resistant wheat cultivars. In
1966 he became the director of CIMMYT's Wheat Program, seconded from the Rockefeller
Foundation. His full-time employment with the Center ended in 1979, although he remained
a part-time consultant until his death. In 1984 he began a new career as a university pro-
fessor and went on to establish the World Food Prize, which honours the achievements of
individuals who have advanced human development by improving the quality, quantity or
availability of food in the world. In 1986, he joined forces with former US President Jimmy
Carter and the Nippon Foundation of Japan, under the chairmanship of Ryoichi Sasakawa,
to establish the Sasakawa Africa Association (SAA) to address Africa's food problems. Since
then, more than 1 million small-scale African farmers in 15 countries have been trained by
SAA in improved farming techniques.
Dr Borlaug influenced the thinking of thousands of agricultural scientists. He was a
pathbreaking wheat breeder and, equally important, his stature enabled him to influence
politicians and leaders around the world. His legacy and his work ethic, to get things done
and not mind getting your hands dirty, influenced us all and remain CIMMYT guiding principles.
We will honor Dr Borlaug's memory by carrying forward his mission and spirit of
innovation: applying agricultural science to help smallholder farmers produce more and
better-quality food using fewer resources. At stake is no less than the future of humanity, for, as
Borlaug said: 'The destiny of world civilization depends upon providing a decent standard
of living for all.' His presence will never really leave CIMMYT; it is embedded in our soul.
Thomas A. Lumpkin
Director General, CIMMYT
Marianne Bänziger
Deputy Director General for Research and Partnerships, CIMMYT
Hans-Joachim Braun
Director for Global Wheat Program, CIMMYT
Molecular Plant Breeding
Yunbi Xu
Preface
Foreword by Dr Norman E. Borlaug
Foreword by Dr Ronald L. Phillips
1 Introduction
1.1 Domestication of Crop Plants
1.2 Early Efforts at Plant Breeding
1.3 Major Developments in the History of Plant Breeding
1.4 Genetic Variation
1.5 Quantitative Traits: Variance, Heritability and Selection Index
1.6 The Green Revolution and the Challenges Ahead
1.7 Objectives of Plant Breeding
1.8 Molecular Breeding
References
Index
The colour plate section can be found following p. 270
Preface
The genomics revolution of the past decade has greatly enhanced our understanding of
the genetic composition of living organisms including many plant species of economic
importance. Complete genomic sequences of Arabidopsis and several major crops, together
with high-throughput technologies for analyses of transcripts, proteins and mutants, pro-
vide the basis for understanding the relationship between genes, proteins and phenotypes.
Sequences and genes have been used to develop functional and biallelic markers, such as
single nucleotide polymorphism (SNP), that are powerful tools for genetic mapping, germ-
plasm evaluation and marker-assisted selection.
The road from basic genomics research to impacts on routine breeding programmes has
been long, winding and bumpy, not to mention scattered with wrong turns and unexpected
blockades. As a result, genomics can be applied to plant breeding only when an integrated
package becomes available that combines multiple components such as high-throughput
techniques, cost-effective protocols, global integration of genetic and environmental fac-
tors and precise knowledge of quantitative trait inheritance. More recently, the end of the
tunnel has come in sight, and the multinational corporations have ramped up their invest-
ments in and expectations from these technologies. The challenge now is to translate and
integrate the new knowledge from genomics and molecular biology into appropriate tools
and methodologies for public-sector plant breeding programmes, particularly those in low-
income countries. It is expected that harnessing the outputs of genomics research will be
an important component in successfully addressing the challenge of doubling world food
production by 2050.
The term 'molecular plant breeding' has been much used and abused in the literature, and
thus loved or maligned in equal measure by the readership. In the context of this book, the
term is used to provide a simple umbrella for the multidisciplinary field of modern plant
breeding that combines molecular tools and methodologies with conventional approaches
for improvement of crop plants. This book is intended to provide comprehensive coverage
of the components that should be integrated within plant breeding programmes to develop
crop products in a more efficient and targeted way.
The first chapter introduces some basic concepts that are required for understanding
fundamentally important issues described in subsequent chapters. The concepts include
crop domestication, critical events in the history of plant breeding, basics of quantitative
genetics (variance, heritability and selection index), plant breeding objectives and molecu-
lar breeding goals. Chapters 2 and 3 introduce the key genomics tools that are used in
molecular breeding programmes, including molecular markers, maps, omics technolo-
gies and arrays. Different types of molecular markers are compared and construction of
molecular maps is discussed. Chapter 4 describes common types of populations that have
been used in genetics and plant breeding, with a focus on recombinant inbred lines, dou-
bled haploids and near-isogenic lines. Chapter 5 provides an overview of marker-assisted
germplasm evaluation, management and enhancement. Chapters 6 and 7 discuss the theory
and practice, respectively, of using molecular markers to dissect complex traits and locate
quantitative trait loci (QTL). Chapters 8 and 9 cover the theory and practice, respectively,
of marker-assisted selection. Genotype-by-environment interaction (GEI) is discussed in
Chapter 10, including multi-environment trials, stability of genotype performance, molecu-
lar dissection of GEI and breeding for optimum GEI. Chapter 11 provides a summary of
gene isolation and functional analysis approaches, including in silico prediction of genes,
comparative approaches for gene isolation, gene cloning based on cDNA sequencing, posi-
tional cloning and identification of genes by mutagenesis. Chapter 12 describes the use of
isolated and characterized genes for gene transfer and the generation of genetically modi-
fied plants, focusing on the vital elements of expression vectors, selectable marker genes,
transgene integration, expression and localization, transgene stacking and transgenic crop
commercialization. Chapter 13 is devoted to intellectual property rights and plant variety
protection, including plant breeders' rights, international agreements affecting plant
breeding, plant variety protection strategies, intellectual property rights affecting molec-
ular breeding and the use of molecular techniques in plant variety protection. The last
two chapters (14 and 15) discuss supporting tools that are required in molecular breeding
for information management and decision making, including data collection, integration,
retrieval and mining and information management systems. Decision support tools are
described for germplasm and breeding population management and evaluation, genetic
mapping and marker-trait association analysis, marker-assisted selection, simulation and
modelling, and breeding by design.
Intended audience and guidance for reading and using this book
This book is intended to provide a handbook for biologists, geneticists and breeders, as well
as a textbook for final year undergraduates and graduate students specializing in agronomy,
genetics, genomics and plant breeding. Although the book has attempted to cover all rel-
evant areas of molecular breeding in plants, many examples have been drawn from the
genomics research and molecular breeding of major cereal crops. It is hoped that the book
can also serve as a resource for training courses as described below. As each chapter covers
a complete story on a special topic, readers can choose to read chapters in any order.
Advanced Course on Quantitative Genetics: Chapters 1, 2, 4, 6, 7, 10 and 14, which
cover all molecular marker-based QTL mapping, including markers, maps, populations,
statistics and genotype-by-environment interaction.
Comprehensive Course on Marker-assisted Plant Breeding: Chapters 1, 2, 3, 4, 5, 8, 9,
10, 13, 14 and 15, which cover basic theories, tools, methodologies about markers, maps,
omics, arrays, informatics and support tools for marker-assisted selection.
Short Course on Genetic Transformation: Chapters 1, 11, 12 and 13, which provide
a brief introduction to gene isolation, transformation techniques, genetic-transformation-
related intellectual property and genetically modified organism (GMO) issues.
This book has been almost a decade in preparation. In fact, the initial idea for the book
was stimulated by the impact from my previous book Molecular Quantitative Genetics
published by China Agriculture Press (Xu and Zhu, 1994), which was well received by
colleagues and students in China and used as a textbook in many universities. Preliminary
ideas related to the book were developed in a review article on QTL separation, pyramiding
and cloning in Plant Breeding Reviews (Xu, 1997). Much of the hopeful thinking described
in this paper has fortunately come true during the following 10 years, and the manipulation
of QTL has been revolutionized and become mainstream. Complete sequences for
several plant genomes have become available, with more anticipated, and numerous genes
and QTL have been separated and cloned individually; some of them
have been pyramided for plant breeding through genetic transformation or marker-assisted
selection.
I started making tangible progress on this book while working as a molecular breeder
for hybrid rice at RiceTec, Inc., Texas (1998–2003). This experience shaped my thinking
about how an applied breeding programme could be integrated with molecular approaches.
With numerous QTL accumulating for a model crop, taking all the QTL into consideration
becomes necessary. Initial thoughts on this were described in 'Global view of QTL . . .', published
in the proceedings on quantitative genetics and plant breeding, which considered
various genetic background effects and genotype-by-environment interaction (Xu, Y., 2002).
Hybrid rice breeding, which involves a three-line system, requires a large number of test-
crosses in order to identify traits that perform well in seed and grain production. My expe-
rience in development of marker-assisted selection strategies for breeding hybrid rice was
then summarized in a review article in Plant Breeding Reviews (Xu, Y., 2003), which also
covered general strategies for other crops using hybrids.
Moving on to research at Cornell University with Dr Susan McCouch helped me to bet-
ter understand how molecular techniques could facilitate breeding of complex traits such
as water-use efficiency, which is a difficult trait to measure and requires strong collabora-
tion among researchers across many disciplines. In addition, this experience with rice as a
model crop raised the issue of how we can use rice as a reference genome for improvement
of other crops, which was discussed in an article published in a special rice issue of Plant
Molecular Biology (Xu et al., 2005).
With over 20 years' experience in rice, I decided to shift to another major crop by working
for the International Maize and Wheat Improvement Center (CIMMYT) as the principal
maize molecular breeder. CIMMYT has given me exposure to an interface connecting basic
research with applied breeding for developing countries and the resource-poor. Comparing
public- and private-sector breeding programmes has given me a keen understanding
of the importance of making the type of breeding systems that have been working well
for the private sector a practical reality for the public sector, particularly in developing
countries. This has been addressed in a recent review paper published in Crop Science
(Xu and Crouch, 2008), which discussed the critical issues for achieving this translation.
My most recent research has focused on the development of various molecular breeding
platforms that can be used to facilitate breeding procedures through seed DNA-based
genotyping, selective and pooled DNA analysis, and chip-based large-scale germplasm
evaluation, marker–trait association and marker-assisted selection (see Xu et al., 2009b for further
details). Thus, my career has evolved alongside the transition from molecular biology
research to routine molecular plant breeding applications and I strongly believe that now
is the right time for a mainstream publication providing comprehensive coverage of all
fields relevant for a new generation of molecular breeders.
Acknowledgements
The dream of writing this book could not have become reality without the wonderful sup-
port of Dr Susan McCouch at Cornell University and Dr Jinhua Xiao, now at Monsanto, who
have both fully supported my proposal since 2002. Their support and consistent
encouragement have greatly motivated me throughout the process. While I was working with Susan,
she allowed me great flexibility in my research projects and working hours so that I
could continue to make progress on the writing of this book. At the same time the Cornell
libraries were an indispensable source of the major references cited throughout the book.
Susan's encouragement provided the impetus to keep working on the book through a very
difficult time in my life. I also extend my appreciation to Dr Jonathan Crouch, the Director
of the Germplasm Resources Program at CIMMYT, whose full understanding
and support allowed me to complete the second half of the book. Jonathan's guidance and
contribution to my research projects and publications while at CIMMYT have significantly
influenced the preparation of the book.
I would also like to thank the chief editors of the three journals for which I have served
on the editorial boards during the preparation of this book: Dr Paul Christou for Molecular
Breeding, Dr Albrecht Melchinger for Theoretical and Applied Genetics, and Dr Hongbin
Zhang for International Journal of Plant Genomics. I thank them for their patience, support
and flexibility with my editorial responsibilities during the preparation of the book. In addi-
tion, Drs Christou and Melchinger also reviewed several chapters in their respective fields.
My appreciation also goes to Yanli Lu (a graduate student from Sichuan Agricultural
University of China) and Dr Zhuanfang Hao (a visiting scientist from the Chinese Academy
of Agricultural Sciences) who helped prepare some figures and tables during their work
in my lab at CIMMYT, Mexico. I would like to give special thanks to Dr Rodomiro Ortiz
at CIMMYT for his consistent information sharing and stimulating discussions during our
years together at CIMMYT. Finally, I would like to thank my colleagues at CIMMYT, par-
ticularly Drs Kevin Pixley, Manilal William, Jose Crossa and Guy Davenport, who provided
useful discussions on various molecular breeding-related issues.
Forewords
I am greatly indebted to Dr Norman E. Borlaug, visionary plant breeder and Nobel laureate
for his role in the Green Revolution, and Dr Ronald L. Phillips, Regents Professor and
McKnight Presidential Chair in Genomics, University of Minnesota, who each contributed
a foreword for the book. Their contributions emphasized the importance of molecular
breeding in crop improvement and the role that this book will play in molecular breeding
education and practice.
Reviewers
Each chapter of the book has undergone comprehensive peer review and revision before
finalization. The constructive comments and critical advice of these reviewers have greatly
improved this book. The reviewers were selected for their active expertise in the field of the
respective chapter. Reviewers come from almost all continents and work in various fields
including plant breeding, quantitative genetics, genetic transformation, intellectual prop-
erty protection, bioinformatics and molecular biology, many of whom are CIMMYT scien-
tists and managers. Considering that each chapter is relatively large in content, reviewers
had to contribute a lot of time and effort to complete their reviews. Although these inputs
were indispensable, any remaining errors are my sole responsibility. The names and
affiliations of the reviewers (alphabetically) are:
Several editors at CABI have been working with me over the years: Tim Hardwick (2002–2006),
Sarah Hulbert (2006–2007), Stefanie Gehrig (2007–2008), Claire Parfitt (2008–2009),
Meredith Caroll (2009) and Tracy Head (2009). These editors and their associates have
done a superb job of converting a series of manuscripts into a useable and coherent book.
I thank them for their effort, consideration and cooperation.
Research grants
During the preparation of the book, my research on genomic analysis of plant water-use
efficiency at Cornell University was supported by the National Science Foundation (Plant
Genome Research Project Grant DBI-0110069). My molecular breeding research at CIMMYT
has been supported by the Rockefeller Foundation, the Generation Challenge Programme
(GCP), Bill and Melinda Gates Foundation and the European Community, and through
other attributed or unrestricted funds provided by the members of the Consultative Group
on International Agricultural Research (CGIAR) and national governments of the USA,
Japan and the UK.
Family
It is difficult to imagine writing a book without the full support and understanding of one's
family. My greatest thanks go to my wife, Yu Wang, who has given me her wholehearted and
unwavering support, and to my sons, Sheng, Benjamin and Lawrence, who have retained
great patience during this long adventure. And finally to my parents, for their love, encour-
agement and vision that unveiled in me from my earliest years the desire to thrive on the
challenge of always striving to reach the highest mountain in everything I do.
Foreword
DR NORMAN E. BORLAUG
The past 50 years have been the most productive period in world agricultural history.
Innovations in agricultural science and technology enabled the Green Revolution, which
is reputed to have spared one billion people the pains of hunger and even starvation.
Although we have seen the greatest reductions in hunger in history, it has not been enough.
There are still one billion people who suffer chronic hunger, with more than half being
small-scale farmers who cultivate environmentally sensitive marginal lands in developing
countries.
Within the next 50 years, the world population is likely to increase by 60–80%, requiring
global food production to nearly double. We will have to achieve this feat on a shrinking
agricultural land base, and most of the increased production must occur in those countries
that will consume it. Unless global grain supplies are expanded at an accelerated rate, food
prices will remain high, or be driven up even further.
Spectacular economic growth in many newly industrializing developing countries,
especially in Asia, has spurred rapid growth in global cereal demand, as more people eat
better, especially through more protein-heavy diets. More recently, the subsidized conver-
sion of grains into biofuels in the USA and Europe has accelerated demand even faster.
On the supply side, a slowing in research investment in the developing world and more
frequent climatic shocks (droughts, floods) have led to greater volatility in production.
Higher food prices affect everyone, but especially the poor, who spend most of their
disposable income on food. Increasing supply, primarily through the generation and diffu-
sion of productivity-enhancing new technologies, is the best way to bring food prices down
and secure minimum nutritional standards for the poor.
Today's agricultural development challenges are centred on marginal lands and in
regions that have been bypassed during the Green Revolution, such as Africa and resource-
poor parts of Asia, and are experiencing the ripple effects of food insecurity through hun-
ger, malnutrition and poverty.
Despite these serious and daunting challenges, there is cause for hope. New science
and technology, including biotechnology, have the potential to help the world's poor and
food insecure. Biotechnologies have developed invaluable new scientific methodologies
and products for more productive agriculture and added-value food. This journey deeper
into the genome to the molecular level is the consequence of our progressive understanding
of the workings of nature. Genomics-based methods have given breeders greater precision
in selecting and transferring genes, which has not only reduced the time needed to
eliminate undesirable genes, but has also allowed breeders to access useful genes from
distant species.
Bringing the power of science and technology to bear on the challenges of these riskier
environments is one of the great challenges of the 21st century. With the new tools of bio-
technology, we are poised for another explosion in agricultural innovation. New science
has the power to increase yields, address agroclimatic extremes and mitigate a range of
environmental and biological challenges.
Molecular Plant Breeding, authored by my CIMMYT colleague Yunbi Xu, is an out-
standing review and synthesis of the theory and practice of genetics and genomics that
can drive progress in modern plant breeding. Dr Xu has done a masterful job in integrating
information about traditional and molecular plant breeding approaches. This encyclope-
dic handbook is poised to become a standard reference for experienced breeders and stu-
dents alike. I commend him for this prodigious new contribution to the body of scientific
literature.
Foreword
DR RONALD L. PHILLIPS
The road is long from basic research findings to final destinations reflecting important
applications but it is a road that can ultimately save time and money. There may be obsta-
cles along the way that delay building that road but they are generally overcome by careful
thought and timely considerations. A new road may involve the former road but with some
widening and the filling in of certain potholes. We seldom look back and think that the
improvements were not useful.
The road to improved varieties by traditional plant breeding has served and continues to serve
society well. That approach has been based on careful observation, evaluation of multi-
ple genotypes (parents and progenies), selection at various generational levels, extensive
testing and the sophisticated utilization of statistical analyses and quantitative genetics.
About 50% of the increased productivity of new varieties is generally attributed to genetic
improvements, with the remaining 50% due to many other factors such as time of planting,
irrigation, fertilizer, pesticide applications and planting densities.
The statistical genetics associated with traditional plant breeding can now be supple-
mented by extensive genomic information, gene sequences, regulatory factors and linked
genetic markers. We can now draw on a broader genetic base, the identification of major
loci controlling various traits and expression analyses across the entire genome under vari-
ous biotic and abiotic conditions. One can anticipate a future when the networking of
genes, genotype-by-environment (G × E) interactions, and even hybrid vigour will be better
understood and lead to new breeding approaches. The importance of de novo variation may
modify much of our current interpretation of breeding behaviour; de novo variation such as
mutation, intragenic recombination, methylation, transposable elements, unequal crossing
over, generation of genomic changes due to recombination among dispersed repeated ele-
ments, gene amplification and other mechanisms will need to be incorporated into plant
breeding theory.
This book calls for an integration of approaches, traditional and molecular, and
represents a theoretical/practical handbook reflecting modern plant breeding at its finest.
I believe the reader will be surprised to find that this single-authored book is so
full of information that is useful in plant genetics and plant breeding. Students as well
as established researchers wanting to learn more about molecular plant breeding will be
well-served by reading this book. The information is up-to-date with many current refer-
ences. Even many of the tables are packed with information and references. A good rep-
resentation of international and domestic breeding is reflected through many examples.
The importance of G × E interactions is clearly demonstrated. Various statistical models
are provided as appropriate. The importance of defining mega-environments for varietal
development is made clear. The role of core germplasm collections, appropriate population
sizes, major databases and data management issues are all integrated with various plant
breeding approaches. Marker-assisted selection receives considerable attention, includ-
ing its requirements and advantages, along with the multitude of quantitative trait locus
(QTL) analysis methods. Transformation technologies leading to the extensive use of trans-
genic crops are reviewed along with the increased use of trait stacking. The procurement of
intellectual property that, in part, is driving the application of molecular genetics in plant
breeding provides the reader with an understanding of why private industry is now more
involved and why some common crops represent new business opportunities.
Molecular Plant Breeding is not like other plant breeding books. The interconnecting
road that it depicts is one where you can look at the beautiful new scenery and appreciate
the current view, yet see the horizon down the road.
1
Introduction
Several definitions of plant breeding have been put forward, such as 'the art and science
of improving the heredity of plants for the benefit of humankind' (J.M. Poehlman), or
'evolution directed by the will of man' (N.I. Vavilov). Bernardo (2002), however, offers the
most universal description: 'Plant breeding is the science, art, and business of improving
plants for human benefit.'
Plants are employed in the manufacture of a multitude of products for domestic (cosmetics,
medicines and clothing), industrial (manufacture of rubber, cork and engine fuel) and
recreational uses (paper, art supplies, sports equipment and musical instruments), and plant
breeders have therefore been driven by the challenges of meeting the ever-increasing demands
of the manufacturers of these products. Lewington has described the diverse uses of plants
in his book Plants for People (2003).
Plant breeding began with the domestication of crop plants and has become ever more
sophisticated. New developments in molecular biology have now led to an increasing number
of methods which can be used to enhance breeding effectiveness and efficiency. This chapter
includes a brief history of plant breeding together with breeding objectives and some
background information relevant to the theories and technologies discussed in the following
chapters of this book.
1.1 Domestication of Crop Plants
The earliest records indicate that agriculture developed some 11,000 years ago in the
so-called Fertile Crescent, a hilly region in south-western Asia. Agriculture developed later
in other regions. Archaeologists suggest that plant domestication began because of the
increasing size of populations and changes in the exploitation of local resources (see
http://www.ngdc.noaa.gov/paleo/ctl/10k.html for further details). Domestication is a
selection process carried out by humans to adapt plants and animals to their own needs,
whether as farmers or consumers. Successive selection of desirable plants changed the genetic
composition of early crops. Primitive farmers, knowing little or nothing about genetics or
plant breeding, accomplished much in a short time. They did so by unconsciously altering the
natural process of evolution. Indeed, domestication is nothing more than directed evolution;
as a result, the process of evolution is accelerated. The key to domestication is the
selective advantage of rare mutant alleles, which are desirable for successful cultivation,
but unnecessary for survival in the wild. The process of selection continues until the
desired mutant phenotype dominates the population. There are three important steps in the
domestication process. Man not only planted seeds, but also: (i) moved seeds from their
native habitat and planted them in areas to which they were perhaps not as well adapted;
(ii) removed certain natural selection pressures by growing the plants in a cultivated field;
and (iii) applied artificial selection pressures by choosing characteristics that would not
necessarily have been beneficial for the plants under natural conditions. Cultivation also
creates selection pressure, resulting in changes in allele frequency, gradations within and
between species, fixation of major genes, and improvement of quantitative traits. By the end
of the 18th century, the informal processes of selection practised by farmers everywhere led
to the worldwide creation of thousands of different cultivars or landraces for each major
crop species.
More than 1000 species of plants have been domesticated at one time or another, of which
about 100–200 are now major components of the human diet. The 15 most important examples
can be divided into the following four groups:
1. Cereals: rice, wheat, maize, sorghum, barley.
2. Roots and stems: sugarbeet, sugarcane, potato, yam, cassava.
3. Legumes: bean, soybean, groundnut.
4. Fruits: coconut, banana.
Certain characteristics may have been selected deliberately or unwittingly. When farmers set
aside a portion of their harvest for planting in the next season, they were selecting seeds
with specific characteristics. This selection has resulted in profound differences between
crop plants and their progenitors. For example, many wild plants have a seed dispersal
mechanism that ensures that seeds will be separated from the plants and distributed over as
large an area as possible, while modern crops have been modified by selection against seed
dispersal. The absence of seed dormancy mechanisms in some domesticated plants is another
example. For further information see http://oregonstate.edu/instruct/css/330/index.htm and
Swaminathan (2006).
It is generally believed that domestication of crop plants was undertaken in several regions
of the world independently. The Russian geneticist and plant geographer N.I. Vavilov
collected plants from all over the world and identified regions where crop species and their
wild relatives showed great genetic diversity. In 1926 he published 'Studies on the origin of
cultivated plants', in which he described his theories regarding the origins of crops.
Vavilov concluded that each crop had a characteristic primary centre of diversity which was
also its centre of origin. He identified eight areas and hypothesized that these were the
centres from which all our modern major crops originated. Later, he modified his theory to
include secondary centres of diversity for some crops. These centres of origin included
China, India, Central Asia, the Near East, the Mediterranean, East Africa, Mesoamerica, and
South America. From these foci, agriculture was progressively disseminated to other regions
such as Europe and North America. Subsequently, others, including the American geographer
Jack Harlan, challenged Vavilov's hypothesis because many cultivated plants did not fit
Vavilov's pattern, and appeared to have been domesticated over a broad geographical area for
a long period of time.
In recent years, variation in DNA fractions and other approaches have been used to study the
diversity of crop species. In general, these studies have not confirmed Vavilov's theory that
the centres of origin are the areas of greatest diversity, because while centres of
diversity have been identified, these are often not the centres of origin. For some crops
there is little connection between the source of their wild ancestors, areas of
domestication, and the areas of evolutionary diversification. Species may have originated in
one geographic area, but been domesticated in a different region, and some crops do not
appear to have centres of diversity; thus a continuum of evolutionary activity is perceived
rather than discrete
selected the best plants to provide seed for
centres. their next crop. When they found particular
In 1971, Jack Harlan described his own plants that fared well even in bad weather,
views on the origins of agriculture. He pro- were especially prolific, or resisted disease
posed three independent systems, each that had destroyed neighbouring crops,
with a centre and a concentre (larger, dif- they naturally tried to capture these desir-
fuse areas where domestication is thought to able traits by crossbreeding them into other
have occurred): Near East + Africa, China + plants. In this way, they selected and bred
South-east Asia, and Mesoamerica + South plants to improve their crop for commercial
America. purposes. Although unbeknown to them,
Evidence gathered since that time farmers have been utilizing genetics for cen-
suggests that these centres are also more turies to modify the food we eat by selecting
diffuse than he had envisioned. After the and growing seeds which produce a health-
initial phases of evolution, species spread ier crop that has a better flavour, richer col-
out over large, ill-defined areas. This is our and stronger resistance to certain plant
probably due to the dispersal and evolu- diseases.
tion of crops associated with iterant popu- Modern plant breeding started with
lations. Regional and/or multiple areas of sedentary agriculture and the domestica-
origin may prove to be more accurate than tion of the first agricultural plants, cereals.
the hypothesis of a unique, localized ori- This led to the rapid elimination of undesir-
gin for many crops. However, the probable able characters such as seed-shattering and
geographic origin of many crops is listed in dormancy and we can only speculate on
Table 1.1. how much foresight or what kind of plan-
ning based on experience was used by the
first selectors of non-shattering wheat and
1.2 Early Efforts at Plant Breeding rice, compact-headed sorghum, or soft-
shelled gourds. For 10,000 years man has
For thousands of years selective breed- consciously been moulding the phenotype
ing has been employed to re-engineer (and so the genotype) of hundreds of plant
plants to produce traits or qualities that species as one of the many routine activi-
were considered to be desirable to con- ties in the normal course of making a living
sumers. Selective breeding began with the (Harlan, 1992). Over long periods of time
early farmers, ranchers and vintners who there was a transition from the collection of
Region | Crops
Near East (Fertile Crescent) | Wheat and barley, flax, lentils, chickpea, figs, dates, grapes, olives, lettuce, onions, cabbage, carrots, cucumbers, melons; fruits and nuts
Africa | Pearl millet, Guinea millet, African rice, sorghum, cowpea, groundnut, yam, oil palm, watermelon, okra
China | Japanese millet, rice, buckwheat, soybean
South-east Asia | Wet- and dryland rice, pigeon pea, mung bean, citrus fruits, coconut, taro, yams, banana, breadfruit, coconut, sugarcane
Mesoamerica and North America | Maize, squash, common bean, lima bean, peppers, amaranth, sweet potato, sunflower
South America | Lowlands: cassava; Mid-altitudes and uplands (Peru): potato, groundnut, cotton, maize
wild plants for food to the selection of those to be cultivated which began to guide the evolutionary process. Now plant breeders accelerate the evolution of major crop species through skilful manipulation of breeding procedures. High-input agriculture emerged as a result of voyages of discovery and modern science.
Many traits important to early agriculturists were heritable and, therefore, could be reliably selected. However, this phase of breeding was empirical and generally not considered scientific in the modern sense because changes in these plant and animal populations were not analysed in an attempt to explain biological phenomena. At this stage of agriculture, the focus was on the practical goal of producing food rather than finding rational explanations for nature (Harlan, 1992). Ideas about heredity during the period when many early crops were domesticated ranged from mythological interpretations to near-scientific notions of trait transmission. In his Presidential Address to the American Society for Horticultural Science in 1987, Janick (1988) stated:
    The origin of new information in horticulture derives from two traditions: empirical and experimental. The roots of empiricism stem from efforts of prehistoric farmers, Hellenic root diggers, medieval peasants, and gardeners everywhere to obtain practical solutions to problems of plant growing. The accumulated successes and improvements passed orally from parent to child, from artisan to apprentice, have become embodied in human consciousness via legend, craft secrets, and folk wisdom. This information is now stored in tales, almanacs, herbals, and histories and has become part of our common culture. More than practices and skills were involved as improved germplasm was selected and preserved via seed and graft from harvest to harvest and generation to generation. The sum total of these technologies makes up the traditional lore of horticulture. It represents a monumental achievement of our forbears unknown and unsung.
Large-scale breeding activities began very early in Europe, often under the auspices of commercial seed production enterprises. Besides selecting plants with useful characteristics, breeders also arrange marriages between plants with different traits in the hope of producing fertile offspring carrying both traits. The use of artificial crosses in pre-Mendelian breeding is exemplified by the case of Fragaria × ananassa, developed in the botanical garden of Paris by Duchesne in the 18th century by crossing Fragaria chiloensis with Fragaria virginiana. In England, at about the same time, new cultivars of fruits, wheat and peas were being obtained by artificial hybridization (Sánchez-Monge, 1993).
Hybridization combined with selection was adopted by Patrik Sheireff in 1819 in wheat and rice, where the new selections were grown along with cultivars for comparative purposes. He considered introduction and hybridization to be the important sources of new cultivars and stressed crossing of carefully selected parents to meet the aims of new cultivars. Although the essential elements of plant breeding were known by this time, there was still a lack of knowledge regarding the scientific basis of variation among plants. For example, the first generation of crossed materials was mistakenly expected to inevitably produce new cultivars, but these instead took several generations to stabilize. Many historical examples of successful plant breeding can be found in the literature, although there were still many important discoveries to be made before it could be called a technology (Chahal and Gosal, 2002).
1.3 Major Developments in the History of Plant Breeding
Plant breeders of today use various methods to accelerate the evolutionary process in order to increase the usefulness of plants by exploiting genetic differences within a species. This has been made possible by the determination of the genetic basis for developing crop breeding procedures and this in turn has a long history.
The role of reproduction in plants was first reported in 1694 by Camerarius, who noticed the difference between male and female reproductive organs in maize and produced the first artificial hybrid plant. He established that seed could not be produced without the participation of pollen produced in male reproductive organs of plants. The first hybridization experiment was carried out on wheat by Fairchild in 1719 and the current technique of hybridization is largely based on the work of Kölreuter (1733–1806), a German researcher who carried out his experiments in the 1760s. Hybridization freed the breeder from the severe constraints of working within a limited population, enabled him to bring together useful traits from two or more sources, and allowed specific genes to be introduced.
By understanding the reproductive capacities of plants, plant breeders can manipulate these crosses to produce fertile offspring which carry traits from both parents. Crossing has been very valuable to plant breeders, because it allows some measure of control over the phenotype of a plant. Nearly all modern plant breeding involves some use of hybridization.
1.3.2 Mendelian genetics
In 1859 Darwin proposed in The Origin of Species that natural selection is the mechanism of evolution. Darwin's thesis was that the adaptation of populations to their environments resulted from natural selection and that if this process continued for long enough, it would ultimately lead to the origin of new species. Darwin's Theory of Evolution through Natural Selection hypothesized that plants change gradually by natural selection operating on variable populations and was the outstanding discovery of the 19th century with direct relevance to plant breeding.
1.3.4 Breeding types and polyploidy
Other historical developments in plant breeding include pedigree breeding, backcross breeding (Harlan and Pope, 1922) and mutation breeding (Stadler, 1928). Natural and artificial polyploids also offered new possibilities for plant breeding. Blakeslee and Avery (1937) demonstrated the usefulness of colchicine in the induction of chromosome doubling and polyploidy, enabling plant breeders to combine entire chromosome sets of two or more species to evolve new crop plants.
in Major Crops that brought into focus the causes and levels of genetic uniformity and its consequences. It was a turning point in the history of germplasm resources and the International Board for Plant Genetic Resources (IBPGR) was established in 1974, and was later renamed the International Plant Genetic Resources Institute (IPGRI) and now Bioversity International, to collect, evaluate and conserve plant germplasm for future use.
1.3.6 Quantitative genetics and genotype-by-environment interaction
Quantitative genetics is the study of the genetic control of those traits which show continuous variation. It is concerned with the level of inheritance of these differences between individuals rather than the type of differences, that is quantitative rather than qualitative (Falconer, 1989). Several important books have been published which document the major developments in quantitative genetics and these include Animal Breeding Plans (Lush, 1937), Population Genetics and Animal Improvement (Lerner, 1950), Biometrical Genetics (Mather, 1949), Population Genetics (Li, 1955), An Introduction to Genetic Statistics (Kempthorne, 1957) and Introduction to Quantitative Genetics (Falconer, 1960).
Many of the misconceptions regarding the inheritance of quantitative traits, which include most of the economically important characters, were corrected by the classical work of Fisher (1918), who successfully applied Mendelian principles to explain the genetic control of continuous variation. He divided the phenotypic variance observed into three variance components: additive, dominance and epistatic effects. This approach has been substantially refined and applied to the improvement of the efficiency of plant breeding. Fisher also laid the foundations for scientific crop experimentation by developing the theory of experimental designs that is an essential part of any plant breeding programme. Quantitative genetics has however evolved considerably in the past two decades because of the development of plant genomics, particularly molecular markers, and other molecular tools that can be used to dissect complex traits into single Mendelian factors (Xu and Zhu, 1994; Buckler et al., 2009; Chapters 6 and 7).
Genotype-by-environment interaction (GEI) and its importance to plant breeding were first recognized by Mooers (1921) and Yates and Cochran (1938). Since then, various statistical methods have been developed for the evaluation of GEI using joint linear regression, heterogeneity of variance and lack of correlation, ordination, clustering, and pattern analysis. As an important field in quantitative genetics, GEI has been receiving more attention in recent years and is covered in Chapter 10 along with molecular methods for GEI analysis.
1.3.7 Heterosis and hybrid breeding
Although early botanists had observed increased growth when unrelated plants of the same species were crossed, it was Charles Darwin who carried out the first seminal experiments. In 1877, he showed that crosses of related strains did not exhibit the vigour of hybrids. He observed heterosis, i.e. the tendency of cross-bred individuals to show qualities superior to those of both parents, in crops like maize and concluded that cross-fertilization was generally beneficial and self-fertilization injurious. In 1879, William Beal demonstrated hybrid vigour in maize by using two unrelated cultivars. The best combinations yielded 50% more than the mean of the parents. Reports by Sanborn in 1890 and McClure in 1892 confirmed Beal's earlier reports and extended the generality of the superiority of hybrids over the average of the parental forms.
1.3.8 Refinement of populations
Several different population breeding methods can be used: (i) bulk; (ii) mass selection; and (iii) recurrent selection. One
of the methods used for managing large populations of segregates was the bulk method proposed by Harlan et al. (1940) for multi-parent crosses. This concept changed the breeding methodologies for self-pollinated species. Mass selection is a system of breeding in which seeds from individuals selected on the basis of phenotype are bulked and used to grow the next generation. Mass selection is the oldest breeding method for plant improvement and was employed by early farmers for the development of cultivated species from their ancestral forms.
The enhancement of open-pollinated populations of crops such as rye, maize and sugarbeet, herbage grasses, legumes, and tropical trees such as cacao, coconut, oil palm, and some rubber, depends essentially on changing the gene frequencies so that the favourable alleles are fixed, while maintaining a high (but far from maximal) degree of heterozygosity. Recurrent selection is a method of plant breeding associated with quantitatively inherited traits by which the frequencies of favourable genes are increased in populations of plants. The methodology is cyclical, with each cycle encompassing two phases: (i) selection of genotypes that possess the favourable or required genes; and (ii) crossing among the selected genotypes. This leads to a gradual increase in the frequencies of the desired alleles. While recurrent selection is often successful, it also has potential limitations in closed populations and this has led to numerous modifications and alternative schemes (see Hallauer and Miranda, 1988). Recurrent selection breeding methods have been applied to a wide range of plant species, including self-pollinated crops.
1.3.9 Cell totipotency, tissue culture and somaclonal variation
The discovery of auxins, by Went and colleagues, and cytokinins, by Skoog and colleagues, preceded the first success of in vitro culture of plant tissues (White, 1934; Nobécourt, 1939).
All the genes necessary to make an entire organism can be induced to function in the correct sequence from a living cell isolated from a mature tissue (called totipotency). Regeneration of whole plants from single cells is an important new source of genetic variability for refining the properties of plants because when somatic embryos derived from single cells are grown into plants, the plants' characteristics vary somewhat. Larkin and Scowcroft (1981) coined the term 'somaclonal variation' to describe this observed phenotypic variation among plants derived from micro-propagation experiments. When it was recognized as a genuine phenomenon, somaclonal variation was considered to be a potential tool for the introduction of new variants of perennial crops that can be asexually propagated (e.g. banana). Somaclonal variation has also been exploited by plant breeders as a new source of genetic variation for annual crops.
1.3.10 Genetic engineering and gene transfer
The discovery of the structure of DNA by Watson and Crick has enhanced traditional breeding techniques by allowing breeders to pinpoint the particular gene responsible for a particular trait and to follow its transmission to subsequent generations. Enzymes that cut and rejoin DNA molecules allow scientists to manipulate genes in the laboratory. In 1973 Stanley Cohen and Herbert Boyer spliced the gene from one organism into the DNA of another to produce recombinant DNA which was then expressed normally, and this formed the basis of genetic engineering. The goal of plant genetic engineers is to isolate one or more specific genes and introduce these into plants. Improvement in a crop plant can often be achieved by introducing a single gene, and genes can now be transferred to plants using the natural gene transfer system of a promiscuous pathogenic soil bacterium, Agrobacterium tumefaciens. DNA can also be introduced into cells by bombardment with DNA-coated particles or by electroporation. Transgenic breeding
has the potential to decrease or increase the environmental impact of agricultural practices.
The initial successes in plant genetic engineering marked a significant turning point in crop research. In the 1990s in particular, there was an upsurge of private sector investment in agricultural biotechnology. Some of the first products were plant strains capable of synthesizing an insecticidal protein encoded by a gene isolated from the bacterium Bacillus thuringiensis (Bt). Bt cotton, maize, and other crops are now grown commercially. There are also crop cultivars which are tolerant to or capable of degrading herbicides. Proponents stress the value of these crops in conserving tillage soil, reducing the use of harmful chemicals and reducing the labour and costs involved in crop production.
1.3.11 DNA markers and genomics
During the 1980s and 1990s, various types of molecular markers such as restriction fragment length polymorphism (RFLP) (Botstein et al., 1980), randomly amplified polymorphic DNA (RAPD) (Williams et al., 1990; Welsh and McClelland, 1990), microsatellites and single nucleotide polymorphism (SNP) were developed. Because of their abundance and importance in the plant genome, molecular markers have been widely used in the fields of germplasm evaluation, genetic mapping, map-based gene discovery and marker-assisted plant breeding. Molecular marker technology has become a powerful tool in the genetic manipulation of agronomic traits.
Initiated by the complete sequencing of the Arabidopsis genome in 2000 (The Arabidopsis Genome Initiative, 2000) and the rice genome in 2002 (Goff et al., 2002; Yu et al., 2002), the genomes of an increasing number of plants have been or are being sequenced. Technological developments in bioinformatics, genomics and various 'omics' fields are creating substantial data on which future revolutions in plant breeding can be based.
1.3.12 Breeding efforts in the public and private sectors
Agricultural research has mainly been the responsibility of a national and/or state government department. To accelerate progress in food production, especially in developing countries, international agricultural research centres were established with major emphasis on the development of high-yielding cultivars. Two centres, the International Rice Research Institute (IRRI), Philippines, and the Centro Internacional de Mejoramiento de Maíz y Trigo (CIMMYT), Mexico, established in the 1960s, made phenomenal contributions to food production by developing shorter and higher-yielding rice, wheat and maize cultivars. Encouraged by the astonishing success of these centres and two others which were established later, the Consultative Group on International Agricultural Research (CGIAR) was established in 1971. The CGIAR now has 15 international agricultural research centres, of which eight concentrate on specific crop plants and one on genetic resources, with a mission to contribute towards sustainable agriculture for food security, especially in developing countries. The breeding materials developed at these centres are distributed to public and private sector research programmes for utilization in the development of locally adapted cultivars. Through the National Agricultural Research Systems (NARS), these centres work in close coordination with public and private breeding programmes in each country and share their breeding technologies and stocks of germplasm.
In the USA, crop breeding, with the exception of cotton, began largely as a tax-supported endeavour with breeding programmes taking place in most State Agricultural Experimental Stations and in the United States Department of Agriculture (USDA). This pattern changed with the advent of hybrid maize, when inbred lines were initially developed by public institutions and utilized to produce hybrids by private companies. With the implementation of a Plant Variety Protection Act in the USA in 1974, private breeding was expanded to
include forages, cereals, soybean, and other crops. The activities of private companies contributed to the total crop breeding effort and offered a large number of cultivar options for farmers and consumers. In the USA and other industrialized countries today, the new life-science companies, notably the big multinationals such as Dow, DuPont and Monsanto, dominate the application of biotechnology to agriculture, and have developed many proprietary products.
leading to the proliferation of specific traits within that population. The degree of gene flow varies widely and is dependent on the type of organism and population structure. For example, genes in a mobile population are likely to be more widely distributed than those in a sedentary population, resulting in high and low rates of gene flow, respectively.
1.4.2 Mutation
to phenotypic variation and to identify the molecular pathway from gene to function. The recent progress made in humans by combining linkage disequilibrium mapping (Chapter 6) and transcriptomics (Chapter 3) holds great promise for high-resolution association mapping and identification of regulatory genetic factors (Dixon et al., 2007). Information from 'omics' research will be integrated with our current knowledge at the phenotypic level to increase the effectiveness and efficiency of plant breeding.
1.5.1 Qualitative and quantitative traits
In general, qualitative traits are genetically controlled by one or a few major genes, each of which has a relatively large effect on the phenotype but is relatively insensitive to environmental influences. Trait distribution in a typical segregating population such as an F2 shows multi-peak distribution, although individuals within a category show continuous variation. Each individual in the population can be classified unambiguously into distinct categories that correspond to different genotypes so that they can be studied using Mendelian methods.
Quantitative traits are genetically controlled by many genes, each of which has a relatively small effect on the phenotype, but is largely influenced by environmental factors (Buckler et al., 2009). Trait distribution in an F2 population usually shows a normal or bell-shaped distribution and as a result, individuals cannot be classified into phenotypic categories that correspond to different genotypes, thus making the effects of individual genes indistinguishable. Quantitative genetics is traditionally described as the study of all these genes as a whole, and the total variation observed in a population results from the combined effects of genetic (polygenes as a whole) and environmental factors. However, quantitative variation is not due solely to minor allelic variation in structural genes, as regulatory genes no doubt also contribute to this variation. We expect polygenes to show all the typical properties of chromosomal genes both in terms of action and in transmission through meiosis.
1.5.2 The concept of allelic and genotypic frequencies
A biological population is defined genetically as a group of individuals that exist together in time and space and that can mate or be crossed to each other to produce fertile progeny. Statistically, this group is called a population. Breeding populations are created by breeders to serve as a source of cultivars that meet specific breeding objectives.
At the population level, genetics can be characterized by allelic and genotypic frequencies. The allele frequencies refer to the proportion of each allele in the population, while the genotypic frequencies refer to the proportion of individuals (plants) in the population that have a particular genotype. A gene may have many allelic states. Some of the alleles of a given gene may have such marked effects as to be clearly recognized as a classical major mutant. Other alleles, though potentially separable at the DNA level, may well cause only minor differences at the level of the external phenotype. For example, one allele at a locus involved with growth hormone production could be inactive and result in a dwarf plant, while others may simply reduce or increase height by a few percent.
Allele and genotypic frequencies can be calculated by simple counting in the population. For a gene with n alleles, there are n(n + 1)/2 possible genotypes. The relationship between allele frequency and genotypic frequency for a single gene at the population level can be used to infer the genetic status of the gene in that population, relative to the expected equilibrium under some assumed mating system. Allele frequencies are generally not an issue in breeding populations created from non-inbred parents or from three or more inbred parents. But breeding populations in both self-pollinated and cross-pollinated crops are often created by crossing two inbred individuals.
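The counting described in Section 1.5.2 can be sketched in a few lines of Python. This is a minimal illustration, not a method from the text: the function names, the allele labels A1 and A2 and the sample of 100 plants are hypothetical. The last function takes random mating as the assumed mating system (the Hardy–Weinberg case of Section 1.5.3) to obtain the expected genotype frequencies against which the observed ones can be compared.

```python
from collections import Counter
from itertools import combinations_with_replacement


def genotype_count(n_alleles):
    """Number of distinct diploid genotypes for a gene with n alleles: n(n + 1)/2."""
    return n_alleles * (n_alleles + 1) // 2


def frequencies(genotypes):
    """Return (allele_freq, genotype_freq) obtained by simple counting.

    `genotypes` is a list of 2-tuples of allele labels, one tuple per diploid plant.
    """
    n = len(genotypes)
    # Sort each pair so that ("A2", "A1") and ("A1", "A2") count as one genotype.
    geno_counts = Counter(tuple(sorted(g)) for g in genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    genotype_freq = {g: c / n for g, c in geno_counts.items()}
    allele_freq = {a: c / (2 * n) for a, c in allele_counts.items()}
    return allele_freq, genotype_freq


def expected_random_mating(allele_freq):
    """Expected genotype frequencies if alleles pair at random (assumed mating system)."""
    expected = {}
    for a, b in combinations_with_replacement(sorted(allele_freq), 2):
        p, q = allele_freq[a], allele_freq[b]
        expected[(a, b)] = p * p if a == b else 2 * p * q
    return expected


# Hypothetical sample: 30 A1A1, 40 A1A2 and 30 A2A2 plants.
sample = [("A1", "A1")] * 30 + [("A1", "A2")] * 40 + [("A2", "A2")] * 30
allele_freq, genotype_freq = frequencies(sample)
expected = expected_random_mating(allele_freq)
```

In this made-up sample both alleles have a frequency of 0.5, the observed heterozygote frequency is 0.4, and the frequency expected under random mating is 0.5, so the sample shows fewer heterozygotes than the assumed mating system predicts.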
1.5.3 Hardy–Weinberg equilibrium (HWE)
A population is in equilibrium if the allele and genotypic frequencies are constant from generation to generation. A collection of pure selfers is also at equilibrium if all are completely selfed, with $P_{A_1A_1} = p$ and $P_{A_2A_2} = q$. In a random-mating population, equilibrium implies that the allele frequency and genotypic frequency share a simple relationship:
$P_{A_1A_1} = p^2$
$P_{A_1A_2} = 2pq$
$P_{A_2A_2} = q^2$
or
$(p + q)^2 = p^2 + 2pq + q^2$
With one generation of random mating, i.e. when each individual in the population is equally likely to mate with any other individual, the above simple relationship will be obeyed. However, HWE represents idealized populations and breeders routinely use procedures that cause deviations from HWE. These procedures include the lack of random mating, the use of small population sizes, assortative mating, selection, and inbreeding during the development of progenies. Some of these procedures, such as inbreeding and the use of small population sizes, affect all loci in the population while others affect only certain loci. Suppose that two traits are controlled by different sets of loci, and a change in one trait does not affect the other. If selection occurs only for the first trait, the loci affecting that trait may deviate from HWE, but the loci for the other trait will remain in equilibrium. In large natural populations, migration, mutation, and selection are the forces that can change allelic frequencies from generation to generation.
1.5.4 Population means and variances
Theoretically, a population can be described by its parameters such as the mean and variance, which depend on the probability distribution of the population. The arithmetic mean, $\mu$, also known as the first moment about the origin, is a parameter used to measure the central location of a frequency distribution. The population variance, $\sigma^2$, also known as the second moment about the mean, provides a measure of the dispersion of the distribution. If the yield trait for a cultivar that is genetically homogeneous is taken as an example, the genetic effect for this cultivar population is a constant. The yield of all individuals should then also be a constant, equal to the population mean, provided that environmental factors do not affect the yield. However, the yield for each individual is affected not only by its genotype but also by environmental factors such as temperature, sunlight, water, and various nutrients. As a result, individuals may have different phenotypic values, in this case yield, resulting in continuous variation among individuals. Therefore, the individual yield measures vary either positively or negatively around the population mean, so that they are either higher or lower than the population mean by an amount whose spread is described by the variance.
1.5.5 Heritability
The response of traits to selection depends on the relative importance of the genetic and non-genetic factors which contribute to phenotypic variation among genotypes in a population, a concept referred to as heritability. The heritability of a trait has a major impact on the methods chosen for population improvement, inbreeding, and selection. Selection for single plants is more efficient when the heritability is high. The extent to which replicated testing is required for selection depends on the heritability of the trait.
The question of whether variation in a trait is a result of genetic or environmental variation is meaningless in practice. Genes cannot cause a trait to develop unless the organism is growing in an appropriate environment, and, conversely, no amount of manipulation will cause a phenotype to
develop unless the necessary gene or genes are present. Nevertheless, the variability observed in some traits might result primarily from differences in the numbers and the magnitude of the effect of different genes, while the variability in other traits might stem primarily from the differences in the environments to which various individuals have been exposed. It is therefore essential to identify reliable measures to determine the relative importance of not only the numbers and magnitude of the effects of the genes involved, but also of the effects of different environments on the expression of phenotypic traits (Allard, 1999).
Heritability is defined as the ratio of genetic variance to phenotypic variance:
genetic gain, and predicted progress or gain, and has been denoted as R, GS, G and G.
Starting with a parental population of mean, $\mu$, a subset of individuals is selected. The selected individuals have a mean $\bar{x}$, while the offspring of the selected population has a mean $\bar{y}$. The difference between the mean of the selected individuals and the mean of the original population is defined as the selection differential, and denoted by S, i.e.
$S = \bar{x} - \mu$
The response to selection, R, can be written as
$R = \bar{y} - \mu$
[Fig. 1.1 diagram: parental population with the selected individuals (k = 5%, mean $\bar{x}$), the selection differential $S = \bar{x} - \mu$, and the progeny population.]
Fig. 1.1. Distribution of parental and progeny populations with a selection intensity of 5%. Because the
phenotypic values of the selected plants include both a genetic and an environmental component, the
progeny means depend on the heritability of the trait selected.
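
The quantities shown in Fig. 1.1 can be illustrated numerically. The following is only a minimal sketch with made-up phenotypic values, not data from the text. It applies truncation selection to a small parental sample, computes the selection differential S, and then predicts the response with the widely used relationship R = h²S. That relationship is assumed here as a standard textbook result and is not stated in this excerpt; the function names and the value h² = 0.4 are hypothetical.

```python
from statistics import mean


def selection_differential(parents, proportion):
    """Return (S, selected) after keeping the top `proportion` of parents."""
    n_selected = max(1, round(len(parents) * proportion))
    selected = sorted(parents, reverse=True)[:n_selected]
    return mean(selected) - mean(parents), selected


def predicted_response(s, h2):
    """Predicted response R = h2 * S (assumed standard relationship)."""
    return h2 * s


def realized_heritability(parent_mean, selected_mean, progeny_mean):
    """Heritability estimated after selection as R / S."""
    return (progeny_mean - parent_mean) / (selected_mean - parent_mean)


# Hypothetical phenotypic values (arbitrary units) for 20 parental plants.
parents = [float(v) for v in range(41, 61)]            # parental mean is 50.5
S, selected = selection_differential(parents, 0.05)    # k = 5% keeps one plant
R = predicted_response(S, h2=0.4)                      # hypothetical h2
```

With these illustrative numbers the single selected plant has a value of 60, so S is 9.5 and the predicted response is about 3.8 units. The progeny mean moves only part of the way towards the mean of the selected plants, which is the point the caption of Fig. 1.1 makes about heritability.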
1.5.7 Selection index and selection for multiple traits

In most plant breeding programmes, there is a need to improve more than one trait at a time. For example, a high-yielding cultivar susceptible to a prevalent disease would be of little use to a grower. Recognition that improvement of one trait may cause improvement or deterioration in associated traits serves to emphasize the need for the simultaneous consideration of all traits which are important in a crop species. Three selection methods, which are recognized as appropriate for the simultaneous improvement of two or more traits in a breeding programme, are index selection, independent culling, and tandem selection. Independent culling requires the establishment of minimum levels of merit for each trait. An individual with a phenotype value below the critical culling level for any trait will be removed from the population. That is, only individuals meeting requirements for all traits will be selected. With tandem selection, one trait is selected until it is improved to a satisfactory level or a critical phenotypic value. Then, in the next generation or programme, selection for a second trait is carried out within the population selected for the first trait, and so on for the third and subsequent traits. A selection index is a single score which reflects the merits and demerits of all target traits. Selection among individuals is based on the relative values of the index scores.

Selection indices provide one method for improving multiple traits in a breeding programme. The use of a selection index in plant breeding was originally proposed by Smith (1936) who acknowledged critical input from Fisher (1936). Subsequently, methods of developing selection indices were modified, subjected to critical evaluation, and compared to other methods of multiple trait selection.

It is generally recognized that a selection index is a linear function of observable phenotypic values of different traits. There are a number of forms of the equations available from index selection for multiple traits in grain. To construct a selection index, the observed value of each trait is weighted by an index coefficient,

$I = b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$

where $I$ is an index of merit of an individual, $x_i$ represents the observed phenotypic value of the $i$th trait, and $b_1, \ldots, b_n$ are weights assigned to the phenotypic trait measurements $x_1, \ldots, x_n$. The $b$ values are the products of the inverse of the phenotypic variance–covariance matrix, the genotypic variance–covariance matrix, and a vector of economic weights. A number of variations of this index, most changing the manner of computing the $b$ values, have been developed. These include the base index of Williams (1962), the desired gain index of Pesek and Baker (1969), and retrospective indexes proposed by Johnson et al. (1988) and Bernardo (1991). The emphasis in the retrospective index developments is on quantifying the knowledge experienced breeders have obtained. Baker (1986) summarized all selection indexes in plant breeding developed before that time.

1.5.8 Combining ability

Combining ability is a very important concept in plant breeding and it can be used to compare and investigate how two inbred lines can be combined together to produce a productive hybrid or to breed new inbred lines. Selection and development of parental lines or inbreds with strong combining ability is one of the most important breeding objectives, no matter whether the goal is to create a hybrid with strong vigour or develop a pure-line cultivar with improved characteristics compared to their parental lines. In maize breeding, Sprague and Tatum (1942) partitioned the genetic variability among crosses into effects due to primarily either additive or non-additive effects, which correspond to two categories of combining ability, general combining ability (GCA) and special combining ability (SCA). The relative importance of GCA and SCA depends on the extent of previous testing of the parents included in the crosses. Although these concepts were developed for breeding maize, an open-pollinated crop, they are generally applicable to self-pollinated crops.

The GCA for an inbred line or a cultivar can be evaluated by the average performance of yield or other economic traits in a set of hybrid combinations. The SCA for a cross combination can be evaluated by the deviation in its performance from the value expected from the GCA of its two parental lines. If the crosses among a set of inbred lines are made in such a way that each line is crossed with several other lines in a systematic manner, the total variation among crosses can be partitioned into two components ascribable to GCA and SCA. The mean performance of a cross ($\bar{x}_{AB}$) between two inbred lines A and B can be represented as

$\bar{x}_{AB} = \mathrm{GCA}_A + \mathrm{GCA}_B + \mathrm{SCA}_{AB}$

Here $\mathrm{GCA}_A$ and $\mathrm{GCA}_B$ are the GCA of the parents A and B, respectively, and the cross A × B is expected to have a performance equal to the sum ($\mathrm{GCA}_A + \mathrm{GCA}_B$) of the GCA of its parents. The actual performance of the cross, however, may be different from the expectation by an amount equivalent to the SCA. Sprague and Tatum (1942) interpreted these combining abilities in terms of type of gene action. The differences due to GCA of lines are the results of additive genetic variance and additive by additive interaction whereas SCA is a reflection of non-additive genetic variances.

1.5.9 Recurrent selection

Recurrent selection can be broadly defined as the systematic selection of desirable individuals from a population followed by recombination of the selected individuals to form a new population. The basic feature of recurrent selection methods is that they are procedures conducted in a repetitive manner, or recycling, including development of a base population with which to begin selection, evaluation of individuals from
the population, and selection of superior individuals as parents that can be crossed to produce a new population for the next cycle of selection, as shown below:

Develop a population -> Evaluate individuals in the population -> Select superior individuals as parents -> (new population, next cycle)

A cycle of selection is completed each time a new population is formed. The initial population that is developed for a recurrent selection programme is referred to as the base, or cycle 0, population. The population formed after one cycle of selection is called the cycle 1 population; the cycle 2 population is developed from the second cycle of selection, and so on.

Recurrent selection procedures are conducted for primarily quantitatively inherited traits. The objective of recurrent selection is to improve the mean performance of a population of plants by increasing the frequency of favourable alleles in a consistent manner in order to enhance the value of the population and to maintain the genetic variability present in the population as effectively as possible. In addition, separation of the genetic and environmental effects is an important facet of effective recurrent selection methods. The improved populations can be used as a cultivar per se, as parents of a cultivar-cross hybrid and as a source of superior individuals that can be used as inbred lines, pure-line cultivars, clonal cultivars, or parents of a synthetic line. Successful recurrent selection results in an improved population that is superior to the original population in mean performance and in the performance of the best individuals within it. Ideally, the population will be improved without its genetic variability being significantly reduced so that additional selection and improvement can occur in the future. Recurrent selection is complementary to inbred development procedures; in fact the concept of recurrent selection was developed, particularly for outcrossing crops, to rectify limitations in inbred development by continuous selfing, which rapidly leads to inbreeding and allele fixation and thus inadequate opportunity for selection. There are two ways by which recurrent selection addresses this limitation in inbred development (Bernardo, 2002). First, recurrent selection increases the frequency of favourable alleles in the population by repeated cycles of selection. Secondly, recurrent selection maintains the degree of genetic variation in the population to allow sustained progress from subsequent cycles of selection. Genetic variation is maintained by recombining a sufficiently large number of individuals to reduce random fluctuations in allele frequencies, i.e. genetic drift.

Since the late 1950s, extensive research has been conducted to determine the relative importance of different genetic effects on the inheritance of quantitative traits for most cultivated plant species. As indicated by Hallauer (2007), quantitative genetic research has provided extensive information to assist plant breeders in developing breeding and selection strategies. Directly and/or indirectly, the principles for the inheritance of quantitative traits are pervasive in developing superior cultivars to meet the worldwide food, feed, fuel and fibre demands. The principles of quantitative genetics will have continued importance in the future.

1.6 The Green Revolution and the Challenges Ahead

The application of science and technology to crop production in the second half of the 20th century resulted in significant yield improvements for rice, wheat and maize in the developed countries, and the final result of these efforts was the Green Revolution which led to a new type of agriculture, high-input or chemical-genetic agriculture, which replaced the more traditional system. Countries involved in the Green Revolution, a term coined by Borlaug (1972), included Japan, Mexico, India and China among others.
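
As a worked illustration of the selection index described in Section 1.5.7, the sketch below computes index coefficients as b = P⁻¹Ga, following the statement that the b values are the product of the inverse of the phenotypic variance–covariance matrix (P), the genotypic variance–covariance matrix (G) and a vector of economic weights (a). It is a minimal two-trait example in plain Python. The matrices, weights, candidate values and function names are all hypothetical and are chosen only to show the arithmetic.

```python
def invert_2x2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("phenotypic matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(m, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def index_weights(P, G, a):
    """Index coefficients b = P^-1 G a."""
    return mat_vec(invert_2x2(P), mat_vec(G, a))


def index_score(b, x):
    """I = b1*x1 + b2*x2 + ... + bn*xn for one individual."""
    return sum(bi * xi for bi, xi in zip(b, x))


# Hypothetical two-trait example (illustrative numbers only).
P = [[4.0, 1.0], [1.0, 2.0]]   # phenotypic variance-covariance matrix
G = [[2.0, 0.5], [0.5, 0.5]]   # genotypic variance-covariance matrix
a = [1.0, 2.0]                 # economic weights

b = index_weights(P, G, a)
candidates = {"A": [10.0, 3.0], "B": [8.0, 7.0]}
scores = {name: index_score(b, x) for name, x in candidates.items()}
best = max(scores, key=scores.get)
```

In this made-up case candidate B receives the higher index score even though candidate A is better for the first trait, which shows how an index trades traits off against each other instead of culling on each trait separately.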
and social access to a balanced diet and safe drinking water will be threatened, with a holistic approach to nutritional and non-nutritional factors needed to achieve success in the eradication of hunger. Science and technology can play a very important role in stimulating and sustaining an Evergreen Revolution leading to long-term increases in productivity without any associated ecological harm (Borlaug, 2001; Swaminathan, 2007). The objectives of the plant breeder can be realized through conventional breeding integrated with various biotechnology developments (e.g. Damude and Kinney, 2008; Xu et al., 2009c).

Plant breeding can be defined as an evolving science and technology (Fig. 1.2). It has gradually been evolving from art to science over the last 10,000 years, starting as an ancient art to the present molecular design-based science. With the development of molecular tools which will be discussed further in Chapters 2 and 3, plant breeding is becoming quicker, easier, more effective and more efficient (Phillips, 2006). Plant breeders will be well equipped with innovative approaches to identify and/or create genetic variation, to define the genetic feature of the genes related to the variation (position, function and relationship with other genes and environments), to understand the structure of breeding populations, to recombine novel alleles or allele combinations into specific cultivars or hybrids, and to select the best individuals with desirable genetic features which enable them to adapt to a wide range of environments.

[Fig. 1.2 stages, from art-based plant breeding to molecular plant breeding: collection of wild plants for food and selection of wild plants for cultivation (starting from 10,000 years ago); large-scale breeding activities supported by commercial seed production enterprises, hybridization combined with selection, and evolution through natural selection (1700s–1800s); Mendelian genetics, quantitative genetics, mutation, polyploidy and tissue culture (1900s); gene cloning and direct transfer, and genomics-assisted breeding (2000s and beyond).]

Fig. 1.2. The steps of evolution of plant breeding. With the availability of more sophisticated tools, the art of plant breeding became science-based technology, molecular plant breeding.

Sequencing data for many plants is now readily available and the GenBank database is doubling every 15 months. Over 20 plant species including many important crops are in the process of being sequenced (Phillips, 2008). The next challenge is to determine the function of every gene and eventually how genes interact to form the basis of complex traits. Fortunately, DNA chips and other technologies are being developed to study the expression of multiple or even all genes simultaneously. High throughput robotics and bioinformatics tools will play an essential role in this endeavour.

New information about our crop species is expanding our capabilities to use molecular genetics. For example, we did not previously realize how similar broadly related species are in terms of their gene content and gene order. Since these species cannot usually be crossed, there was no means of assessing their relatedness. With the advent of DNA-based molecular markers, the extensive genetic mapping of chromosomes became readily possible for a variety of species. We learned that the genomes were highly similar and that this similarity allowed the prediction of gene locations among species. For example, rice has become the model or reference species for the cereals as many of the gene sequences on the rice chromosomes are shared with other cereals such as maize, sorghum, sugarcane, millet, oats, wheat and barley (Xu et al., 2005). Knowing the complete DNA sequence of a model or reference genome allows genes/traits from this
model to be tracked to other genomes. We have come to realize that the differences between species of plants are not due to novel genes, but to novel allelic specifications and interactions.

Since many fundamental aspects of current plant breeding procedures are not well understood, further data relating to the genetics of crop species may help to shed light on the genetic gains obtained from plant breeding. For example, in successful plant breeding programmes, the genetic base often becomes narrower rather than broader. Elite by elite crosses may be the rule in these programmes. Molecular genetic markers have been widely employed to identify cryptic and novel genetic variation among cultivars and related species and used to increase the efficiency of selection for agronomic traits and the pyramiding of genes from different genetic backgrounds.

Long-term selection programmes would be expected to lead to genetic fixation; however, this has not been found to be the case so far and variation is still observed. Several mechanisms for de novo variation have been described, including intragenic recombination, unequal crossing over among repeated elements, transposon activity, DNA methylation, and paramutation. Another important feature in plant breeding whose molecular basis is not understood is heterosis, although it is used as the basis for many seed-producing industries. Genomics and particularly transcriptomics are now being used to identify the heterotic genes responsible for increasing crop yields. Comprehensive quantitative trait locus-based phenotyping (phenomics), combined with genome-wide expression analysis, should help to identify the loci controlling heterotic phenotypes and thus improve the understanding of the role of heterosis in evolution and the domestication of crop plants (Lippman and Zamir, 2007), and finally to make it possible to predict hybrid performance.

Messenger RNA transcript profiling is an obvious candidate for functional genomic application to plant breeding. Although direct selection at the gene transcript level using microarray or real-time PCR may be a long-term goal, other genomic tools can be used to achieve shorter term goals with more practical applications (Crosbie et al., 2006). Genetic modification of crops today involves the interfacing of molecular biology, cell and tissue culture, and genetics/breeding. The transfer of genes by cellular and molecular means will increase the available gene pool and lead to second generation biotechnology plant products such as those with a modified oil, protein, vitamin, or micronutrient content or those that have been engineered to produce compounds that can be used as vaccines or anti-carcinogens.

While all these new innovations have been useful, practical plant breeding continues to be based on hybridization and selection with little change in the basic procedures. A more complete understanding of the mechanisms by which genetic and environmental variation modify yield and composition is needed so that specific quantitative and qualitative targets can be identified. To achieve this aim, the expertise of plant genomics (including various omics), physiology and agronomy, as well as plant modelling techniques must be combined (Wollenweber et al., 2005) and many logistic and genetic constraints also need to be resolved (Xu and Crouch, 2008).
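
As a closing worked example for the combining ability concepts of Section 1.5.8, the sketch below partitions the means of a small, complete set of line × tester crosses into GCA and SCA. It is only an illustration under stated assumptions. The cross means are invented, the function name is hypothetical, and the combining abilities are expressed as deviations from the grand mean of all crosses, so that each cross mean equals grand mean + GCA of the line + GCA of the tester + SCA. That is a common convention, but it is an assumption relative to the simpler formula given in the text.

```python
from statistics import mean


def combining_ability(crosses):
    """Partition cross means into GCA and SCA.

    `crosses` maps (line, tester) -> mean performance of that hybrid and
    must contain every line x tester combination.
    """
    lines = sorted({line for line, _ in crosses})
    testers = sorted({tester for _, tester in crosses})
    grand = mean(crosses.values())

    # GCA: average performance of a parent's hybrids, as a deviation
    # from the grand mean of all crosses.
    gca = {}
    for line in lines:
        gca[line] = mean(crosses[(line, t)] for t in testers) - grand
    for tester in testers:
        gca[tester] = mean(crosses[(l, tester)] for l in lines) - grand

    # SCA: deviation of each cross from the value expected from the GCA
    # of its two parents.
    sca = {
        (l, t): crosses[(l, t)] - (grand + gca[l] + gca[t])
        for l in lines
        for t in testers
    }
    return grand, gca, sca


# Hypothetical hybrid yields (t/ha) for two lines crossed to two testers.
crosses = {
    ("A", "T1"): 8.0, ("A", "T2"): 6.0,
    ("B", "T1"): 5.0, ("B", "T2"): 5.0,
}
grand, gca, sca = combining_ability(crosses)
```

With these numbers line A has a GCA of +1.0 and line B of -1.0, while the cross A × T1 performs 0.5 units better than its parents' GCA values predict, which is its SCA.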
2
Molecular Breeding Tools:
Markers and Maps
Table 2.1 lists the major molecular marker technologies that are currently available. Only a selection of widely-used representative types of markers will be discussed in this section. Figure 2.1 shows the molecular mechanism of several major DNA markers and the genetic polymorphisms that can be generated by restriction site or PCR priming site mutation, insertion, deletion or by changing the number of repeat units between two restriction or PCR priming sites and nucleotide mutation resulting in a single nucleotide polymorphism (SNP). There are several comprehensive reviews that cover all the important DNA markers, e.g. Reiter (2001), Avise (2004), Mohler and Schwarz (2005) and Falque and Santoni (2007). Further information regarding the application of DNA markers in genetics and breeding can be found in Lörz and Wenzel (2005). After a brief review of the classical markers, DNA markers will be discussed in more detail in this section.

[Fig. 2.1 panels: A. Mutation at enzyme restriction or PCR priming site (RFLP, AFLP, CAPS); B. Insertion between enzyme restriction or PCR priming sites; C. Deletion between enzyme restriction or PCR priming sites; D. Change of tandem repeat units between enzyme restriction or PCR priming sites (SSR, VNTR, ISSR).]

Fig. 2.1. Molecular basis of major DNA markers. Parts A–E show different ways in which DNA markers (listed below each diagram) can be generated. The cross in part A indicates that mutation has eliminated the priming site. Abbreviations: as defined in Table 2.1; VNTR, variable number of tandem repeat; CAPS, a DNA marker generated by specific primer PCR combined with RFLP; ISSR, inter simple sequence repeat.

2.1.1 Classical markers

Morphological markers

In the late 1800s, following his studies on the garden pea (Pisum sativum), G.J. Mendel proposed two basic rules of genetics,
which were later known as the Mendelian laws of equal segregation and independent assortment. Mendel selected individuals which differed in a particular trait and used them as the parental lines in a cross breeding experiment to determine the phenotype of the offspring with regard to the selected trait. The term phenotype (derived from Greek) literally means 'the form that is shown' and is used by both geneticists and breeders. The seven pairs of contrasting phenotypes studied by Mendel included round versus wrinkled seeds, yellow versus green seeds, purple versus white petals, inflated versus pinched pods, green versus yellow pods, axial versus terminal flowers and long versus short stems. The plants in the segregating populations of the pea, such as F2 and backcross, were classified into two distinct groups depending on their phenotypes. These contrasting morphological phenotypes are the starting point for any genetic analysis and can be mapped to particular chromosomes using the Mendelian laws of inheritance and can thus be used as morphological markers of the genome and the particular trait.

Morphological markers therefore generally represent genetic polymorphisms which are visible as differences in appearance, such as the relative difference in plant height and colour, distinct differences in response to abiotic and biotic stresses, and the presence/absence of other specific morphological characteristics. A large number of variants showing particular morphological or physiological phenotypes have been generated by tissue culture and mutation breeding. Using selection techniques these variants can be genetically stabilized and then used as morphological markers.

Some genetic stocks contain more than one morphological marker. For example, there are a total of over 300 morphological markers available for genetic studies in rice (Khush, 1987) and more are being created for functional genomics. Many morphological marker stocks are also available for tomato (http://www.plantpath.wisc.edu/GeminivirusResistantTomatoes/MERC/Tomato/Tomato.html), maize (Neuffer et al., 1997) and soybean (Palmer and Shoemaker, 1998). Many of these markers have been linked with other agronomic traits.

Morphological markers are usually mapped by classical two- or three-point linkage tests. The linkage groups are established and the order of the markers and the relative distance between any two are determined by their recombinant frequencies. Relatively complete linkage maps have been constructed in many crop species using morphological markers and these maps provide the fundamental information for the genetic mapping of many physiological and biochemical traits.

However, it is difficult to construct a relatively saturated genetic map because of the limitation in the number of morphological markers with distinguishable polymorphisms. In addition, many morphological markers have deleterious effects on phenotypes and some are significantly affected by other factors such as environments or maturity, which results in potential problems when these markers are used for genetics and plant breeding.

Cytological markers

By studying the morphology, number and structure of chromosomes from different species, particular cytogenetic features can be found, such as various types of aneuploidy, variants of chromosome structure and abnormal chromosomes. These can be used as genetic markers to locate other genes on to chromosomes and determine their relative positions, or used for genetic mapping via chromosome manipulations such as chromosome substitution.

The structural features of chromosomes can be shown by chromosome karyotype and bands. The banding patterns are indicated by colour, width, order and position, revealing the difference in distributions of euchromatin and heterochromatin. There are Q bands (produced by quinacrine hydrochloride), G bands (produced by Giemsa stain) and R bands (reversed Giemsa). These chromosome landmarks are not only useful for characterizing normal chromosomes but also for detecting chromosome mutation.
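
The two- and three-point linkage tests mentioned under morphological markers rest on simple arithmetic, which the sketch below illustrates. It is a minimal example under stated assumptions. The progeny counts are invented testcross data, the function names are hypothetical, and the ordering rule used (the pair of markers with the largest recombination frequency are the two outer markers) is the usual textbook heuristic for three linked markers, not a procedure spelled out in this chapter.

```python
def recombination_frequency(parental, recombinant):
    """Fraction of scored progeny that are recombinant for a marker pair."""
    total = parental + recombinant
    if total == 0:
        raise ValueError("no progeny scored")
    return recombinant / total


def order_three_markers(pairwise):
    """Order three linked markers from pairwise recombination frequencies.

    `pairwise` maps frozenset({m1, m2}) -> recombination frequency.
    The pair with the largest frequency is taken as the two outer markers.
    """
    outer = max(pairwise, key=pairwise.get)
    markers = set().union(*pairwise)
    middle = (markers - outer).pop()
    first, last = sorted(outer)
    return [first, middle, last]


# Hypothetical testcross counts (parental, recombinant) for three markers.
pairwise = {
    frozenset({"A", "B"}): recombination_frequency(900, 100),
    frozenset({"B", "C"}): recombination_frequency(850, 150),
    frozenset({"A", "C"}): recombination_frequency(770, 230),
}
order = order_three_markers(pairwise)
```

Here the A–C pair shows the highest recombination frequency (0.23), so B is placed between A and C. The A–C value is slightly smaller than the sum of the two adjacent intervals (0.10 + 0.15), as expected when some double crossovers go undetected.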
Cytological markers have been widely used to identify linkage groups within specific chromosomes and have been widely applied in physical mapping. However, because of the limited number and resolution, they have limited applications in genetic diversity analysis, genetic mapping and marker-assisted selection (MAS).

Protein markers

Isozymes are structural variants of an enzyme and while they differ from the original enzyme in molecular weight and mobility in an electric field, they have the same catalytic activity. The difference in enzyme mobility is caused by amino acid substitutions resulting from point mutations, such that isozymes reflect the products of different alleles rather than different genes. Therefore, isozymes can be genetically mapped on to chromosomes and then used as genetic markers for mapping other genes. Isozyme markers are based on their biochemistry and thus are also known as biochemical or protein markers.

However, their use as markers is limited. For example, a total of 57 isozymes representing about 100 loci have been identified in plants (Vallegos and Chase, 1991) but for specific species only 10–20 isozymes are available so that they cannot be used to construct a complete genetic map. Each isozyme can only be identified with a specific stain, which also limits their use in practice.

2.1.2 DNA markers

RFLP

Botstein et al. (1980) first used DNA restriction fragment length polymorphism (RFLP) in human linkage mapping and this pioneered the utilization of DNA polymorphisms as genetic markers. It is known that the genomes of all organisms show many sites of neutral variation at the DNA level. These neutral variant sites do not have any effect on the phenotype. In some cases a neutral site is nothing more than a single nucleotide difference within a gene or between genes; and in others it represents the site of a variable number of tandem repeats of 'junk' DNA present between genes. The development of RFLP markers has accelerated the construction of molecular linkage maps for many organisms, improved the accuracy of gene location, and reduced the time required to establish a complete linkage map.

The digestion of purified DNA using restriction enzymes, which cut the DNA strand wherever there is a recognition site sequence (usually four to eight base pairs), leads to the formation of RFLPs which yield a molecular fingerprint that may be unique to a particular individual. If the bases are positioned at random in the genome, an enzyme having a recognition site with six bases will cleave the DNA every 4096 bases on average ($4^6$). A genome of $10^9$ bases could thus produce around 250,000 restriction fragments of variable length. Gel electrophoresis on such a large number of genomic DNA digestion products produces a continuous smear image. Particular fragments that are homologous between several individuals, and possibly allelic, can be separated only by means of molecular probes using the Southern technique (Southern, 1975). RFLP analysis includes the following steps (Fig. 2.2):

1. DNA isolation: a significant amount of DNA must be isolated from multiple individuals from target genotypes (parents and segregating populations, germplasm survey, garden blot, etc.) and purified to a fairly stringent degree as contaminants can often interfere with the restriction enzyme and inhibit its ability to digest the DNA.
2. Restriction digestion: restriction enzyme is added to purified genomic DNA under buffered conditions. The enzyme cuts at recognition sites throughout the genome and leaves behind hundreds of thousands of fragments.
3. Gel electrophoresis: digested products (restriction fragments) are electrophoresed on agarose gel and when visualized appear as smears because of the large number of fragments.
4. The DNA in the agarose gel is denatured using NaOH solution and then neutralized.
5. The DNA fragments are transferred to a nitrocellulose membrane using Southern blotting.
6. Probe visualization: the membrane-bound genomic DNA is probed by hybridization using a cloned fragment of the genome of interest or a genome from a relatively close species as the probe.
7. The membrane is washed to remove non-specifically hybridized DNA.
8. In most cases the sizes of the fragments are determined by radioactive methods. The probe-restriction enzyme combinations may identify two or more differently sized fragments. Polymorphism is revealed whenever the recognized fragments are of non-identical lengths.

Fig. 2.2. RFLP workflow from DNA extraction to radio-autograph. Modified from Xu and Zhu (1994).

Differences in size of restriction fragments are due to: (i) base pair changes that result in gain and loss of restriction sites; and (ii) insertions/deletions at the restriction sites within the restriction fragments on which the probe sequence is located.

Molecular probes are DNA fragments isolated and individualized by cloning or PCR amplification. They may originate from fragmented total genomic DNA and thus contain coding or non-coding sequences, unique or repeated, of nuclear or cytoplasmic origin. They may also be complementary DNA (cDNA). The standard procedure for developing genomic DNA probes is to digest total DNA with a methylation-sensitive enzyme (e.g. PstI), thereby enriching the library for single-copy sequences (Burr et al., 1988). Typically, the digested DNA is size fractionated on a preparative agarose gel. DNA fragments ranging from 500 to 2000 bp are excised and eluted for cloning into a plasmid vector (e.g. pUC18). Digests of the plasmids are screened for inserts and their lengths can be estimated. Southern blots of the inserts can be probed with total sheared genomic DNA to select clones that hybridize to single- and low-copy sequences and to eliminate clones that hybridize to medium- and high-copy repeated sequences. Single- and low-copy probes are screened for RFLPs among a sample of genotypes using genomic DNAs digested with restriction endonucleases (one per assay). Typically, in species with moderate to high polymorphism rates, two to four restriction endonucleases with hexanucleotide recognition sites are tested. EcoRI, EcoRV and HindIII are widely used. In species with low polymorphism rates, additional restriction endonucleases can be tested to increase the chance of finding a polymorphism. Both the theory and the techniques for RFLP analysis in plant genome mapping have been intensively reviewed (Botstein et al., 1980; Tanksley et al., 1988).
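
The fragment numbers quoted in the RFLP description (a six-base recognition site cutting about once every 4096 bases, so that a genome of 10^9 bases gives roughly 250,000 fragments) follow from a short calculation, sketched below. It assumes, as the text does, that the four bases occur at random and with equal frequency; real genomes deviate from this, so the figures are order-of-magnitude estimates only. The function names are hypothetical.

```python
def expected_cut_interval(site_length, alphabet=4):
    """Average spacing between recognition sites for random, equally
    frequent bases: one site every alphabet**site_length positions."""
    return alphabet ** site_length


def expected_fragments(genome_size, site_length):
    """Approximate number of restriction fragments from a full digest."""
    return genome_size / expected_cut_interval(site_length)


interval = expected_cut_interval(6)          # six-base cutter such as EcoRI
fragments = expected_fragments(10**9, 6)     # genome of 10^9 bases

# Four-base and eight-base cutters, for comparison.
interval_4 = expected_cut_interval(4)
interval_8 = expected_cut_interval(8)
```

A six-base cutter gives about 244,000 fragments for this genome size, consistent with the "around 250,000" quoted earlier and with the smear seen on the gel. A four-base cutter would cut every 256 bases on average, and an eight-base cutter only every 65,536 bases.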
Most RFLP markers are co-dominant and locus specific. RFLP genotyping is highly reproducible and the methodology is simple and requires no special instrumentation. High-throughput markers (e.g. cleaved amplified polymorphic sequence (CAPS) or insertion/deletion (indel) markers) can be developed from RFLP probe sequences. The CAPS technique, also known as PCR-RFLP, consists of digesting a PCR-amplified fragment with one or several restriction enzymes, and detecting the polymorphism by the presence/absence of restriction sites (Konieczny and Ausubel, 1993).

RFLP markers are powerful tools for comparative and synteny mapping. However, RFLP analysis requires large amounts of high quality DNA, has low genotyping throughput and is very difficult to automate. Most genotyping involves radioactive methods so its use is limited to specific laboratories. RFLP probes must be physically maintained and it is therefore difficult to share them between laboratories. In addition, the level of RFLP polymorphism is relatively low and selection for polymorphic parental lines is a limiting step in the development of a complete RFLP map.

is used to amplify random sequences from a complex DNA template that is complementary to it (or includes a limited number of mismatches). This means that the amplified fragments generated by PCR depend on the length and size of both the primer and the target genome. Ten-base oligomers of varying GC content (ranging from 40 to 100%) are usually used. If two hybridization sites are close to one another (within about 3000 bp) and in opposite directions, that is, in a configuration that will allow the PCR, amplification will take place. The amplified products (of up to 3.0 kb) are usually separated on agarose gels and visualized using ethidium bromide staining. The use of a single 10-mer oligonucleotide promotes the generation of several discrete DNA products and these are considered to originate from different genetic loci. Polymorphisms result from mutations or rearrangements either at or between the primer binding sites and are visible in conventional agarose gel electrophoresis as the presence or absence of a particular RAPD band. RAPDs predominantly provide dominant markers but homologous allele combinations can sometimes be identified with the help of detailed pedigree information.
RAPD RAPDs have several advantages and for
this reason they are widely used (Karp and
Williams et al. (1990) and Welsh and Edwards, 1997). (i) Neither DNA probes nor
McClelland (1990) independently described sequence information is required for the
the utilization of a single, random-sequence design of specific primers. (ii) The proce-
oligonucleotide primer in a low stringency dure does not involve blotting or hybridiza-
PCR (3545C) for the simultaneous ampli- tion steps thus making the technique quick,
fication of several discrete DNA fragments simple and efficient. (iii) RAPDs require rel-
referred to as random amplified polymor- atively small amounts of DNA (about 10 ng
phic DNA (RAPD) and arbitrary primed PCR per reaction) and the procedure can be auto-
(AP-PCR), respectively. Another related mated; they are also capable of detecting
technique is DNA amplification fingerprint- higher levels of polymorphism than RFLPs.
ing (DAF) (Caetano-Anolls et al., 1991). (iv) Development of markers is not required
These methods differ from one another in and the technology can be applied to vir-
primer length, the stringency of the con- tually any organism with minimal initial
ditions and the method of separation and development. (v) The primers can be uni-
detection of the fragments. They all can be versal and one set of primers can be used for
used to identify RAPD. any species. In addition, RAPD products of
The principle of RAPD consists of a interest can be cloned, sequenced and con-
PCR on the DNA of the individual under verted into other types of PCR-based mark-
study using a short primer, usually ten ers such as sequence tagged sites (STS),
nucleotides, of arbitrary sequence. The sequenced characterized amplified regions
primer which binds to many different loci (SCAR), etc.
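The RAPD principle described above (a single short primer, two binding sites in opposite orientation within about 3 kb, and presence/absence scoring) can be sketched in a few lines of Python. This is only an illustrative in silico model, not a published protocol: it assumes exact primer matches, whereas real low-stringency PCR tolerates some mismatches, and the primer, spacer and allele sequences are hypothetical.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def find_sites(template, motif):
    """0-based start positions of every exact occurrence of motif."""
    sites = []
    start = template.find(motif)
    while start != -1:
        sites.append(start)
        start = template.find(motif, start + 1)
    return sites


def rapd_products(template, primer, max_len=3000):
    """Sizes (bp) of fragments flanked by two inward-facing primer sites."""
    forward = find_sites(template, primer)           # primer extends rightwards
    reverse = find_sites(template, revcomp(primer))  # primer extends leftwards
    sizes = []
    for f in forward:
        for r in reverse:
            size = r + len(primer) - f
            # Sites must not overlap and the product must be short enough.
            if r >= f + len(primer) and size <= max_len:
                sizes.append(size)
    return sorted(sizes)


PRIMER = "ACGTTGCAAG"                       # hypothetical 10-mer, 50% GC
SPACER = "T" * 50                           # hypothetical intervening sequence
allele_a = PRIMER + SPACER + revcomp(PRIMER)
allele_b = PRIMER + SPACER + "CTTGCATCGT"   # one base changed in the second site

print(rapd_products(allele_a, PRIMER))      # one band expected
print(rapd_products(allele_b, PRIMER))      # band absent
```

In this toy case a single base change in one binding site removes the band, which is why RAPD polymorphisms are usually scored as dominant (band present versus band absent).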
Reproducibility affects the way in which RAPD bands can be standardized for comparison across
laboratories, samples and trials and whether RAPD marker information can be accumulated or
shared. Due to frequently observed problems with reproducibility of overall RAPD profiles and
specific bands, this marker class is often treated with reserve. In replication studies by
Pérez et al. (1998), mispriming error amounted to 60%. Several factors have been shown to
affect the number, size and intensity of bands. These include PCR buffers, deoxynucleotide
triphosphates (dNTPs), Mg2+ concentration, cycling parameters, source of Taq polymerase,
condition and concentration of template DNA and primer concentration. Results obtained by
RAPDs are highly prone to user error and bands obtained can vary considerably between
different runs of the same sample. To correct the problems that may be encountered when
carrying out RAPD-PCR, it is important to bear in mind the following: (i) the concentration
of DNA can alter the number of bands; (ii) RAPD profiles vary depending on the Mg2+
concentration and the PCR buffer provided by Taq polymerase suppliers may or may not contain
Mg2+ ions; (iii) there are different sources of Taq polymerase and there is great variation
between profiles produced using Taq polymerase obtained from different companies; (iv) there
are a large number of alternative cycling times and temperatures which are equally important
and depend on the type of machine used and even the wall thickness of the PCR tubes.
Generally, if a PCR does not work there is likely to be something wrong with the template
DNA, primers, Taq polymerase or choice of conditions. Initially it is important to try to
repeat the PCR under the same conditions to ensure that there was not a simple error that
resulted in the failure. In addition it is recommended that both positive and negative
controls are included. A positive control with a template known to amplify well will ensure
that all reagents have been added and that they are all functioning. A negative control
without template DNA will reveal any contamination. In most cases, if the PCR does not work
and it is not clear what might be causing the problem, it is worth starting from the
beginning by disposing of all the reagents used and preparing fresh ones. A careful
experiment revealed that reproducibility could be improved and Taberner et al. (1997)
reported that 3396 out of 3422 bands (99.2%) were reproducible.
On the other hand, low reproducibility is a major limitation of RAPD markers, particularly in
ongoing genetic and plant breeding programmes in which the accumulated information, markers
and marker data are shared between laboratories and experiments. RAPD markers may still find
their applications in independent genetic diversity and phylogenetic studies that do not
depend on data sharing or accumulation. As RAPD markers can be converted into other types of
markers, they have a unique role in the development of target markers for crop species that
have limited molecular markers available to cover the whole genome.
To overcome the problem associated with RAPD analysis, Paran and Michelmore (1993) converted
RAPD fragments into simple and robust PCR markers known as SCARs. This procedure increases
the reproducibility of RAPD markers and also avoids the occurrence of non-homologous markers
of equal molecular weight. These specific markers are obtained by converting polymorphic RAPD
bands into single-locus markers: the bands are sequenced and specific primers are designed,
usually by extending the original decamer primer sequence by 10–15 bases, so that only the
band of interest is amplified. In general, DNA can be isolated from agarose gels, cloned and
sequenced to produce the starting DNA template for the development of a variety of PCR-based
markers. The cloned and sequenced DNA fragments can then be used for the development of CAPS,
single strand conformation polymorphism (SSCP) or SNP markers.
AFLP
Amplified fragment length polymorphism (AFLP; Zabeau and Voss, 1993; Vos et al., 1995) is
based on the selective PCR amplification of restriction fragments from a total
with a complementary sequence for the rare cutter and the other with the complementary
sequence for the frequent cutter. In this way only fragments which have been cut by the
frequent cutter and rare cutter will be amplified. Primers are designed from the known
sequence of the adaptor, plus one to three selective nucleotides which extend into the
fragment sequence. Sequences not matching these selective nucleotides in the primer will not
be amplified, so that the specific amplification of only those fragments matching the primers
is achieved. The option to permutate the order of the selective bases and to recombine the
primers with each other will theoretically lead to the gradual collection of all restriction
fragments from a particular enzyme combination that are of a suitable size for DNA fragment
analysis from a genotype. The multiplex ratio of an AFLP assay is a function of the number of
selective nucleotides in the AFLP primer combination, the selective nucleotide motif, GC
content and physical genome size and complexity. Typically, two selective nucleotides are
used for species with small genomes (1 × 10^8 to 5 × 10^8 bp), e.g. Arabidopsis thaliana L.
(1 × 10^8 bp) and rice (Oryza sativa L.) (4 × 10^8 bp), and three selective nucleotides are
used for species with large genomes (5 × 10^8 to 6 × 10^9 bp), e.g. maize, soybean, sunflower
and many others. It is theoretically possible to use several tens of combinations of
restriction enzymes at sites of four to six bases and a large number of combinations of
selective bases on the amplification primers. Thus, as indicated by Falque and Santoni
(2007), the restriction–amplification combinations are nearly infinite.
AFLP products can be separated in high-resolution electrophoresis systems. The number of
bands produced can be manipulated by the number of selective nucleotides and the nucleotide
motifs used. A well-balanced number of amplified restriction fragments ranges from 50 to
150 bp. A major improvement has been made by switching from radioactive to fluorescent
dye-labelled primers for the detection of fragments in gel-based or capillary DNA sequencers
in which fluorescently labelled fragments pass the detector near the bottom of the gel/end of
the capillary, resulting in a linear spacing of DNA fragments and therefore increasing the
resolution over the whole size range (Schwarz et al., 2000).
In general, AFLP assays can be carried out using relatively small DNA samples (typically
1–100 ng per individual). AFLP has a very high multiplex ratio and genotyping throughput and
is relatively reproducible across laboratories. Simple off-the-shelf technology can be
applied to virtually any organism with no formal marker development required and, in
addition, a set of primers can be used for different species. However, there are limitations
to the AFLP assay. (i) The maximum polymorphic information content for any bi-allelic marker
is 0.5. (ii) High quality DNA is needed to ensure complete restriction enzyme digestion.
Rapid methods for isolating DNA may not produce sufficiently clean template DNA for AFLP
analysis. (iii) Proprietary technology is needed to score heterozygotes and ++ homozygotes,
otherwise AFLPs must be dominantly scored. (iv) AFLP markers often cluster densely in
centromeric regions in species with large genomes, e.g. barley (Qi et al., 1998) and
sunflower (Gedil et al., 2001). (v) Developing locus-specific markers from individual
fragments can be difficult. (vi) AFLP primer screening is often necessary to identify optimal
primer specificities and combinations; otherwise the assays can be carried out using
off-the-shelf technology. (vii) There are relatively high technical demands in AFLP analysis,
including radio-labelling and skilled manpower. (viii) Marker development is complicated and
not cost-effective. (ix) Reproducibility is relatively low compared to RFLP and simple
sequence repeat (SSR) markers but better than that of RAPD markers, as AFLP reveals large
numbers of bands and not all the bands will be comparable across laboratories or trials due
to potential false positives, false negatives and complicated gel backgrounds.
The AFLP technique can be modified so that one primer is obtained from a known multi-copy
sequence to detect sequence-specific amplification polymorphisms. This approach was used
successfully to generate
libraries enriched for one or more repeat motifs (although SSR-enriched libraries can be
commercially purchased) and the high start-up costs for automated methods.
SNP
A single nucleotide polymorphism or SNP (pronounced 'snip') is an individual nucleotide base
difference between two DNA sequences. SNPs can be categorized according to nucleotide
substitution as either transitions (C/T or G/A) or transversions (C/G, A/T, C/A or T/G). For
example, sequenced DNA fragments from two different individuals, AAGCCTA and AAGCTTA, contain
a single nucleotide difference. In this case there are two alleles: C and T. C/T transitions
constitute 67% of the SNPs observed in humans, and about the same rate was also found in
plants (Edwards et al., 2007a). In practice, single base variants in cDNA (mRNA) are
considered to be SNPs, as are single base insertions and deletions (indels) in the genome. As
a nucleotide base is the smallest unit of inheritance, SNPs provide the ultimate form of
molecular marker.
For a variation to be considered a SNP, it must occur in at least 1% of the population. SNPs
make up about 90% of all human genetic variation and occur every 100–300 bases. Two of every
three SNPs involve the replacement of cytosine (C) with thymine (T). This is supported by a
genome-wide analysis in rice. A polymorphism database constructed to define polymorphisms
between cultivars Nipponbare (from subspecies japonica) and 93-11 (from subspecies indica)
contains 1,703,176 SNPs and 479,406 indels (Shen et al., 2004), which equates to
approximately 1 SNP/268 bp in the rice genome. Using alignments of the improved whole-genome
shotgun sequences for japonica and indica rice, SNP frequencies varied from 3 SNPs/kb in
coding sequences to 27.6 SNPs/kb in the transposable elements, with a genome-wide measure of
15 SNPs/kb or 1 SNP/66 bp (Yu et al., 2005). Based on partial genomic sequence information,
SNP frequencies have been revealed in many crops, including barley, soybean, sugarbeet,
maize, cassava and potato; typical SNP frequencies are also in the range of one SNP every
100–300 bp in plants (see Edwards et al., 2007a for a review).
SNPs may fall within coding sequences of genes, non-coding regions of genes or in the
intergenic regions between genes at different frequencies in different chromosome regions. In
Arabidopsis the distribution of SNPs was found to be even across the five chromosomes with
the exception of centromeric regions, which contain few transcribed genes (Schmid et al.,
2003). SNPs within a coding sequence will not necessarily change the amino acid sequence of
the protein that is produced, due to redundancy in the genetic code. A SNP in which both
forms lead to the same polypeptide sequence is termed synonymous, while if a different
polypeptide sequence is produced it is termed non-synonymous. SNPs that are not in protein
coding regions may still have consequences for gene splicing, transcription factor binding or
the sequence of non-coding RNA. Of the 3–17 million SNPs found in the human genome, 5% are
expected to occur within genes. Therefore, each gene may be expected to contain 6 SNPs.
A variety of approaches have been adopted for discovery of novel SNPs in a wide range of
organisms including plants. These fall into three general categories (Edwards et al., 2007b):
(i) in vitro discovery, where new sequence data is generated; (ii) in silico methods that
rely on the analysis of available sequence data; and (iii) indirect discovery, where the base
sequence of the polymorphism remains unknown. On the other hand, a large number of different
SNP genotyping methods and chemistries have been developed based on various methods of
allelic discrimination and detection platforms. A convenient method for detecting SNPs is
RFLP (SNP-RFLP) or the CAPS marker technique. If one allele contains a recognition site for a
restriction enzyme while the other does not, digestion of the two alleles will give rise to
fragments of different length. A simple procedure is to analyse the sequence data stored in
the
major databases and identify SNPs. Four alleles can be identified when the complete base
sequence of a segment of DNA is considered and these are represented by A, T, G and C at each
SNP locus in that segment.
Sobrino et al. (2005) assigned the majority of SNP genotyping assays to one of four groups
based on the molecular mechanisms: allele-specific hybridization, primer extension,
oligonucleotide ligation and invasive cleavage. These four are described below. Chagné et al.
(2007) added three methods to this list (sequencing, allele-specific PCR amplification and
DNA conformation methods) and also generalized the enzymatic cleavage method to include the
Invader assay as well as dCAPS and targeting induced local lesions in genomes (TILLING).
1. Allele-specific hybridization (ASH), also known as allelic-specific oligonucleotide
hybridization, is based on distinguishing by hybridization between two DNA targets differing
at one nucleotide position (Wallace et al., 1979). Allelic discrimination can be achieved
using two allele-specific probes labelled with a probe-specific fluorescent dye and a generic
quencher that reduces fluorescence in the intact probe. During amplification of the sequence
surrounding the SNP, probes complementary to the DNA target are cleaved by the 5' exonuclease
activity of Taq polymerase. Spatial separation of the dye and quencher results in an increase
in probe-specific fluorescence which can be detected with a plate reader.
Under optimized assay conditions, the SNP can be detected by the difference in Tm of the two
probe–template hybrids, as only the perfectly matched probe–target hybrids are stable and
those with a one-base mismatch are unstable. To increase the reliability of SNP genotyping
the probes should be as short as possible. Originally, ASH used the dot blot format in which
probes are hybridized to membrane-bound genomic DNA or PCR fragments. However, the more
advanced PCR-based dynamic allele-specific hybridization (DASH) method uses a microtitre
plate format (Howell et al., 1999). Since one of the PCR primers is biotinylated at the
5' end, the PCR products can be bound to streptavidin-coated wells and denatured under
alkaline conditions. An oligonucleotide probe complementary to one allele is added to the
single-strand target DNA molecules. The differences in melting curves are measured by slowly
heating and observing the changes in fluorescence of a double-strand-specific, intercalating
dye. The 5' nuclease or TaqMan assay, molecular beacon and the scorpion assays are all
examples of ASH SNP genotyping technologies. Large-scale scanning of SNPs in a vast number of
loci using allele-specific hybridization can be carried out on high-density oligonucleotide
chips.
2. The Invader assay, also known as flap endonuclease discrimination, is based on the
specificity of recognition and cleavage, by a flap endonuclease, of the three-dimensional
structure which is formed when two overlapping oligonucleotides hybridize perfectly to a
target DNA (Lyamichev et al., 1999). The cleaved fragment may be labelled with a
probe-specific fluorescent dye which fluoresces following probe cleavage due to spatial
separation from the quencher. Alternatively, the flap may act as the invader probe in a
secondary reaction to amplify the fluorescent signal (Invader squared) (Hall et al., 2000).
Third Wave Technologies Inc. (http://www.twt.com) has manufactured an Invader assay for flap
endonuclease discrimination which can be carried out in solid phase using
oligonucleotide-bound streptavidin-coated particles (Wilkins-Stevens et al., 2001).
3. Primer extension is a term used to describe mini-sequencing, single-base extension or the
GOOD assay (Sauer et al., 2002). A popular method which was designed specifically for
genotyping SNPs is the mini-sequencing technique (Syvänen, 1999; Syvänen et al., 1990). The
method forms the basis of a number of methods for allelic discrimination. The robust
detection of known mutations employs oligonucleotides which anneal immediately upstream of
the query SNP and are then extended by a single dideoxynucleotide triphosphate (ddNTP) in
cycle sequencing reactions. The fidelity of thermostable proof-reading DNA polymerases
guarantees that only the complementary ddNTP is incorporated. Several
[Figure 2.5 not reproduced. Labels recoverable from the diagram: allele-specific
hybridization; allele-specific PCR; allele-specific extend/ligate; oligonucleotide ligation
assay; single nucleotide primer extension; 5'-nuclease; homogeneous; semi-homogeneous; solid
phase microspheres; solid phase microarray; fluorescence; fluorescence resonance energy
transfer (FRET); fluorescence polarization; mass spectrometry; capillary electrophoresis;
flow cytometry; microarray minisequencing; Illumina BeadArray; Luminex 100; Sequenom iPlex;
ABI SNPlex; ABI TaqMan; ABI SNaPshot; DASH, amplicon Tm; Perkin-Elmer FP-TDI.]
Fig. 2.5. Chemistry, demultiplexing, detection options in SNP genotyping. From Syvänen (2001) reprinted
by permission from Macmillan Publishers Ltd.
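As a small illustration of the SNP definitions and the single-base extension principle discussed in this section, the sketch below compares the two example fragments from the text (AAGCCTA and AAGCTTA), classifies the difference as a transition or transversion, and reports which ddNTP would be incorporated opposite each allele. It assumes that the two sequences are already aligned and of equal length, and that the extension primer anneals to the strand shown, so the incorporated base is the complement of the template base; the function names are hypothetical.

```python
PURINES = {"A", "G"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def find_snps(seq_a, seq_b):
    """(position, allele_a, allele_b) for each mismatch in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned and of equal length")
    return [(i, a, b) for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]


def classify(allele_a, allele_b):
    """Transition: purine<->purine or pyrimidine<->pyrimidine; else transversion."""
    same_class = (allele_a in PURINES) == (allele_b in PURINES)
    return "transition" if same_class else "transversion"


def extension_base(template_base):
    """ddNTP incorporated opposite the template base in single-base extension."""
    return COMPLEMENT[template_base]


snps = find_snps("AAGCCTA", "AAGCTTA")
for pos, a, b in snps:
    # 0-based position, alleles, substitution class, ddNTP for each allele
    print(pos, f"{a}/{b}", classify(a, b), extension_base(a), extension_base(b))
```

For the example pair this reports a single C/T difference at the fifth base, classed as a transition, with ddGTP and ddATP as the bases incorporated opposite the C and T alleles under the strand assumption stated above.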
LIGHT DETECTION. Pyrosequencing involves hybridization of a sequencing primer to a single
stranded template and sequential addition of individual dNTPs. Incorporation of a dNTP into a
primer releases pyrophosphate which triggers a luciferase-catalysed reaction. The genotype of
a SNP is determined by the sequential addition (and degradation) of nucleotides. The light
produced is detected by a charge coupled device camera and each light signal is proportional
to the number of nucleotides incorporated (http://www.pyrosequencing.com), for which reason
pyrosequencing is suitable for the quantitative estimation of allele frequencies in pooled
DNA samples. Furthermore, pyrosequencing proved to be an appropriate method for genotyping
SNPs in polyploid plant genomes such as potato because all possible allelic states of binary
SNPs could be accurately distinguished (Rickert et al., 2002).
There are various SNP detection systems which differ in their chemistry, detection platform,
multiplex level and application; some of these will be discussed below. The reader is also
referred to Bagge and Lübberstedt (2008) for further information.
The TaqMan SNP Genotyping Assay (Applied Biosystems, Foster City, USA) is a single-tube PCR
assay that exploits the 5' exonuclease activity of AmpliTaq Gold DNA polymerase. The assay
kit includes two locus-specific PCR primers that flank the SNP of interest and two
allele-specific oligonucleotide TaqMan probes. These probes have a fluorescent reporter dye
at the 5' end and a non-fluorescent quencher with a minor groove binder at the 3' end. Upon
cleavage by the 5' exonuclease activity of Taq polymerase during PCR, the reporter dye will
fluoresce as it is no longer quenched and the intensity of the emitted light can be measured.
Modified probes such as locked nucleic acids, a modified nucleic acid analogue, showed better
hybridization properties than standard TaqMan probes (Kennedy et al., 2006). TaqMan is a
simple assay, since all the reagents are added to the microtitre well at the same time in a
96- or 384-well format. Although the assay can be carried out at the monoplex or duplex
level, the multiple steps can be assembled and automated so that one laboratory technician
can produce 10,000 data points per day. The TaqMan platform is highly suitable for
genetically modified organism tests and MAS using a few markers for a large number of
samples.
The SNaPshot Multiplex Assay (Applied Biosystems, Foster City, USA) is based on
mini-sequencing, i.e. a single-base extension using fluorescent labelled ddNTPs. The system's
multiplex-ready reaction mix enables robust multiplex SNP interrogation of PCR-generated
templates. Multiplexing can be accomplished by representing multiple SNP products spatially.
This is achieved by tailing the 5' end of the unlabelled SNaPshot primers with different
lengths of non-complementary oligonucleotide sequences that serve as mobility modifiers. The
reactions may be carried out in 5- to 10-plex using capillary electrophoresis for data
detection in a 96-well format so that one individual can generate over 10,000 data points per
day. SNaPshot is suitable for MAS of several traits simultaneously and, if multiple sets of
10-plex are combined, it can be used for rough mapping and marker-assisted backcrossing with
several hundreds of samples and markers involved.
The SNPlex Genotyping System (Applied Biosystems, Foster City, USA) uses OLA/PCR technology
for allelic discrimination and ligation product amplification. Genotype information is then
encoded into a universal set of dye-labelled, mobility modified fragments known as Zipchute
Mobility Modifiers, for rapid detection by capillary electrophoresis. The same set of
Zipchute Mobility Modifiers can be used for every SNPlex pool regardless of which SNPs are
chosen. The SNPlex System allows for multiplexed genotyping of up to 48 SNPs simultaneously
against a single sample with the ability to detect up to 4500 SNPs in parallel in 15 min.
This integrated system delivers cost-efficient, medium- to high-throughput genotyping and is
suitable for various genetic and breeding applications including fingerprinting, gene mapping
and MAS for both foreground and
background. Both SNaPshot and SNPlex can be used with capillary electrophoresis systems as
the genotyping platform, which can also be used for SSR genotyping.
MassARRAY iPLEX Gold (SEQUENOM, San Diego, USA) combines the benefits of the simple and
robust single-base primer extension biochemistry with the sensitivity and accuracy of
MALDI-TOF mass spectrometry (see Chapter 3) detection. It uses a single termination mix and
universal reaction conditions for all SNPs. The primer is extended, dependent upon the
template sequence, resulting in an allele-specific difference in mass between extension
products. The assays can be multiplexed up to 40 SNPs in a 384-well format, allowing for
throughput levels of up to 150,000 genotypes per instrument per day. MassARRAY is flexible
and suitable for generating both small and large marker numbers for each sample so that it
can be used for a variety of genetic and breeding purposes.
There are two major chip-based high-throughput genotyping systems, DNA microarrays developed
by Affymetrix (Santa Clara, USA) and a high-density biochip assay by Illumina Inc. (San
Diego, USA), both of which offer different levels of multiplexing, up to several thousand
plexes or more (Yan et al., 2009). As an increasing number of sets of these chips become
available, outsourced genotyping through companies or service centres becomes one of the
options for genotyping large numbers of samples using the same set of markers (e.g.
fingerprinting) to achieve high efficiency and low cost per data point.
THE FUTURE OF SNP TECHNOLOGY. A key technical obstacle in the development of microarray-based
methods for genome-wide SNP genotyping is the PCR amplification step which is required to
reduce the complexity and improve the sensitivity of genotyping SNPs in large, diploid
genomes. The level of complexity that can be achieved in PCR does not match that of current
microarray-based methods, thus making PCR the limiting step in these assays (Syvänen, 2005).
Highly multiplexed microarray systems have recently been developed by combining well-known
reaction principles for DNA amplification and SNP genotyping.
Identification of a specific single-base change among up to billions of bases that constitute
a plant species is a challenging task. PCR offers a means of reducing the complexity of a
genome and increasing the copy number of the DNA templates to the levels required for the
specific and sensitive detection of single-base changes. However, the design of robust PCR
assays with multiplexing levels exceeding 10–20 amplicons has proven to be more difficult
than initially anticipated because in multiplex PCR the number of undesired interactions
between the PCR primers increases exponentially as the number of primers included in the
reaction mixture increases. This interaction usually results in preferential amplification of
unwanted primer–dimer artefacts instead of the intended DNA templates (amplicons). Another
problem in multiplex PCR is the sequence-dependent differences in PCR efficiency between the
amplicons. The problems of multiplexing can be reduced to some extent by using PCR primers
that are as similar to one another as possible. The multiplexing level that can be readily
achieved in standard PCRs is less than that offered by current technology for producing
high-density DNA microarrays. Simultaneous analysis of a reasonable amount of genomic DNA
with the current detection sensitivity of microarray scanners requires an amplification step.
The PCR step complicates the molecular reactions underlying the assays and introduces
multiple laboratory steps into the procedures and is therefore the chief obstacle to highly
multiplexed SNP genotyping.
Diversity array technology
Diversity array technology (DArT) is a novel type of DNA marker which employs a microarray
hybridization-based technique developed by CAMBIA (http://www.diversityarrays.com) that
enables the simultaneous genotyping of several hundred polymorphic loci spread over the
genome (Jaccoud et al., 2001; Wenzl et al., 2004). DArT can be
[Figure 2.6 not reproduced. Labels recoverable from the diagram: (A) Gx, Gy, ..., Gn: DNAs of
interest. (B) Gx, Gy: choose two genomes to analyse; same complexity reduction as used to
make the diversity panel; hybridize to chip.]
Fig. 2.6. Procedure of diversity array technology (DArT). (A) Preparing the array. RE, restriction enzyme.
(B) Genotyping a sample.
DArT markers are biallelic and behave in a dominant (present versus absent) or co-dominant
(two doses versus one dose versus absent) manner. DArT detects single-base changes as well as
indels. It is a good alternative to currently used techniques including RFLP, AFLP, SSR and
SNP in terms of cost and speed of marker discovery and analysis for whole-genome
fingerprinting. It is a cost-effective, sequence-independent, non-gel based technology that
is amenable to high-throughput automation and the discovery of hundreds of high quality
markers in a single assay. An open source software package, DArTsoft, is available for
automatic data extraction and analysis. The weaknesses of this technology include marker
dominance and its technically demanding nature. Also there is some concern as to whether DArT
markers are randomly distributed across the whole genome, as DArT markers in barley appear to
have a moderate tendency to be located in hypomethylated, gene-rich regions in distal
chromosome areas (Wenzl et al., 2006).
DArT technology has been successfully developed for Arabidopsis, cassava, barley, rice,
wheat, sorghum, ryegrass, tomato and pigeon pea, while work is in progress to establish DArT
in chickpea, sugarcane, lupins, quinoa, banana and coconut (http://www.diversityarrays.com).
For example, a genetic map with 385 unique DArT markers spanning the 1137 cM barley genome
(Wenzl et al., 2004) was constructed, DArT markers along with AFLP and SSR markers were
mapped on the wheat genome (Semagn et al., 2006), and a cassava DArT genotyping array
containing approximately 1000 polymorphic clones (Xia, L. et al., 2005) is now available.
Genic and functional markers
DNA markers can be classified into random markers (RMs) (also known as anonymous or neutral
markers), gene targeted markers (GTMs) (also known as candidate gene markers) and functional
markers (FMs) (Anderson and Lübberstedt, 2003). RMs are derived at random from polymorphic
sites across the genome whereas GTMs are derived from polymorphisms within genes. FMs are
derived from polymorphic sites within genes that are causally associated with phenotypic
trait variation and are superior to RMs as a result of their complete linkage with trait
locus alleles and functional motifs (Anderson and Lübberstedt, 2003). The major drawback of
the RMs is that their predictive value depends on the known linkage phase between marker and
target locus alleles (Lübberstedt et al., 1998b).
Genetic diversity at or below the species level has mostly been characterized by molecular
markers that more or less randomly sampled genetic variation in the genome. RM is a very
effective tool, among others, for the establishment of a breeding system, the study of gene
flow among natural populations, and the determination of the genetic structure of genebank
collections (Chapter 5; Xu et al., 2005). RM systems are still the systems of choice for
marker-assisted breeding (Xu, Y., 2003). However, users of biodiversity are often not
interested in random variation but rather in variation that might affect the evolutionary
potential of a species or the performance of an individual genotype. Such functional
variation can be tagged with neutral molecular markers using quantitative trait loci (QTL)
and linkage disequilibrium mapping approaches. Alternatively, DNA-profiling techniques may be
used that specifically target genetic variation in functional parts of the genome.
GENIC MARKERS. A wealth of DNA sequence information from many fully characterized genes and
full-length cDNA clones has been generated and deposited in online databases for an
increasing number of plant species, and the sequence data for ESTs, genes and cDNA clones can
be downloaded from GenBank and scanned for identification of SSRs. Subsequently,
locus-specific primers flanking EST- or genic SSRs can be designed to amplify the
microsatellite loci present in the genes. In maize, for example, gene-derived SSR markers
that have been developed from genes and their primer sequences are available at
www.maizeGDB.org. Genic SSRs have some intrinsic advantages over
genomic SSRs because they can be obtained quickly by electronic sorting, are present in expressed regions of the genome and expected to be transferable across species (when the primers are designed from more conserved coding regions; Varshney et al., 2005a). The potential use of EST-SSRs developed for barley and wheat has been demonstrated for comparative mapping in wheat, rye and rice (Yu et al., 2004; Varshney et al., 2005a). These studies suggested that EST-SSR markers could be used in related species for which little information is available on SSRs or ESTs. In addition, the genic SSRs are good candidates for the development of conserved orthologous markers for the genetics and breeding of different species. For example, a set of 12 barley EST-SSRs was identified that showed significant homology with the ESTs of four monocotyledonous species (wheat, maize, sorghum and rice) and two dicotyledonous species (Arabidopsis and Medicago) which could potentially be used across these species (Varshney et al., 2005a).

Kumpatla and Mukopadhyay (2005) examined the abundance of SSR in more than 1.54 million ESTs belonging to 55 dicotyledonous species. They found that the frequency of ESTs containing SSR among species ranged from 2.65 to 16.82%, with dinucleotide repeats being most abundant followed by tri- or mononucleotide repeats, thus demonstrating the potential of in silico mining of ESTs for the rapid development of SSR markers for genetic analysis and application to dicotyledonous crops. However, EST-SSRs produce high quality markers but these are often less polymorphic than genomic SSRs (Cho et al., 2000; Eujayl et al., 2002; Thiel et al., 2003). EST resources are also being used to mine SNPs (Picoult-Newberg et al., 1999; Kota et al., 2003). ESTs provide a quantitative method of measuring specific transcripts within a cDNA library and represent a powerful tool for gene discovery, gene expression, gene mapping and the generation of gene profiles. The National Center for Biotechnology Information (NCBI) database, dbEST 0900409 (http://www.ncbi.nlm.nih.gov/dbEST_summary.html) contains the largest collection of ESTs in rice, wheat, barley, maize, soybean, sorghum and potato.

Novel markers can be developed from the transcriptome and specific genes. As summarized by Gupta and Rustgi (2004), these include EST polymorphisms (developed using EST databases); conserved orthologue set markers (developed by comparing the sequences of target genomes with sequences of the closely related species); amplified consensus genetic markers (based on the known genes from model species); gene-specific tags (with primers designed using gene sequences); resistance gene analogues (with primers designed to identify consensus domains conferring resistance); exon–retrotransposon amplification polymorphism (with primers designed to combine with a long terminal repeat retrotransposon-specific primer or a randomly selected microsatellite-containing oligonucleotide); and PCR-based markers targeting exons, introns and promoter regions of known genes with high specificity.

Target region amplification polymorphism (TRAP) markers are derived from a rapid and efficient PCR-based technique which uses bioinformatics tools and EST database information to generate polymorphic markers around targeted candidate gene sequences (Hu and Vick, 2003). This TRAP technique uses two primers of 18 nucleotides to generate markers. TRAP markers are amplified by one fixed primer designed from a target EST sequence in the database and a second primer of arbitrary sequence except for AT- or GC-rich cores that anneal with introns and exons, respectively. The TRAP technique should be useful in genotyping germplasm collections and in tagging genes with beneficial traits in crop plants.

FUNCTIONAL MARKERS. Functional markers (FMs) are derived from polymorphic sites within genes causally affecting phenotypic variation. The development of FMs requires allele-specific sequences of functionally characterized genes from which polymorphic, functional motifs affecting plant phenotype can be identified. Some theoretical and application issues relevant to functional markers in wheat have been addressed (Bagge et al., 2007; Bagge and Lübberstedt, 2008).
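The in silico mining of EST sequences for SSRs described above can be illustrated with a minimal Python sketch. It is not the procedure used in any of the cited studies: the function name `find_ssrs` and the minimum repeat numbers per motif length are assumptions chosen only for illustration, and only perfect (uninterrupted) repeats are reported.

```python
import re

def find_ssrs(seq, min_repeats=None):
    """Return (start, motif, repeat_count) for perfect SSRs in a DNA sequence."""
    # Hypothetical thresholds: minimum number of repeats for each motif length.
    if min_repeats is None:
        min_repeats = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in min_repeats.items():
        # Group 2 is the motif; it must be followed by at least min_rep - 1 copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(2)
            # Skip motifs that are repeats of a shorter unit (e.g. "ATAT" = "AT" x 2).
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((match.start(), motif, len(match.group(1)) // motif_len))
    return sorted(hits)

# Toy sequence (not real EST data): an (AG)8 and a (CTT)6 repeat.
example = "GGC" + "AG" * 8 + "TTCA" + "CTT" * 6 + "G"
ssrs = find_ssrs(example)
```

Primers would then be designed in the sequence flanking each reported start position; that step is outside the scope of this sketch.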
Genomic coverage | Low copy coding region | Whole genome | Whole genome | Whole genome | Whole genome
Amount of DNA required | 50–10 µg | 1–100 ng | 1–100 ng | 50–120 ng | 50 ng
Quality of DNA required | High | Low | High | Medium–high | High
Type of polymorphism | Single base changes, indels | Single base changes, indels | Single base changes, indels | Changes in length of repeats | Single base changes, indels
Level of polymorphism | Medium | High | High | High | High
Effective multiplex ratio | Low | Medium | High | High | Medium to high
Inheritance | Co-dominant | Dominant | Dominant/co-dominant | Co-dominant | Co-dominant
Type of probes/primers | Low copy DNA or cDNA clones | Usually 10 bp random nucleotides | Specific sequence | Specific sequence | Allele-specific PCR primers
Technically demanding | High | Low | Medium | Low | High
Radioactive detection | Usually yes | No | Usually yes | Usually no | No
Reproducibility | High | Low to medium | High | High | High
Time demanding | High | Low | Medium | Low | Low
Automation | Low | Medium | High | High | High
Development/start-up cost | High | Low | Medium | High | High
Proprietary rights required | No | Yes and licensed | Yes and licensed | Yes and some licensed | Yes and some licensed
Suitable utility in diversity, genetics and breeding | Genetics | Diversity | Diversity and genetics | All purposes | All purposes
telophase and cytokinesis, two new daughter cells are formed. Each of these daughter cells has half the chromosomes (n) of the parental cell (2n). The second meiotic division closely resembles mitosis with each of the nuclei generated during the first meiotic division splitting to form two more nuclei. Thus, four haploid gametes are produced.

Crossing over is the process by which homologous chromosomes exchange portions of their chromatids during meiosis, resulting in new combinations of genetic information and thus affecting inheritance and increasing genetic diversity. Genes that are present together on the same chromosome tend to be inherited together and are referred to as linked. Genes that are normally linked may be inherited independently during crossing over.

The proportion of recombinant gametes depends on the rate of crossover during meiosis and is known as the recombination frequency (r). The maximum proportion of recombinant gametes is 50% and in this case crossover between two genetic loci has occurred in all the cells. This is equivalent to the case of non-linked genes, i.e. the two loci are inherited independently.

The recombination frequency depends on the rate of crossovers which in turn depends on the linear distance between two genetic loci. Recombination frequencies range from 0 (complete linkage) to 0.5 (complete independent inheritance).

2.2.2 Genetic linkage mapping

In order to utilize the genetic information provided by molecular markers more efficiently, it is important to know the locations and relative positions of molecular markers on chromosomes. The construction of genetic linkage maps using molecular markers is based on the same principles as those used in the preparation of classical genetic maps: selection of molecular markers and genotyping system; selection of parental lines from the germplasm collection that are highly polymorphic at marker loci; development of a population or its derived lines with an increased number of molecular markers in the segregated population; genotyping each individual/line using molecular markers; and constructing linkage maps from the marker data.

The recombination frequency between two linked genetic markers can be defined in units of genetic distance known as centiMorgans (cM) or map units. If two markers are found to be separated in one of 100 progeny, those two markers are 1 cM apart. However, 1 cM does not always correspond to the same length of physical distance or the same amount of DNA. The amount of DNA per cM is referred to as the physical to genetic distance. Areas in the genome where recombination is frequent are known as recombination hot spots; there is relatively little DNA per cM in these hot spots and it can be as low as 200 kb/cM. In other areas recombination may be suppressed and 1 cM will represent more DNA and in some regions the physical to genetic distance can be up to 1500 kb/cM.

Developing mapping populations

In population development, several factors should be taken into consideration including the selection of parental lines and population types and the determination of population size.

CHOICE OF PARENTAL LINES. Four factors should be considered in selecting appropriate parental lines (Xu and Zhu, 1994):

1. DNA polymorphism: genetic polymorphism between parental lines usually depends on how closely related they are, which can be determined by criteria such as geographical distribution and morphological and isozyme polymorphisms. In general, DNA polymorphism is greater in open-pollinated species than in self-pollinated species. For example, RFLP polymorphism is very high among maize lines so that a population derived from any two inbred lines would be desirable for RFLP mapping. Genetic polymorphism is very low in tomato so that only interspecific populations are sufficiently polymorphic to allow for RFLP
[Figure: maximum distance between markers and average distance between markers (cM, 0–50) plotted against number of markers (0–2400).]
Fig. 2.8. Average and maximum distance expected between markers on a linkage map depending on number of random markers mapped for a genome with 1200 cM, e.g. 12 chromosomes of 100 cM each. The maximum distance curve is for 95% confidence level. From Tanksley et al. (1988) with kind permission of Springer Science and Business Media.
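The units introduced in this section (1 recombinant in 100 progeny = 1 cM; between 200 and 1500 kb of DNA per cM) and the average marker spacing plotted in Fig. 2.8 can be tied together with a small Python sketch. The function names are hypothetical. Treating recombination percentage as map distance is only reasonable for short intervals, and the genome length divided by the number of markers is a first approximation of the average spacing; the 95% maximum-distance curve of the figure is not reproduced here.

```python
def map_distance_cm(recombinants, progeny):
    """Map distance in cM, taking 1 recombinant per 100 progeny as 1 cM."""
    return 100.0 * recombinants / progeny

def average_spacing_cm(genome_length_cm, n_markers):
    """Approximate average distance between markers spread over the genome."""
    return genome_length_cm / n_markers

def physical_size_kb(distance_cm, kb_per_cm):
    """Physical size of a genetic interval for a given physical to genetic ratio."""
    return distance_cm * kb_per_cm

# A 1200 cM genome (as in Fig. 2.8) with 600 mapped markers.
spacing = average_spacing_cm(1200, 600)
# Physical size of that interval in a hot spot and in a suppressed region.
hot_spot_kb = physical_size_kb(spacing, 200)
cold_region_kb = physical_size_kb(spacing, 1500)
```

The same genetic interval therefore spans a several-fold different amount of DNA depending on the local recombination rate, which is why genetic and physical maps cannot be converted into each other with a single factor.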
Population | F2 | BC | DH (RIL)
Co-dominance | 1 M1M1 : 2 M1M2 : 1 M2M2 | 1 M1M2 : 1 M2M2 | 1 M1M1 : 1 M2M2
M1 is dominant | 3 M1_ : 1 M2M2 | 1 M1M2 : 1 M2M2 | 1 M1M1 : 1 M2M2
F2 gamete frequency: M1N1, (1 − r)/2; M1N2, r/2; M2N1, r/2; M2N2, (1 − r)/2.
Fig. 2.10. Theoretical ratios in an F2 population derived from two parents M1M1N1N1 and M2M2N2N2 with recombinant frequency r.
Fig. 2.11. Genotypes and their frequencies for three linkage combinations at two loci in F2 populations (each frequency divided by 4).
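As a sketch of how the gamete frequencies of Fig. 2.10 lead to two-locus genotype frequencies in the F2, the following Python function combines every pair of gametes at random. The function name and the genotype labels (for example `M1M2N1N2`) are illustrative choices, and coupling phase (parents M1M1N1N1 and M2M2N2N2) is assumed, as in the figure caption.

```python
from itertools import product

def f2_genotype_frequencies(r):
    """Expected F2 genotype frequencies at two loci for recombinant frequency r."""
    # Gamete frequencies for parents M1M1N1N1 x M2M2N2N2 (coupling phase).
    gametes = {
        ("M1", "N1"): (1 - r) / 2,
        ("M1", "N2"): r / 2,
        ("M2", "N1"): r / 2,
        ("M2", "N2"): (1 - r) / 2,
    }
    freqs = {}
    # Random union of two gametes; allele order within a locus is ignored.
    for (g1, f1), (g2, f2) in product(gametes.items(), repeat=2):
        m_genotype = "".join(sorted((g1[0], g2[0])))
        n_genotype = "".join(sorted((g1[1], g2[1])))
        key = m_genotype + n_genotype
        freqs[key] = freqs.get(key, 0.0) + f1 * f2
    return freqs

linked = f2_genotype_frequencies(0.1)
unlinked = f2_genotype_frequencies(0.5)
```

With r = 0.5 the nine classes fall back to the independent 1:2:1 × 1:2:1 expectation, while smaller values of r inflate the parental classes.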
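\[ \chi^2_T = \frac{4}{n}\left\{ n_5^2 + 2(n_2^2 + n_4^2 + n_6^2 + n_8^2) + 4(n_1^2 + n_3^2 + n_7^2 + n_9^2) \right\} - n, \qquad df_T = 8 \]

\[ \chi^2_M = \frac{4}{3n}\left( (n_1 + n_3 + n_5)^2 + 3(n_2 + n_4 + n_6)^2 \right) - n, \qquad df_M = 1 \]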
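The partition of the total chi-square into segregation and linkage components can be checked numerically. The sketch below is an illustration rather than the book's own procedure: it uses the general Pearson form, the sum of (observed − expected)²/expected, which is algebraically the same as the shortcut formulas, and applies it to the nine class counts of the (1:2:1)-(1:2:1) worked example (n = 132). The function name and the row/column layout of `counts` are assumptions.

```python
def linkage_chi_squares(counts):
    """Chi-square components for a (1:2:1)-(1:2:1) F2 table.

    counts[i][j] holds the number of individuals with M genotype i and
    N genotype j, each ordered as (11, 12, 22).
    """
    n = sum(sum(row) for row in counts)
    ratio = (1, 2, 1)

    def pearson(observed, weights):
        total = sum(weights)
        expected = [n * w / total for w in weights]
        return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

    joint_observed = [c for row in counts for c in row]
    joint_weights = [a * b for a in ratio for b in ratio]  # 1:2:1:2:4:2:1:2:1
    chi_total = pearson(joint_observed, joint_weights)              # df = 8
    chi_m = pearson([sum(row) for row in counts], ratio)            # df = 2
    chi_n = pearson([sum(col) for col in zip(*counts)], ratio)      # df = 2
    return {"total": chi_total, "M": chi_m, "N": chi_n,
            "linkage": chi_total - chi_m - chi_n}                   # df = 4

# Class counts n1..n9 of the worked example, arranged as a 3 x 3 table.
result = linkage_chi_squares([[27, 6, 1], [5, 56, 4], [0, 3, 30]])
```

Small marginal components together with a large linkage component indicate normal Mendelian segregation at both loci combined with linkage between them.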
\[ \chi^2_N = \frac{2}{n}\left( 2(n_1 + n_2)^2 + (n_3 + n_5)^2 + 2(n_4 + n_6)^2 \right) - n, \qquad df_N = 2 \]

\[ \chi^2_L = \chi^2_T - \chi^2_A - \chi^2_B, \qquad df_L = 2 \]

For linkage combination (3:1)-(3:1)

\[ \chi^2_M = \frac{1}{3n}(n_1 + n_2 - 3n_3 - 3n_4)^2, \qquad df_M = 1 \]

\[ \chi^2_T = \frac{4}{132}\left\{ 56^2 + 2(6^2 + 5^2 + 4^2 + 3^2) + 4(27^2 + 1^2 + 0^2 + 30^2) \right\} - 132 = 165.818 \]

\[ \chi^2_M = \frac{2}{132}\left\{ 2(27 + 6 + 1)^2 + (5 + 56 + 4)^2 + 2(0 + 3 + 30)^2 \right\} - 132 = 0.045 \]

\[ \chi^2_N = \frac{2}{132}\left\{ 2(27 + 5 + 0)^2 + (6 + 56 + 3)^2 + 2(1 + 4 + 30)^2 \right\} - 132 = 0.167 \]

\[ \chi^2_L = 165.818 - 0.045 - 0.167 = 165.606 \]

Fig. 2.12. Data example used for test of linkage for (1:2:1)-(1:2:1) in an F2 population.

We have

\[ \chi^2_T > \chi^2_{0.05(8)} = 15.5, \quad \chi^2_M < \chi^2_{0.05(2)} = 5.99, \quad \chi^2_N < \chi^2_{0.05(2)} = 5.99, \quad \chi^2_L > \chi^2_{0.05(4)} = 9.49 \]

which indicates that both loci M and N show normal Mendelian segregation and are linked.

We have $p_1$ (M1_N1_) $= (3 - 2r + r^2)/4$, $p_2$ (M1_N2N2) $= p_3$ (M2M2N1_) $= (2r - r^2)/4$, $p_4$ (M2M2N2N2) $= (1 - 2r + r^2)/4$, and $\sum p_i = 1$. Considering the number of individuals observed for each category, $n_1$, $n_2$, $n_3$ and $n_4$, and $\sum n_i = n$, they have a probability distribution of $(p_1 + p_2 + p_3 + p_4)^n$. For a specific set of observations ($n_1$, $n_2$, $n_3$ and $n_4$), the likelihood function is:

\[ L(r) = \frac{n!}{n_1!\,n_2!\,n_3!\,n_4!}\,(p_1)^{n_1}(p_2)^{n_2}(p_3)^{n_3}(p_4)^{n_4} = \frac{n!}{n_1!\,n_2!\,n_3!\,n_4!}\,(1/4)^n\,(3 - 2r + r^2)^{n_1}(2r - r^2)^{n_2 + n_3}(1 - 2r + r^2)^{n_4} \]

The natural logarithm of L(r) is called support or log-likelihood. Here we have

\[ \ln L(r) = C + n_1 \ln(3 - 2r + r^2) + (n_2 + n_3)\ln(2r - r^2) + n_4 \ln(1 - 2r + r^2) \]
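where

\[ C = \ln\frac{n!}{n_1!\,n_2!\,n_3!\,n_4!} + n\ln(1/4) \]

is a constant.

The first derivative is the slope of a function, and the slope is zero at a maximum or minimum (local or global). The derivative of the log-likelihood with respect to r is therefore set to zero:

\[ \frac{d \ln L(r)}{dr} = 0 \]

The derivative of ln L(r) is usually denoted as the score, S:

\[ S = -n_1\frac{2(1 - r)}{3 - 2r + r^2} + (n_2 + n_3)\frac{2(1 - r)}{2r - r^2} - n_4\frac{2(1 - r)}{1 - 2r + r^2} = 0 \]

That is

\[ -\frac{n_1}{3 - 2r + r^2} + \frac{n_2 + n_3}{2r - r^2} - \frac{n_4}{1 - 2r + r^2} = 0 \]

\[ -\frac{n_1}{2 + (1 - r)^2} + \frac{n_2 + n_3}{1 - (1 - r)^2} - \frac{n_4}{(1 - r)^2} = 0 \]

If $(1 - r)^2 = k$, then

\[ -\frac{n_1}{2 + k} + \frac{n_2 + n_3}{1 - k} - \frac{n_4}{k} = 0 \]

therefore

\[ k = \frac{(n_1 - 2n_2 - 2n_3 - n_4) + \sqrt{(n_1 - 2n_2 - 2n_3 - n_4)^2 + 8nn_4}}{2n} \]

and the MLE is

\[ \hat{r} = 1 - \sqrt{k} \]

According to the Cramér–Rao inequality, the sampling variance of r is

\[ \frac{1}{V_r} = -E\left\{ \frac{d^2[\ln L(r)]}{dr^2} \right\} = I \]

where $d^2[\ln L(r)]/dr^2$ is the second derivative:

\[ \frac{d^2[\ln L(r)]}{dr^2} = -\sum_i^k \frac{n_i}{p_i^2}\left(\frac{dp_i}{dr}\right)^2 + \sum_i^k \frac{n_i}{p_i}\frac{d^2 p_i}{dr^2} \]

\[ E\left\{ \frac{d^2[\ln L(r)]}{dr^2} \right\} = -n\sum_i^k \frac{1}{p_i}\left(\frac{dp_i}{dr}\right)^2 + n\sum_i^k \frac{d^2 p_i}{dr^2} = -n\sum_i^k \frac{1}{p_i}\left(\frac{dp_i}{dr}\right)^2 \]

Because

\[ \sum_i^k \frac{d^2 p_i}{dr^2} = \frac{d^2}{dr^2}\sum_i^k p_i = 0, \]

\[ \frac{1}{V_r} = n\sum_i^k \frac{1}{p_i}\left(\frac{dp_i}{dr}\right)^2 = n\sum_i^k i_i = I \]

where I is the total information content and $\sum i_i = I/n$ is the information derived from a single observation.

From the above formula, the variance of r can be calculated using the information provided in Table 2.3.

To estimate k, the values of $n_i$ listed in the table are used in the formula:

\[ k = \frac{1927 + \sqrt{1927^2 + 8 \times 6952 \times 1338}}{2 \times 6952} = 0.7743 \]

\[ \hat{r} = 1 - \sqrt{0.7743} = 0.1201 \]

\[ V_r = 1.76702 \times 10^{-5} \]

Thus,

\[ \hat{r} = 0.1201 \pm \sqrt{1.76702 \times 10^{-5}} = 12.01\% \pm 0.42\% \]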
Table 2.3. Calculation of the variance of recombinant frequency for two linked loci each with complete dominance.

Group | $n_i$ | $p_i$ | $dp_i/dr$ | $i_i = \frac{1}{p_i}\left(\frac{dp_i}{dr}\right)^2$
M1_N1_ | 4831 | $(3 - 2r + r^2)/4$ | $-2(1 - r)/4$ | $i_1 = \frac{(1 - r)^2}{3 - 2r + r^2}$
M1_N2N2 | 390 | $(2r - r^2)/4$ | $2(1 - r)/4$ | $i_2 = \frac{(1 - r)^2}{2r - r^2}$
M2M2N1_ | 393 | $(2r - r^2)/4$ | $2(1 - r)/4$ | $i_3 = \frac{(1 - r)^2}{2r - r^2}$
M2M2N2N2 | 1338 | $(1 - r)^2/4$ | $-2(1 - r)/4$ | $i_4 = \frac{4(1 - r)^2}{4(1 - r)^2} = 1$
Total | 6952 = n | 1 | 0 | $\sum i_i = \frac{(1 - r)^2}{3 - 2r + r^2} + \frac{2(1 - r)^2}{2r - r^2} + 1$
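The estimate of r and its variance for the (3:1)-(3:1) case can be reproduced with a few lines of Python. This is a sketch under the definitions used above, with a hypothetical function name: multiplying the score equation in k through by k(1 − k)(2 + k) gives the quadratic n k² − (n1 − 2n2 − 2n3 − n4)k − 2n4 = 0, whose positive root is used, and the information per observation is the total of the last column of Table 2.3 written in terms of k = (1 − r)².

```python
import math

def mle_recombination_3_1(n1, n2, n3, n4):
    """MLE of r and its sampling variance for two linked dominant loci in an F2."""
    n = n1 + n2 + n3 + n4
    b = n1 - 2 * n2 - 2 * n3 - n4
    # Positive root of n*k^2 - b*k - 2*n4 = 0, where k = (1 - r)^2.
    k = (b + math.sqrt(b * b + 8 * n * n4)) / (2 * n)
    r = 1 - math.sqrt(k)
    # Information from a single observation: i1 + i2 + i3 + i4 of Table 2.3.
    info_per_observation = k / (2 + k) + 2 * k / (1 - k) + 1
    variance = 1 / (n * info_per_observation)
    return k, r, variance

# Class counts from Table 2.3.
k, r_hat, v_r = mle_recombination_3_1(4831, 390, 393, 1338)
standard_error = math.sqrt(v_r)
```

With the counts of Table 2.3 this gives k close to 0.7743, an estimate of r close to 0.12 and a standard error of about 0.0042, in line with the values quoted in the text.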
This is an example of (3:1)-(3:1) linkage combination. Allard (1956) derived formulas for r and $V_r$ for almost all possible linkage combinations and for different populations.

Likelihood ratio and linkage test

In human genetics the linkage phase (repulsion or coupling) is usually unknown thus making it impossible to calculate recombinant frequency based on the observable recombinants. As a result, likelihood ratios or odds ratios (Fisher, 1935; Haldane and Smith, 1947; Morton, 1955) have been used for linkage testing. The method is based on the comparison of the probability that observed data follow a hypothesis, for example two linked loci, and the alternative hypothesis, two independent loci. The ratio of the two probabilities L(r)/L(1/2) is tested as follows: r = 1/2 is entered into the likelihood function (equation (a)).

\[ L(1/2) = \frac{n!}{n_1!\,n_2!\,n_3!\,n_4!}\,(1/4)^n\,(2.25)^{n_1}(0.75)^{n_2 + n_3}(0.25)^{n_4} \qquad \text{(a)} \]

To simplify the calculation, the log base 10 of the ratio L(r)/L(1/2), known as LOD, is used

\[ \mathrm{LOD} = \log_{10}\frac{L(r)}{L(1/2)} \]

With n = 6952, $n_1$ = 4831, $n_2$ = 390, $n_3$ = 393, and $n_4$ = 1338, logarithm of odds (LOD) scores can be calculated for different r values (see (b)). The result indicates that LOD scores vary with r and reach the maximum when r = 0.12.

If M and N are linked, L(r)/L(1/2) > 1, and thus LOD is positive. When L(r)/L(1/2) < 1, LOD is negative.

In human genetics the likelihood ratio should be greater than 1000:1, i.e. LOD > 3 in order to establish linkage unequivocally. The concept of the likelihood ratio is now widely used in genetic mapping of other organisms including plant species to judge the reliability of linkage estimation and to verify its existence.
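The LOD calculation for this (3:1)-(3:1) example can be sketched in Python as follows. In the ratio L(r)/L(1/2) the multinomial coefficient and the factor (1/4)^n cancel, so only the three r-dependent terms remain. The function name and the 0.01 grid of r values are illustrative assumptions, not part of the original analysis.

```python
import math

def lod_3_1(r, n1, n2, n3, n4):
    """log10 of L(r)/L(1/2) for two dominant loci in an F2 population."""
    return (n1 * math.log10((3 - 2 * r + r * r) / 2.25)
            + (n2 + n3) * math.log10((2 * r - r * r) / 0.75)
            + n4 * math.log10((1 - r) ** 2 / 0.25))

counts = (4831, 390, 393, 1338)
# Evaluate LOD on a grid of r values from 0.01 to 0.49.
grid = [i / 100 for i in range(1, 50)]
lod_scores = {r: lod_3_1(r, *counts) for r in grid}
best_r = max(lod_scores, key=lod_scores.get)
```

On this grid the LOD score peaks at r = 0.12 and is far above the threshold of 3, while at r = 0.5 the score is zero by construction.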
Multi-point analysis and ordering a set of markers

The methods discussed above are all based on two-point analysis using two markers at a time. However, when more than two markers from one chromosome are considered, they can theoretically be arranged in many different orders but only one particular order will match the genetic order on the chromosome and this particular order can be determined by multi-point analysis.

Consider m genetic markers, $M_1, M_2, \ldots, M_m$, ordered by their real locations on a chromosome. For m genetic markers, there are a total of m!/2 possible orders. Assume the recombinant frequency between two flanking markers, $M_i$ and $M_{i+1}$, is $r_i$. The objective is to find $r_1, r_2, \ldots, r_{m-1}$ to maximize the likelihood L(r),

\[ L(r) \propto p_1(r_1, r_2, \ldots, r_{m-1})^{n_1}\, p_2(r_1, r_2, \ldots, r_{m-1})^{n_2} \cdots p_m(r_1, r_2, \ldots, r_{m-1})^{n_m} \]

Using the natural logarithm, the partial derivatives with respect to $r_1, r_2, \ldots, r_{m-1}$ are then set to zero. The EM algorithm (Dempster et al., 1977) can be used to obtain the MLE for $r_1, r_2, \ldots, r_{m-1}$, which involves multiple iteration steps of Expectation (E) and Maximization (M). The multiple steps include: (i) providing an initial set of estimates, $r^{old} = (r_1, r_2, \ldots, r_{m-1})$; (ii) using the initial estimates as the estimates of recombinant frequencies to obtain the E, i.e. the expected numbers of recombinants and non-recombinants in each marker interval; (iii) using these expected values as true values to obtain the MLE for $r^{new} = (r_1, r_2, \ldots, r_{m-1})$; (iv) repeating steps (ii) and (iii) until the MLE has converged to its maximum.

Lander and Green (1987) provided an example of the EM method for multi-point linkage analysis. Using 15 marker intervals on human chromosome 7 determined by 16 markers and initial recombinant frequencies of $r_i$ = 0.05, the log-likelihood was found to be −351.45. To reduce the difference of log-likelihoods between two consecutive iterations to less than a given critical value (tolerance value, T = 0.01), 12 iterations were needed which resulted in convergence at log-likelihood −303.28. The probability of the observed data at the converged iteration is $10^{-303.28 - (-351.45)} = 10^{48}$ times higher than that for the initial $r_i$ = 0.05.

Linkage mapping in the presence of genotyping errors

As generating marker data is time consuming and expensive, maximum use should be made of the information generated. Without accounting for genotyping errors, each error in a non-terminal marker causes two apparent recombinations in the dataset. Thus every 1% error rate in a marker adds 2 cM of inflated distance to the map. If there is an average of one marker every 2 cM, then an average of a 1% error rate will double the size of the map. There will be large distances between adjacent markers with very high error rates. These cases can be detected, either manually or automatically, and the markers removed. Such genotyping errors can be identified by simply sorting the marker data by a given linkage order to determine whether there are a large number of crossovers involved.

For the markers with low error levels that cannot be detected easily, the best strategy is to integrate error detection with the map-building procedure. Cartwright et al. (2007) extended the traditional likelihood model used for genetic mapping to include the possibility of genotyping errors. Each individual marker is assigned an error rate which is inferred from the data as are the genetic distances. A software package, TMAP, was developed to use this model to identify maximum-likelihood maps for phase-known pedigrees. The methods were tested using a data set in Vitis and a simulated data set, which confirmed that the method dramatically reduced the inflationary effect caused by increasing the number of markers and resulted in more accurate orders.

Molecular maps in plants

Table 2.4 lists some representative molecular maps that have been developed for major crop plants including legumes, cereals and clonal crops, which vary in marker density,
Table 2.4. Representative genetic maps in plants.

Crop | Marker types; population | Map features | Reference
Azuki bean | SSR, RFLP, AFLP; 187 BC1F1 (JP81481 × Vigna nepalensis) | 486 markers mapped into 11 linkage groups spanning 832.1 cM with an average marker distance of 1.85 cM, 95% genome coverage | Han et al. (2005)
Barley | AFLP, SSR, STS and vrs1; 95 RILs (Russia 6 × H.E.S. 4) | 1172 markers with a total distance of 1595.7 cM, and average marker density of 1.4 cM per locus | Hori et al. (2003)
Barley | SNP, SSR, RFLP, AFLP; three DH populations | 1237 loci based on three mapping populations, with a total map length of 1211 cM and an average marker density of 1 locus per cM | Rostoks et al. (2005)
Lettuce | AFLP, RFLP, SSR, RAPD; seven inter- and intraspecific populations | 2744 markers assigned to nine linkage groups that spanned a total of 1505 cM. The mean interval between markers is 0.7 cM | Truco et al. (2007)
Maize | SSR markers; one intermated RIL (IBM) and two immortalized F2s | The IBM map: 748 SSR and 184 RFLP markers with a total map length of 4906 cM; two immortalized F2 maps: 457 and 288 SSR markers with total map length of 1830 and 1716 cM, respectively | Sharopova et al. (2002)
Maize | cDNA probes; two RIL populations: IBM (B73 × Mo17) and LHRF (F2 × F252) | Framework maps: 237 and 271 loci in IBM and LHRF populations; both maps together contain 1454 loci (1056 on IBM_Gnp2004 and 398 on LHRF_Gnp2004) corresponding to 954 cDNA probes | Falque et al. (2005)
Oat | RFLP, AFLP, RAPD, STS, SSR, isozyme, morphological; 136 F6:7 RIL (Ogle × TAM O-301) | 426 loci (with 243 loci each) spanning 2049 cM of the oat genome | Portyanko et al. (2001)
Pearl millet | RFLP and SSR; four populations | A consensus genetic map: 353 RFLP and 65 SSR markers, marker density in four maps ranged from 1.49 cM to 5.8 cM | Qi et al. (2004)
Potato | AFLP markers; heterozygous diploid potato | > 10,000 AFLP loci, with marker density proportional to physical distance and independent of recombination frequency | van Os et al. (2006)
Rice | 726 markers; 113 BC1 (BS125 × WL02) × BS125 | 726 markers with a total distance of 1491 cM and average marker density of 4.0 cM on the framework map, and 2.0 cM overall | Causse et al. (1994)
Rice | 2275 markers; 186 (Nipponbare × Kasalath) F2 | 2275 markers with a total distance of 1521.6 cM, and average marker density of 0.67 cM per locus | Harushima et al. (1998)
Sorghum | 2590 PCR-based markers and 137 RIL (BTx623 × IS3620C) | The 1713 cM map encompassed 2926 loci | Menz et al. (2002)
Sorghum | RFLP probes; 65 F2 (Sorghum bicolor × Sorghum propinquum) | The S. bicolor × S. propinquum map is composed of 2512 loci, spanning 1059.2 cM, a marker per 0.4 cM | Bowers et al. (2003a)
Sweet potato | AFLP; (Tanzania × Bikilamaliya) F2 population | 632 (Tanzania) and 435 (Bikilamaliya) AFLP markers, with a total of 3655.6 cM and 3011.5 cM, and a marker per 5.8 cM and 6.9 cM, respectively | Kriegner et al. (2003)
Wheat | SSR and DArT markers; 152 RILs from a cross between durum wheat and wild emmer wheat | 14 linkage groups, 690 loci (197 SSR and 493 DArT markers), spanning 2317 cM, a marker per 7.5 cM | Peleg et al. (2008)
and genomic coverage. For example, crops such as barley, maize, potato, rice, sorghum and wheat have high-density genetic maps while cassava, Musa, oat, pearl millet, sweet potato and yam have less saturated maps. The large variation in map length results from differences in the number of chromosomes and total size of the genomes as well as from the use of different numbers of markers (increasing the number of markers will generally give a larger total map length up to a certain threshold), the inclusion of skewed markers (that tend to exaggerate map distances) and the use of different mapping software (which vary in estimates of genetic distances). In addition, many published maps report more linkage groups than the basic chromosome number of that species. This is frequently the result of insufficient marker density as most saturated maps can be directly aligned with the basic chromosome complement (Tekeoglu et al., 2002).

The sophistication of molecular map construction has developed from the RFLP maps of the 1980s to PCR-based markers of the 1990s to more integrated maps, as a result of the use of different types of molecular markers including genic markers, over the past decade. Linkage maps have been used in gene mapping for major genes and QTL (Chapters 6 and 7), MAS (Chapters 8 and 9) and map-based gene cloning (Chapter 11).

2.2.3 Integration of genetic maps

Integration of conventional and molecular maps

During the period 1980–1990 molecular maps were developed for many plant species. The first generation of molecular maps have been integrated with conventional genetic maps constructed using morphological and isozyme markers through cytological markers and markers shared by different maps. The 12 molecular linkage groups in rice (McCouch et al., 1988) were assigned to classical linkage groups using trisomics for each of the 12 rice chromosomes. Shared markers and those which segregate in the population can be integrated with the molecular linkage map by using the same population for both conventional and molecular markers. As only very few morphological markers can segregate simultaneously in one population, integration of many of these markers requires multiple populations each with an available preliminary molecular map. If a complete linkage map for morphological markers is available, the positions of these markers relative to molecular markers can be inferred from the linkage relationship revealed by both morphological and molecular markers. In addition, morphological markers, including some traits of agronomic importance, can be mapped much more precisely if they are integrated with a dense molecular map and this has now become an integral step in trait and gene mapping. Integration of conventional and molecular maps has been very successful for crop plants for which relatively complete genetic linkage maps are available as a result of the use of morphological markers.

Some representative examples of such maps include rice, maize, tomato and soybean. In rice, 39 morphological markers and 82 RFLP markers were mapped together based on the segregation analysis of 19 F2 populations derived from the crosses between indica cultivar IR24 and japonica lines with different morphological markers (Ideta et al., 1996). In tomato, a number of morphological and isozyme markers were mapped with respect to RFLP markers by orienting the molecular linkage map to both morphological and cytological maps. An integrated high-density RFLP-AFLP map of tomato based on two independent Lycopersicon esculentum × Lycopersicon pennellii F2 populations was constructed (Haanstra et al., 1999), which spanned 1482 cM and contained 67 RFLP and 1175 AFLP markers. Integrated maps were also developed for maize (Neuffer et al., 1997; Lee et al., 2002) and soybean (Cregan et al., 1999).

Integration of multiple molecular maps

For many crop plants, several molecular maps have been constructed using different populations. These populations are of
variable size and structure and maps have been created using different numbers and types of markers. To build an integrated reference or consensus map, the order and genetic distance between specific markers is compared across populations and maps. Stam (1993) developed a computer program, JOINMAP, for the construction of genetic linkage maps for several types of mapping populations: BC1, F2, RILs, DHs and outbreeder full-sib family. JOINMAP can be used to combine (join) data derived from several sources into an integrated map.

For each crop all the molecular maps developed from different populations will finally be integrated into a consensus map. This process has been very successful for several major crops and it can be expected that it will be extended to all crops when sufficient maps become available. In wheat, an SSR consensus map was constructed by fusing several genetic maps to maximize the integration of genetic mapping information from different sources (Somers et al., 2004). In cotton, chromosome identities were assigned to 15 linkage groups in the RFLP joinmap developed from four intraspecific cotton (Gossypium hirsutum L.) populations with different genetic backgrounds (Ulloa et al., 2005). In maize, two populations of intermated RILs (IRILs) were used to build a consensus map, the first panel (IBM) was derived from B73 × Mo17 and the second panel (LHRF) from F2 × F252. Framework maps of 237 loci were built from the IBM panel and 271 loci from the LHRF panel. Both maps were used to locate 1454 loci (1056 on map IBM_Gnp2004 and 398 on map LHRF_Gnp2004) that corresponded to 954 previously unmapped cDNA probes (Falque et al., 2005). In barley, Wenzl et al. (2006) built a high-density consensus linkage map from the combined data sets of ten populations, most of which were simultaneously typed with DArT and SSR, RFLP and/or STS markers. The map comprised 2935 loci (2085 DArT, 850 other loci), spanned 1161 cM and contained a total of 1629 bins (unique loci). The arrangement of loci was very similar to, and almost as optimal as, the arrangement of loci in component maps created for individual populations.

Integration of genetic and physical maps

Integrated genetic and physical genome maps are extremely valuable for map-based gene isolation, comparative genome analysis and as sources of sequence-ready clones for genome sequencing projects. A well-defined correlation between the physical and genetic maps will greatly facilitate molecular breeding efforts through associating candidate genes with important biological or agronomic traits, positional cloning and comparative analysis across populations and species, and whole genome sequences, which will in turn facilitate the development of various molecular breeding tools.

Various methods have been developed for assembling physical maps of complex genomes and integrating them with genetic maps. To create an integrated genetic and physical map resource for maize, a comprehensive approach was used that included three core components (Cone et al., 2002). The first was a high-resolution genetic map that provided essential genetic anchor points for ordering the physical map and for utilizing comparative information from other smaller genome plants. The physical map component consisted of contigs (sets of overlapping fingerprint clones) assembled from clones from three deep-coverage genomic libraries. The third core component was a set of informatics tools designed to analyse, search and display the mapping data. In rice, most of the genome (90.6%) was anchored genetically by overgo hybridization, DNA gel blot hybridization and in silico anchoring (Chen et al., 2002). In wheat, the genetic–physical map relationship of microsatellite markers was established using the deletion bin system (Sourdille et al., 2004). In sorghum, Klein et al. (2000) developed a high-throughput PCR-based method for building bacterial artificial chromosome (BAC) contigs and locating BAC clones on the genetic map in order to construct an integrated genetic and physical map. It was found that 30% of the overlapping BACs aligned by AFLP analysis provided information for merging contigs and singletons that could not
58 Chapter 2
be joined using fingerprint data alone. In automated matching of BACs were then
the grasses Lolium perenne and Festuca anchored on to IBM2 and IBM2 neighbour
pratensis, the physical map was integrated maps. In the Gramene database, a web-
with a genetic map using genomic in situ based tool, CMAP, was developed to allow
hybridization, which was composed of 104 users to view comparisons of genetic and
F. pratensis-specific AFLPs. The integrated physical maps (Ware et al., 2002). In addi-
map demonstrated the large-scale analy- tion, an integrated bioinformatic tool, the
sis of the physical distribution of AFLPs Comparative Map and Trait Viewer (CMTV),
and variation in the relationship between was developed to construct consensus
genetic and physical distance from one part maps and compare QTL and functional
of the F. pratensis chromosome to another genomics data across genomes and exper-
(King et al., 2002). iments (Sawkins et al., 2004). All these
An integrated genetic and physi- tools can be used to build integrated maps
cal mapping tool has been developed by based on shared markers and a reference
the Maize Mapping Project, Columbia, map to initiate the process. The integra-
Missouri, USA (http://www.maizemap. tion of genetic, cytological and physical
org/iMapDB/iMap.html). Contigs that maps is illustrated in the example shown
were assembled by fingerprinting and the in Fig. 3.6.
3
Molecular Breeding Tools:
Omics and Arrays
The success of molecular breeding depends upon the various tools that can be used for the efficient manipulation of genetic variation. All kinds of omics, arrays and high-throughput technologies make it possible to carry out larger-scale genetic analyses and breeding experiments than ever before. These technologies have been incorporated into many novel genetic and breeding processes, some of which were described in Chapter 2. In this chapter, microarrays, high-throughput technologies and several aspects of genomics will be briefly discussed to provide some of the fundamental knowledge required for molecular breeding.
3.1 Molecular Techniques in Omics
Developments in molecular techniques have contributed to the various fields of omics, which include genomics, transcriptomics, proteomics, metabolomics and phenomics. These underlying developments include advanced gel, hybridization and expression systems, cell imaging by light and electron microscopy, high density microarrays and array experiments, and genetic readout experiments.
Using proteomics as an example, classical techniques used in proteomics involve the use of two-dimensional gel electrophoresis (2DE). The proteins can be identified by excising the spot from the gel, digesting the polypeptide into smaller peptide fragments with specific proteases, and sequencing the peptides directly or analysing them by mass spectrometry (MS). Although this method is still useful and widely used, it is limited in sensitivity, resolution, and the range of abundance of the different proteins in the sample (Zhu et al., 2003; Baginsky and Gruissem, 2004). For example, abundant proteins in the sample dominate the gel whereas less abundant proteins might not be visible. New approaches involve both improved separation methods and advanced detection equipment, and several other new technologies are available for use in proteomic research (Kersten et al., 2002; Zhu et al., 2003; De Hoog and Mann, 2004). New detection methods and proteomic technologies are also being developed in an array format, which is increasingly being focused on protein–protein interactions, post-transcriptional modification, and elucidation of three-dimensional protein structure.
3.1.1 2-Dimensional gel electrophoresis
2DE is a form of gel electrophoresis commonly used to analyse proteins. Mixtures of proteins are separated by two properties in two dimensions in 2DE. During the early years of proteomics and until recently, profiling of protein expression relied primarily on the use of two-dimensional polyacrylamide gel electrophoresis (2D PAGE), which was later combined with MS. The basic procedure is to solubilize the protein contents of an entire cell population, tissue or biological fluid, followed by separation of the protein components in the lysate using 2DE and visualization of the separated proteins with silver staining. This approach allows only a limited display of the total protein content and can identify only the relatively abundant proteins.
2DE begins with one-dimensional electrophoresis and then separates the molecules by a second property in a direction at 90° to the first. In this technique proteins are separated in one dimension by isoelectric point and in the second dimension by mass. In one-dimensional electrophoresis, proteins (or other molecules) are separated in one dimension, so that all the proteins/molecules in one lane will be separated from one another according to the differences in a particular property (e.g. isoelectric point) between each component. The result is a gel with proteins separated out on its surface (Fig. 3.1a). The proteins can then be visualized by a variety of staining methods; the most commonly used stains are silver nitrate and Coomassie blue. By combining electrophoresis with MS, individual proteins can be profiled (Fig. 3.1b, c) and theoretical and acquired MS profiles can be matched by a database search.
An important development in 2D PAGE is the use of immobilized pH gradients
[Fig. 3.1 graphic not reproduced. Recoverable labels: panel (a) gel with pI and molecular weight axes, followed by trypsin digestion into peptides; peptide separation over time; spectra for the peptide LLEAAAQSTK, 516.27 (2+), with fragment ions such as y7 and y8.]
Fig. 3.1. Standard protein analysis by two-dimensional electrophoresis followed by mass spectrometry proteomics. (a) Proteins are separated by two-dimensional electrophoresis: in one dimension by isoelectric point (pI) and in the second dimension by mass (molecular weight). Individual peptides are obtained using trypsin to cleave peptide chains. (b) Peptides are separated by chromatography and then ionized using electrospray ionization (ESI): they pass through the first quadrupole (q1) and collision chamber (q2). (c) Individual ions are separated based on their mass-to-charge ratio (m/z) by a mass analyser. (d) From the MS spectrum, an individual peptide ion (516.27 (2+)) is selected for MS/MS analysis to produce peptide ion fragmentation patterns. Letters S, Q, A, A, E, L and L represent amino acids in the selected peptide and a2, b2, y3, etc. represent different ions.
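Matching an acquired peptide ion against a theoretical one, as in panel (d), comes down to simple arithmetic. The sketch below is an illustration rather than the procedure of any particular search engine: it uses standard monoisotopic residue masses, the function names are hypothetical, and the 0.5 m/z matching tolerance is an assumed value chosen for the example. For the peptide LLEAAAQSTK at charge 2+ it gives a theoretical m/z of about 516.29, close to the 516.27 shown in the figure.

```python
# Standard monoisotopic residue masses in daltons (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free N- and C-termini
PROTON = 1.00728   # mass of each charge-carrying proton

def peptide_mz(sequence, charge):
    """Theoretical m/z of an unmodified peptide carrying `charge` protons."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral_mass + charge * PROTON) / charge

def matches(observed_mz, sequence, charge, tolerance=0.5):
    """True if the observed m/z lies within `tolerance` of the theoretical value."""
    return abs(peptide_mz(sequence, charge) - observed_mz) <= tolerance

theoretical = peptide_mz("LLEAAAQSTK", 2)
print(round(theoretical, 2), matches(516.27, "LLEAAAQSTK", 2))
```

A real database search repeats this comparison for every predicted peptide of every protein in the database, and also scores the fragment ions, which this sketch does not attempt.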
There are many types of mass analysers which use static or dynamic fields and magnetic or electric fields. Each analyser type has its strengths and weaknesses. Four basic types of mass analyser used in proteomic research are: ion trap, time-of-flight (TOF), quadrupole and Fourier transform mass spectrometry (FT-MS) analyser. In ion-trap analysers, the ions are first captured or trapped for a certain time interval and are then subjected to MS or tandem MS (MS/MS) analysis. Ion traps are robust, sensitive and relatively inexpensive. A disadvantage is their relatively low mass accuracy, due in part to the limited number of ions that can be accumulated at their point-like centre before space-charging distorts their distribution and thus the accuracy of the mass measurement. The linear or two-dimensional ion trap is a recent development where ions are stored in a cylindrical volume that is considerably larger than that of the traditional, three-dimensional ion traps, allowing increased sensitivity, resolution and mass accuracy. The FT-MS instrument is also a trapping mass spectrometer, although it captures the ions under high vacuum in a high magnetic field. It measures mass by detecting the image current produced by ions cyclotroning in the presence of a magnetic field. Its strengths are high sensitivity, mass accuracy, resolution and dynamic range. In spite of the enormous potential, the expense, operational complexity and low peptide-fragmentation efficiency of FT-MS instruments have limited their routine use in proteomic research (Aebersold and Mann, 2003). The TOF analyser uses an electric field to accelerate the ions through the same potential and then measures the time they take to reach the detector.
Techniques for ionization have been key to determining what types of samples can be analysed by MS. Electrospray ionization (ESI; Fenn et al., 1989) and matrix-assisted laser desorption/ionization (MALDI; Karas and Hillenkamp, 1988) are the two techniques most commonly used to volatilize and ionize proteins or peptides for MS analysis, while inductively coupled plasma sources are used primarily for metal analysis on a wide array of sample types. MALDI is usually coupled to TOF analysers that measure the mass of intact peptides, whereas ESI has mostly been coupled to ion traps and triple quadrupole instruments and used to generate fragment ion spectra (collision-induced spectra) of selected precursor ions (Aebersold and Goodlett, 2001). ESI creates ions by application of a potential to a flowing liquid, causing the liquid to charge and subsequently spray. The electrospray creates very small droplets of solvent-containing analyte. Solvent is removed by heat or some other form of energy (e.g. energetic collisions with a gas) as the droplets enter the mass spectrometer and multiply-charged ions are formed in the process. ESI ionizes the analytes out of a solution and is therefore readily coupled to liquid-based (for example, chromatographic and electrophoretic) separation tools (Fig. 3.1). MALDI creates ions by excitation of molecules that are isolated from the energy of the laser by an energy-absorbing matrix. The laser energy strikes the crystalline matrix to cause rapid excitation of the matrix and subsequent ejection of matrix and analyte ions into the gas phase. MALDI-MS is normally used to analyse relatively simple peptide mixtures, whereas integrated liquid-chromatography ESI-MS systems (LC-MS) are preferred for the analysis of complex samples.
Key developments leading to improved detection of proteins include TOF MS and relatively non-destructive methods for converting proteins into volatile ions (Zhu et al., 2003). MALDI and ESI have made it possible to analyse large molecules such as peptides and proteins. Although MALDI-TOF MS is a relatively high-throughput method compared with ESI, the latter is more easily coupled with separation techniques such as LC or high pressure LC (HPLC) (Zhu et al., 2003). This has provided an attractive alternative to 2DE, because even low-abundance proteins and insoluble transmembrane proteins can be detected (Ferro et al., 2002; Koller et al., 2002). Other MS techniques include gas chromatography–mass spectrometry (GC-MS), and ion mobility spectrometry/mass spectrometry (IMS/MS). All MS-based techniques require a substantial and searchable database of predicted proteins, ideally
representing the entire genome. Protein identification is possible by comparing the deduced masses of the resolved peptide fragments with the theoretical masses of predicted peptides in the database.
Mass spectrometers are restricted in the number of ions that can be detected at any point in time. Pre-fractionation of proteins on the basis of isolation of specific cell types or subcellular organelles is often necessary to reduce the complexity (Lonosky et al., 2004). Another method of fractionating a complex sample is to introduce a chromatographic technique before MS analysis. This method, referred to as multidimensional protein identification technology (MudPIT) (Whitelegge, 2002), has been used to conduct a shotgun survey of metabolic pathways in the leaves, roots and developing seeds of rice (Koller et al., 2002). Compared with 2DE-MS, each method identifies unique proteins, supporting the complementary nature of the different proteomic technologies.
3.1.3 Yeast two-hybrid system
The yeast two-hybrid assay (Fields and Song, 1989) provides a genetic approach to the identification and analysis of protein–protein interactions. Yeast two-hybrid (Y2H) systems detect not only members of known complexes but also weak or transient interactions (Jansen et al., 2005). The Y2H assay makes use of the molecular organization found in many transcription factors that have a DNA-binding domain and activation domains that can function independently, but when these domains are fused to two proteins that interact, the ability of the domains to control transcriptional activity is reconstituted. In this assay, hybrid proteins are generated that fuse a protein X to the DNA-binding domain and protein Y to the activation domain of a transcription factor (Fig. 3.2a). Interaction between X and Y reconstitutes the activity of the transcription factor and leads to expression of a reporter gene with a recognition site for the DNA-binding domain. In the typical practice of this method, a protein of interest fused to the DNA-binding domain (the so-called bait) is screened against a library of activation-domain hybrids (prey) to select interaction partners (Phizicky et al., 2003).
The key advantages of the Y2H assay are its sensitivity and flexibility (Phizicky et al., 2003). The sensitivity derives in part from overproduction of protein in vivo, their designed direction to the nuclear compartment where interactions are monitored, the large number of variable inserts of the interacting proteins that can be examined at once, and the potency of the genetic selections. This sensitivity leads to the detection of interactions with dissociation constants around 10⁻⁷ M, which is in the range of most weak protein interactions found in the cell and is more sensitive than co-purification. It also allows detection of certain transient interactions that might affect only a subpopulation of the hybrid proteins. Flexibility of the assay is provided by calibration to detect interactions of varying affinity by altering the expression levels of the hybrid proteins, the number and nature of the DNA-binding sites and the composition of the selection media.
Some disadvantages of the Y2H assay include the unavoidable occurrence of false negatives and false positives (Phizicky et al., 2003). False negatives include proteins such as membrane proteins and secretory proteins that are not usually amenable to nuclear-based detection systems, proteins that fail to fold correctly and interactions dependent on domains occluded in the fusions or on post-translational modifications. False positives include colonies not resulting from a bona fide protein interaction, as well as colonies resulting from a protein interaction not indicative of an association that occurs in vivo.
There are several variations of the Y2H system. In the reverse Y2H system, induced URA3 expression leads to 5-FOA being converted into the toxic substance 5-fluorouracil by Ura3p, leading to growth inhibition. Mutated or fragmented genes are created and then subjected to analysis, and only loss-of-interaction mutants are able to grow in the presence of 5-FOA. In the one-hybrid system, the bait is a target DNA fragment fused to a reporter gene. Preys that are able to bind to the DNA fragment–reporter fusion will
lead to activation of the reporter genes (lacZ, HIS3 and URA3). In the repressed transactivator system, the interaction of bait–DNA-binding domain fusion proteins and the prey–repressor domain fusion proteins can be detected by repression of the reporter URA3. The interaction of bait and prey enables cells to grow in the presence of 5-FOA, whereas non-interactors are sensitive to 5-FOA as a result of Ura3p production. In the three-hybrid system, the interaction of bait and prey proteins requires the presence of a third interacting molecule to form a complex. The third interacting molecule can be a protein used with a nuclear localization signal, acting as a bridge between bait and prey to cause transcriptional activation.
Fig. 3.2. Yeast two-hybrid approaches. (a) The yeast two-hybrid system. DNA binding and activation domains (circles) are fused to two proteins X and Y; the interaction of X and Y leads to reporter gene expression (arrow). (b) A standard two-hybrid search. Protein X, present as a DNA binding domain hybrid, is screened against a complex library of random inserts in the activation domain vector (shown in square brackets). (c) A two-hybrid array approach. Protein X is screened against a complete set of full length open reading frames (ORFs) present as activation domain hybrids (shown as yeast transformant spotted on to microtitre plates). (d) A two-hybrid search using a library of full length ORFs. The set of ORFs as activation-domain hybrids (microtitre plates in square brackets) is combined to form a low-complexity library. (e) A two-hybrid pooling strategy. Pools of ORFs as both DNA-binding domain and activation domain hybrids (in square brackets) are screened against each other. From Phizicky et al. (2003) reprinted by permission from Macmillan Publishers Ltd.
Different genome-wide two-hybrid strategies have been used to analyse protein interactions in Saccharomyces cerevisiae. One approach involved screening a large number of individual proteins against a
[Figure graphic not reproduced; no caption survives in this extract. Recoverable labels, upper panel: poly(A) transcripts (AAAAA/TTTTT) cut at GTAC/CATG sites; divide in half; ligate to linkers (A + B); linker-tag constructs with tagging enzyme (TE) and anchoring enzyme (AE) sites and tags drawn as XXXXXXXXX and OOOOOOOOO; ditag; cleave with anchoring enzyme; isolate ditags; concatenate and clone (Tag 1 to Tag 4 separated by AE sites). Lower panel: bar chart of number of tags (0 to 70) for wild type and Pti4 across the genes Ca/b, Pti4, PDF1.2, Di19, Lhcb5, TIP, catalase, oxygen-evolving protein, Germin1, TF MYB60, BAC clone T18N14, ATPase, Chrom. 5 clone and peroxidase.]
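The tag-and-ditag layout in the figure labels (the scheme used in serial analysis of gene expression, SAGE) lends itself to a short parsing sketch. The code below is an illustration under stated assumptions, not the software used to produce the tag counts in the figure: the anchoring-enzyme site is taken to be CATG and the tag length 9 bases, as drawn in the figure, the second tag of each ditag is assumed to be stored as a reverse complement, and the function names and example sequences are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def tags_from_concatemer(concatemer, anchor="CATG", tag_len=9):
    """Split a cloned concatemer at anchoring-enzyme sites and return its tags."""
    tags = []
    for ditag in concatemer.split(anchor):
        if len(ditag) < 2 * tag_len:
            continue  # skip empty flanks and ditags too short to hold two tags
        tags.append(ditag[:tag_len])                       # tag read from the left site
        tags.append(reverse_complement(ditag[-tag_len:]))  # tag read from the right site
    return tags

# Hypothetical tags; the concatemer is assembled the way the figure draws it.
tag_a, tag_b = "AAACCCGGG", "TTTGGGAAA"
ditag_1 = tag_a + reverse_complement(tag_b)
ditag_2 = tag_a + reverse_complement(tag_a)
concatemer = "CATG" + ditag_1 + "CATG" + ditag_2 + "CATG"

counts = Counter(tags_from_concatemer(concatemer))
print(counts)
```

Counting how often each tag occurs is what yields the per-gene tag numbers plotted in the bar chart. A tag that happens to contain the anchoring site itself would be split incorrectly by this simple approach.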
DNA oligonucleotide probes that fluoresce when hybridized with a complementary DNA (cDNA).
Real-time RT-PCR uses fluorophores in order to detect levels of gene expression. As mRNA becomes translated at the ribosome to produce functional proteins, mRNA levels tend to roughly correlate with protein expression. In order to adapt PCR to the measurement of RNA, the RNA sample first needs to be reverse transcribed to cDNA via an enzyme known as a reverse transcriptase. The original RT-PCR technique required extensive optimization
[Figure graphic not reproduced: amplification plots of fluorescence against cycle number (0 to 40) for different starting copy numbers.]
the same source as sample to be tested) and driver cDNA (from a normal sample) to obtain shorter fragments; (iii) divide tester cDNA into two portions and ligate each to a different adaptor, while driver cDNA has no adaptors; (iv) hybridization kinetics lead to equalization and enrichment of differentially expressed sequences among single strand tester molecules; and (v) ultimately generate templates for PCR amplification from differentially expressed sequences. As a result, only differentially expressed sequences are amplified exponentially.
Transcriptional analysis may also be carried out by inserting a reporter gene such as lacZ or GFP (green fluorescent protein) downstream from the promoter under study. lacZ encodes β-galactosidase and its expression is detected by the blue colour obtained in the presence of X-Gal. GFP is a protein containing a chromophore which fluoresces green when excited by ultraviolet to blue light (major excitation peak at 395 nm). These reporters are used to evaluate the expression levels and identify the tissues in which the normal gene is expressed under the chosen promoter.
highly repetitive sequences, while prokaryotes have small genomes, single and circular chromosomes (few linear) with no centromere or telomere, high gene density without introns and very few or no repetitive sequences. The genome size refers to the haploid genome since different cells within a single organism can be of different ploidy. Germ cells are usually haploid and somatic cells diploid. The size of the genome is known as the C-value and is measured by re-association kinetics. After denaturation, the rate of re-association is dependent on genome size. The larger the genome, the more repeated DNA sequences and the longer time to re-anneal, the higher the C-value. C₀t₁/₂ is the product of the DNA concentration and the time required to reach the halfway point of re-association. It is directly related to the amount of DNA in the genome.
The DNA content of haploid genomes ranges from 5 × 10³ bp for viruses to 10¹¹ bp for flowering plants. Within mammals, there is only a twofold difference between the largest and smallest C-value. However, there is up to a 100-fold variation in size within flowering plants. The minimum genome size found in each phylum increases from prokaryotes to mammals (Fig. 3.5).
[Fig. 3.5 graphic not reproduced: DNA content (bp, axis from 10³ to 10¹¹) for flowering plants, birds, mammals, reptiles, amphibians, bony fish, cartilaginous fish, echinoderms, crustaceans, insects, molluscs, worms, fungi, algae, bacteria, mycoplasmas and viruses.]
Fig. 3.5. DNA contents of organisms. Modified from Primrose (1995) and Arumuganathan and Earle (1991).
Among the most important food crops, rice has the smallest genome (389 Mb) (IRGSP, 2005) and wheat the largest (15,966 Mb). According to Arumuganathan and Earle (1991), other crops can be grouped into seven classes: Musa, cowpea and yam (873 Mb); sorghum, bean, chickpea and pigeonpea (673–818 Mb); soybean (1115 Mb); potato and sweet potato (1597–1862 Mb); maize, pearl millet and groundnut (2352–2813 Mb); pea and barley (4397–5361 Mb); and oat (11,315 Mb).
Genome size is often correlated with plant growth and ecology and extremely large genomes may be limited both ecologically and evolutionarily. The manifold cellular and physiological effects of large genomes may be a function of selection of the major components that contribute to genome size such as transposable elements and gene duplication (Gaut and Ross-Ibarra, 2008).
Sequence complexity
called selfish DNA). Some of the sequences are found to cause insertional or deletion mutations such as Alu.
3.2.2 Physical mapping
Physical mapping entails constructing a physical map which consists of continuous overlapping fragments of cloned DNA that have the same linear order as found in the chromosomes from which they are derived. A series of overlapping clones or sequences that collectively span a particular chromosomal region and form a contiguous segment is called a contig. Recommended references for physical mapping include Zhang and Wing (1997), Brown (2002), Meyers et al. (2004) and Lolle et al. (2005).
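The definition of a contig as a set of clones connected by overlaps can be illustrated with a small grouping sketch. It is a simplification under stated assumptions and not a contig assembler: the pairwise overlaps are taken as already established (for example from shared fingerprint bands or shared markers), the clone names and the function name are hypothetical, and the code only groups clones into contigs without ordering them along the chromosome.

```python
def build_contigs(clones, overlaps):
    """Group clones into contigs: clones linked by a chain of overlaps share a contig."""
    parent = {clone: clone for clone in clones}

    def find(clone):
        # Follow parent links to the representative clone, shortening the path on the way.
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]
            clone = parent[clone]
        return clone

    for a, b in overlaps:
        parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for clone in clones:
        groups.setdefault(find(clone), []).append(clone)
    # Sort members and contigs by name so the output is reproducible.
    return sorted((sorted(members) for members in groups.values()), key=lambda m: m[0])

# Hypothetical library: BAC6 overlaps nothing and remains a singleton.
clones = ["BAC1", "BAC2", "BAC3", "BAC4", "BAC5", "BAC6"]
overlaps = [("BAC1", "BAC2"), ("BAC2", "BAC3"), ("BAC4", "BAC5")]
contigs = build_contigs(clones, overlaps)
print(contigs)
```

BAC1 and BAC3 end up in the same contig even though they do not overlap directly, because BAC2 bridges them. Each new overlap that joins two groups reduces the number of contigs by one, which is the sense in which additional evidence merges contigs and singletons.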
because the fixed capacity of the phage head prevents genomes that are too long being packaged into progeny particles. Cosmids are one type of hybrid vector that replicate like a plasmid but can be packaged in vitro into λ phage coats. The vector can accommodate DNA inserts as large as 45 kb.
The YAC vector was developed to maintain inserts of up to 1000 kb. The YAC cloning system includes Tel (yeast telomeres), ARS1 (autonomously replicating sequence), CEN4 (centromere from yeast chromosome 4), URA3 (uracil) and TRP1 (tryptophan) yeast selection marker genes, Amp (ampicillin-resistance gene) and Ori (origin of replication of pBR322). Although the YAC clones played a major role in several genome projects and map-based cloning of many genes in the early 1990s, the following four problems have prevented their further use in genome studies: (i) high percentage of chimaeric clones; (ii) difficulty in DNA preparation and storage; (iii) low transformation efficiency; and (iv) instability of some inserts in yeast. In the rice cultivar Nipponbare, for example, 40% of the clones in the YAC library alone were chimaeric, thus limiting its use for genome sequencing or map-based cloning.
The BAC cloning system is based on the E. coli single copy F factor (Shizuya et al., 1992). It is easy to manipulate, screen and maintain the cloned DNA. It is non-chimaeric, and has high transformation efficiency.
To facilitate gene identification in plant species, second-generation BAC vectors such as BIBAC were constructed (Hamilton et al., 1996). A 150-kb human DNA fragment in the BIBAC vector was transferred into the tobacco genome by Agrobacterium-mediated transformation. A similar vector, called TAC, was developed and used to complement a mutant phenotype in Arabidopsis (Liu et al., 1999). Table 3.1 provides characteristics of several artificial chromosome vectors.
ISOLATION OF HIGH MOLECULAR WEIGHT DNA. Preparing quality high molecular weight (HMW) DNA (most of the DNA > 1 Mb) suitable for large insert library construction can be one of the most difficult steps in constructing a large-insert plant genomic library. There are four predominant problems involved in isolating plant nuclear DNA: (i) plant cell walls must be physically broken or enzymatically digested without damaging nuclei; (ii) chloroplasts must be separated from nuclei and/or preferentially destroyed, an important process since copies of the chloroplast genome may comprise the majority of the DNA within a plant cell; (iii) volatile secondary compounds such as polyphenols must be prevented from interacting with the nuclear DNA; and (iv) carbohydrate matrices that often form after tissue homogenization must be prevented from trapping nuclei.
Several different isolation methods have been developed. The first method was to isolate the protoplast from leaf tissue and then embed the protoplast in low-melting point agarose in the form of a plug or bead. This method is expensive and time consuming. In addition, chloroplast DNA is not separated. The development of methods to isolate nuclei from leaf tissue has dramatically improved the procedure and quality of the HMW DNA for library construction.
PREPARATION OF INSERT DNA FOR LIGATION. The average size of DNA fragments produced by complete digestion with restriction enzymes with four- or six-base recognition sequences is too small for large insert library construction. To obtain relatively HMW restriction fragments (100–300 kb), the popular method is to partially digest the target DNA with a four-base-cut enzyme. Partial DNA digestion not only yields fragments of the desired size but also fragments the genome randomly without exclusion of any sequence.
To determine the conditions that yield a maximum percentage of fragments between 100 and 300 kb, a series of partial digestions is carried out using different amounts of restriction enzyme for a specific digestion period. Once the optimal conditions for producing fragments between 100 and 300 kb are determined, a mass digestion using several plugs is carried out to obtain sufficient DNA for size selection. Partially digested HMW DNA is then subjected to pulsed field gel analysis.
If there is no size selection of partially digested DNA, a random library will have a preponderance of small inserts since small fragments ligate more efficiently and clones with small inserts transform with higher efficiency. Contour-clamped homogeneous electrical field (CHEF) is the most common method for separating large DNA molecules. It uses a hexagonal array of fixed electrodes to generate a homogeneous electrical field that enhances DNA resolution. After two rounds of size selection using CHEF Mapper, the HMW restriction fragments must be removed from surrounding agarose before they can be used in ligation reactions. After developing the large insert library, a number of random clones can be selected to confirm the successful cloning of the inserts and the average insert size. The average insert size will then determine how many clones are needed to achieve the desired amount of genome coverage.
Physical mapping
There are five physical mapping methods: optical mapping; restriction fragment fingerprinting; chromosome walking; sequence tagged site (STS) mapping; and fluorescent in situ hybridization (FISH). In restriction fragment fingerprinting, individual clones are first digested with different restriction enzymes. The digested DNA is then labelled with radioactive or fluorescent dye and run on a sequencing gel. The fingerprint data are collected and analysed for contig assembly. During the procedure, markers with known map position are used as probes to screen the large insert library. Clones hybridized with the same single copy marker are considered to be overlapping. PCR amplification of DNA pools using primers derived from DNA markers with known position can also be used for physical map construction. The disadvantages of this method are that it is labour intensive and filling the gaps is difficult.
STS mapping uses a sequence tagged site (STS), which is a short region of DNA about 200–300 bases long whose exact sequence is found nowhere else in the genome. Two or more clones containing the same STS must overlap and the overlap must include the STS. There are two disadvantages to this method: it is still very labour intensive and the primer synthesis is expensive.
FISH uses synthetic polynucleotide strands that bear sequences known to be complementary to specific target sequences at specific chromosomal locations. The polynucleotides are bound via a series of linked molecules to a fluorescent dye that can be detected with a fluorescence microscope.
In addition, physical mapping can be achieved by a combination of fingerprinting, molecular linkage mapping, STS mapping, end sequencing and FISH mapping. A by-product of physical mapping is the integration of genetic, physical and sequence maps as shown in Fig. 3.6.
3.2.3 Genome sequencing
The sequencing of DNA in laboratories first began in 1978. The first genome of a multicellular eukaryote, Caenorhabditis elegans, was published in 1998. The rationale behind genome sequencing includes
identification of all the genes in the sequenced genome, elucidation of the functions and the interactions of genes in the genome, functional analysis of orthologues in related complex genomes, evolutionary analysis of genes or genomes and product development and commercial application. As the next-generation sequencing technologies continue to facilitate genome sequencing, new applications and new assay concepts (e.g. Huang et al., 2009) have emerged that are vastly increasing our ability to understand genome function, including sequence census methods for functional genomics (Wold and Myers, 2008; Varshney et al., 2009).
[Fig. 3.6 graphic not reproduced. Recoverable labels: human chromosome 16 shown as a cytogenetic map (fragile sites FRA16D and FRA16B; site of hybridization with labelled probe), a somatic cell hybridization map from cultured human–mouse hybrid cells (region of interest between breakpoints CY8 and CY7), a genetic linkage map (region of interest between genetic markers 16AC6.5 and D16S150), BAC and/or PAC contigs, and an STS. The region of interest can be localized either on the physical map (somatic cell hybrid map) or on the genetic map.]
Fig. 3.6. Example of physical mapping and integration of genetic, cytological and physical maps.
Technical developments in DNA sequencing
There are three major milestones in DNA sequencing: (i) the invention of sequencing reactions; (ii) automated fluorescent DNA sequencers; and (iii) PCR. Until the late 1970s, obtaining the DNA sequences of even five to ten nucleotides was difficult and very laborious. The development of two new methods in 1977, that of Maxam and Gilbert (chemical sequencing method) and the other by Sanger and Coulson (enzymatic sequencing), made it possible to sequence large DNA molecules. Later refinements of Sanger's chain termination method made it the preferred procedure since it has proven to be technically simpler.
The modified Sanger sequencing method or chain terminator procedure capitalizes on two properties of DNA polymerases: (i) their ability to synthesize faithfully a complementary copy of a single-stranded DNA template; and (ii) their ability to use 2',3'-dideoxynucleotides as substrates. Once the analogue is incorporated at the growing
point of the DNA chain, the 3' end lacks a hydroxyl group and is no longer a substrate
for chain elongation. Thus, the dideoxynucleotides act as chain terminators.

The development of labelling and detection techniques has contributed to an acceleration
of sequencing procedures, which include 33P labelled primer (1970s); 33P or 35S labelled
primer with sharper image and lower radiation (early 1980s); and fluorescently labelled
primers and dyes in four different reactions (1986). DNA sequencing became automated in
the late 1980s when the primer used for each reaction was labelled with a differently
coloured fluorescent tag. This technology allowed thousands of nucleotides to be
sequenced in a few hours and the sequencing of large genomes then became a reality.
With ABI PRISM technology, up to four different dyes can be used to label DNA, each of
which can be differentiated when run together in the same lane of a gel or injected into a
capillary. For DNA sequencing, this means that the four different dyes representing each
of the DNA bases (A, C, G and T) can be electrophoresed together.

The improvement of polyacrylamide gel electrophoresis (in the late 1980s and early 1990s)
led to high resolution, thinner gels and a sharper image. Capillary electrophoresis (CE)
(1998) offers a number of performance advantages such as faster runs, small sample
volumes and the ability to eliminate manual gel pouring and sample loading tasks.
Walk-away automation reduces instrument-associated labour time by more than 80% over
slab-gel systems. The introduction of CE resulted in the availability of automated
electrophoresis instruments with much lower cost per sample (Amersham's MegaBACE and
Applied Biosystems ABI3700, 3730, etc.). High-throughput sequencing can also incorporate
full automation in colony picking, 96-well plasmid isolation and purification, PCR
reactions, sample loading and sequence data analysis.

The new generation of high-throughput sequencing technologies promises to transform the
scientific enterprise, potentially supplanting array-based technologies and opening up
many new possibilities (Kahvejian et al., 2008; Shendure and Ji, 2008). There are three
commercial next-generation DNA sequencing systems available (Schuster, 2008) which
promise vastly more sequencing capability (> 1 Gb of sequence per run) than standard
capillary-based technology can produce. A high-throughput DNA sequencing technique
using a novel massively parallel sequencing-by-synthesis approach called pyrosequencing
was developed more recently by 454 Life Sciences (Margulies et al., 2005; www.454.com).
454 Sequencing employs clonal DNA fragment amplification on beads in droplets of an
aqueous-oil emulsion, followed by loading the beads into nanoscale (~44 µm) wells of a
PicoTiterPlate, which is a fibre optic chip. In each reaction cycle, one of the four
deoxynucleotide triphosphates (dNTPs) is delivered to the reactor along with DNA
polymerase, ATP sulfurylase and luciferase. Incorporation, which is accompanied by a
chemiluminescent signal, is detected by a high-resolution charge-coupled device (CCD)
sensor. 454 Sequencing is capable of sequencing roughly 100 Mb of raw DNA sequence per
7-h run with their 2007 sequencing machine, the GS FLX Genome Analyzer.

454 Sequencing allows large amounts of DNA to be sequenced at low cost compared to the
Sanger chain-termination methods; G-C rich content is not as much of a problem, and the
lack of reliance on cloning means that unclonable segments are not skipped; it is also
capable of detecting mutations in an amplicon pool at a low sensitivity level. However,
each read of the 2005 sequencing machine GS20 is only 100 bp long, resulting in some
problems when dealing with highly repetitive genomes, as repetitive regions of over
100 bp cannot be bridged and thus must be left as separate contigs. Also, the nature of
the technology lends itself to problems with long homopolymer runs. As one of the
projects using 454 sequencing, Project Jim determined the first sequence of an individual,
the complete genome sequence of James Dewey Watson, in May 2007.
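
The reaction cycle described above for 454 Sequencing, in which one dNTP is delivered per
flow and the light signal reports incorporation, can be sketched as an idealised
"flowgram". The code below is an illustration under simplifying assumptions, not the
vendor's algorithm: the function names, the flow order TACG and the noise-free signal
(exactly one unit of light per incorporated base) are all hypothetical choices, and the
sequence is written as the strand being synthesised.

```python
def flowgram(read: str, flow_order: str = "TACG") -> list:
    """Idealised pyrosequencing flowgram for the strand being synthesised.

    Each flow delivers one nucleotide; the signal equals the number of
    consecutive positions that incorporate it (0 if the next base differs).
    """
    read = read.upper()
    unknown = set(read) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not covered by the flow order: {sorted(unknown)}")
    flows = []
    pos = 0
    while pos < len(read):
        for base in flow_order:
            run = 0
            # A homopolymer run is incorporated entirely within a single flow.
            while pos < len(read) and read[pos] == base:
                run += 1
                pos += 1
            flows.append((base, run))
            if pos == len(read):
                break
    return flows


def call_bases(flows) -> str:
    """Convert (nucleotide, signal) pairs back to a sequence by rounding signals."""
    return "".join(base * round(signal) for base, signal in flows)


if __name__ == "__main__":
    flows = flowgram("TTTACGG")
    print(flows)
    print(call_bases(flows))
```

Because a whole homopolymer run is read from the intensity of a single flow, calling its
length depends on telling, for example, a signal of 7 from a signal of 8. With real, noisy
signals this becomes harder as runs get longer, which is one way to picture the
homopolymer problem mentioned above.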
The second high-throughput sequencing technique is Solexa (Illumina, Inc.;
http://www.illumina.com), which depends on sequencing by synthesis. Diluted DNA templates
are attached to a solid planar surface and then amplified clonally. Sequencing is
performed by delivering a mixture of four differentially labelled reversible chain
terminators along with DNA polymerase. The resulting signal is detected at each cycle and
a new cycle can be initiated after terminator removal (Bennet et al., 2005). Current
average read lengths are about 30–40 bases with 1 Gb per run.

The third high-throughput sequencing technique is the SOLiD System, which enables
massively parallel sequencing of clonally amplified DNA fragments linked to beads. The
SOLiD sequencing methodology is based on sequential ligation with dye-labelled
oligonucleotides. The technology is described as combining high accuracy, ultra-high
throughput and application flexibility, with throughput approaching 20 Gb per run. Two
independent flow cells, each capable of running 1, 4 or 8 samples, allow multiple
experiments to be conducted in a single run. With greater than 99.9% overall accuracy, the
SOLiD System enables large-scale sequencing and tag-based experiments to be completed
more cost effectively than previously possible.

There are several emerging sequencing methods: sequencing by hybridization; mass
spectrophotometric techniques; direct visualization of single DNA molecules by atomic
force microscopy; and single-molecule sequencing strategies. The intense drive towards
developing technology that can sequence a complete human genome for under US$1000
will ensure that the speed and cost of sequencing will continue to improve rapidly
(Schuster, 2008). For example, a nanopore-based device provides single-molecule
detection and analytical capabilities that are achieved by electrophoretically driving
molecules in solution through a nano-scale pore. Further research and development to
overcome current challenges to nanopore identification of each successive nucleotide
in a DNA strand offers the prospect of third generation instruments that will sequence a
diploid mammalian genome for US$1000 in 24 h (Branton et al., 2008).

Sequencing strategies

There are two general genome sequencing strategies: (i) clone-by-clone or hierarchical
sequencing (International Human Genome Sequencing Consortium, 2001); and (ii)
whole-genome shotgun sequencing (Venter et al., 2001). After constructing the complete
physical map, clone-by-clone sequencing can be started in any specific region. The
clone-by-clone or hierarchical sequencing strategy has the following advantages: (i) the
ability to fill gaps and re-sequence the uncertain regions; (ii) the ability to distribute
the clones to other laboratories; and (iii) the ability to check the produced sequence by
restriction enzymes. The main disadvantages are that the construction of a physical map
is expensive and time consuming and that experienced personnel are required.

The shotgun sequencing strategy consists of making small insert libraries (1–10 kb) from
the genomic DNA of an organism, sequencing a large number of clones (six to eight times
redundancy) and assembling contigs using bioinformatics software. It requires no physical
map construction and carries less risk of recombinant clones. It is cost effective and
fast and ideal for small genome sequencing. However, it is difficult to fill gaps and
re-track all the sequenced plasmids and the resulting data are less useful for positional
cloning. Figure 3.7 compares the two sequencing methods.

COMBINING CLONE-BY-CLONE AND SHOTGUN SEQUENCING STRATEGIES. In 1997 The Institute for
Genomic Research (TIGR) launched the initiative of a whole-genome shotgun approach
for the human genome. But BACs, BAC end sequences and STS markers were used
extensively in assembling the sequencing data from shotgun clones. The first draft of
the human genome was completed within 3 years compared with the 12 years taken by
the Human Genome Project, which is funded by government agencies.
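
The six- to eightfold redundancy quoted above for shotgun sequencing can be related to the
expected completeness of the assembly. The sketch below uses the idealised Lander-Waterman
model, which is not part of the text and is introduced here as an assumption: reads are
taken to land uniformly at random and any overlap is assumed detectable. The genome and
read lengths are hypothetical.

```python
import math


def reads_for_redundancy(c: float, read_len: int, genome_len: int) -> int:
    """Number of reads needed for c-fold redundancy (c = N * L / G)."""
    return math.ceil(c * genome_len / read_len)


def fraction_unsequenced(c: float) -> float:
    """Expected fraction of the genome not covered by any read: exp(-c)."""
    return math.exp(-c)


def expected_contigs(n_reads: int, c: float) -> float:
    """Expected number of contigs (and hence gaps) at redundancy c: N * exp(-c)."""
    return n_reads * math.exp(-c)


if __name__ == "__main__":
    GENOME_LEN = 5_000_000  # hypothetical small genome (bp)
    READ_LEN = 500          # hypothetical read length (bp)
    for c in (1, 2, 6, 8):
        n = reads_for_redundancy(c, READ_LEN, GENOME_LEN)
        print(f"{c}x: {n} reads, "
              f"{fraction_unsequenced(c):.4%} of genome unsequenced, "
              f"about {expected_contigs(n, c):.0f} contigs")
```

Under these assumptions the uncovered fraction falls exponentially with redundancy, which
is consistent with the practice of sequencing to several-fold redundancy while still
expecting some gaps to remain for the finishing stage.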
[Figure 3.7 panel labels: 3. Take subset of clones, fragment and sequence; U-unitigs; Rock;
Stones; 50 kb mates; Gap; Scaffold; STSs; Link mapped scaffold to existing map.]
Fig. 3.7. Comparison of two sequencing strategies: assembly of a mapped scaffold. U-unitigs are
assembled into scaffolds using mate-pair information to bridge gaps between two U-unitigs, and by
linking unitigs to rock, which are less-well supported unitigs that nevertheless fit in place according
to at least two independent large insert mate pairs. Stones are single short contigs whose position is
supported by only a single read. Gaps are filled in the finishing stage by further site-directed sequencing.
Scaffolds are placed against existing genetic and physical maps by sequence tagged site (STS) matches
and against the cytological map by fluorescent in situ hybridization (FISH).
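
The caption's idea of using mate-pair information to bridge the gap between two unitigs can
be illustrated with a small calculation. This is a simplified sketch, not the assembler's
actual procedure: it assumes each mate pair comes from a clone of known insert size (50 kb
is used as an illustrative value), that one read starts on unitig A and its mate ends on
unitig B, and that both unitigs are in the same orientation. All coordinates and function
names are hypothetical.

```python
from statistics import mean


def gap_from_mate_pair(insert_size: int, len_a: int, start_on_a: int, end_on_b: int) -> int:
    """Estimate the gap between unitig A and unitig B from one mate pair.

    The clone spans from start_on_a (on A) to end_on_b (on B), so
    insert_size = (len_a - start_on_a) + gap + end_on_b.
    """
    return insert_size - (len_a - start_on_a) - end_on_b


def estimate_gap(insert_size: int, len_a: int, mates) -> float:
    """Average the per-pair estimates; mates is a list of (start_on_a, end_on_b)."""
    return mean(gap_from_mate_pair(insert_size, len_a, s, e) for s, e in mates)


if __name__ == "__main__":
    INSERT_SIZE = 50_000   # illustrative clone insert size (bp)
    LEN_A = 120_000        # hypothetical length of unitig A (bp)
    mates = [(100_000, 25_000), (90_000, 15_500)]  # hypothetical read positions
    print(estimate_gap(INSERT_SIZE, LEN_A, mates))
```

Requiring at least two independent mate pairs before accepting a link, as the caption
describes for rocks, guards against a single chimaeric or misplaced clone producing a
false join.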
high C0t and MF, Martienssen et al. (2004) generated up to twofold coverage of the
gene space with less than one million sequencing reads, and simulations using
sequenced BAC clones predicted that 5× coverage of gene-rich regions, accompanied by
less than 1× coverage of subclones from BAC contigs, will generate a high quality mapped
sequence that meets the needs of geneticists while accommodating unusually high levels
of structural polymorphism. Haberer et al. (2005) selected 100 random regions averaging
144 kb in size, representing about 0.6% of the genome, to define their content of genes
and repeats for characterizing the structure and architecture of the maize genome.
Combining CBCS with genome filtration can greatly reduce the cost while retaining the
high coverage of genic regions. An alternative approach is the identification of gene-rich
regions on a detailed physical map and sequencing large-insert clones from these regions.

Plant genomic sequences

The first complete plant genome to be sequenced was that of Arabidopsis. The sequenced
regions cover 115.4 Mb of the 125-Mb genome and extend into centromeric regions. The
evolution of Arabidopsis involved a whole genome duplication followed by subsequent gene
loss and extensive local gene duplications. The genome contains 25,498 genes encoding
proteins from 11,000 families (The Arabidopsis Genome Initiative, 2000). Arabidopsis
contains many families of new proteins but also lacks several common protein families.
The proportion of predicted Arabidopsis genes in different functional categories is
provided in Fig. 3.8. The complete genome sequence provides the foundation for more
comprehensive comparison of conserved processes in all eukaryotes, identifying a wide
range of plant-specific gene functions and establishing rapid systematic methods of
identifying genes for crop improvement (Varshney et al., 2009).
[Fig. 3.8 labels, as extracted: Unclassified 10%; Metabolism 11%; Cellular organization 5%;
Intracellular traffic 3%; Protein destination 12%.]
Rice was the first crop to be fully sequenced because of its importance as one of the
major cereals and also because of its small genome size, small number of chromosomes
(n = 12), well characterized genetic and genomic resources and availability of a large
number of DNA markers and a high density genetic linkage map. Two draft sequences were
completed in 2002 (Goff et al., 2002; Yu et al., 2002) and a complete sequence was
published in 2005 (IRGSP, 2005), which is available in the National Center for
Biotechnology Information (NCBI) database.

Many sequencing projects for important crop species are currently ongoing. The US
Department of Energy's Joint Genome Institute (JGI) is providing funding and technical
assistance to decode the genomes of several major plants, including cassava (Manihot
esculenta), cotton (Gossypium), foxtail millet (Setaria italica), sorghum, soybean and
sweet orange (Citrus sinensis L.) (http://www.jgi.doe.gov/sequencing/).

Other plants for which there are ongoing genome sequencing projects include Medicago
truncatula (http://www.medicago.org/genome), Lotus japonicus (http://www.kazusa.or.jp),
poplar, tomato (http://www.sgn.cornell.edu) and grapevine.

The International Wheat Genome Sequencing Consortium (IWGSC) has been formed to advance
agricultural research for wheat production and utilization by developing DNA-based tools
and resources that result from the complete sequencing of the expressed genome of common
(hexaploid) bread wheat and to ensure that these tools and the sequences are available
for all to use without restriction and without cost (Gill et al., 2004;
http://www.wheatgenome.org/). A Global Musa Genomics Consortium (GMGC) is decoding the
Musa genome (http://www.newscientist.com/article.ns?id-dn1037). A Global Cassava
Partnership, an alliance of the world's leading cassava researchers and developers, has
proposed that sequencing the cassava genome should be a priority (Fauquet and Tohme, 2004).

To sequence the maize genome, two consortia in the USA began a pilot study: one with Jo
Messing (Rutgers University), Rod Wing (Arizona University), Ed Coe (University of
Missouri), Mark Vaudin (Monsanto) and Steve Rousley (Cereon); the other included Jeff
Bennetzen (Purdue University), Karel Schubert and Roger Beachy (Danforth Center), Cathy
Whitelaw and John Quackenbush (TIGR) and Nathan Lakey (Orion). These two pioneer
programmes have been extended by a massive US programme from the National Science
Foundation (NSF), USDA and the Department of Energy (DOE) led by Rick Wilson (Washington
University). The sequencing strategy is a hybrid between a BAC-by-BAC approach and a
whole-genome shotgun.

3.2.4 cDNA sequencing

Why cDNA sequencing

Large-scale DNA sequencing can be carried out on genomic DNA or cDNAs. There are four
advantages to performing cDNA sequencing. First is the cost of sequencing a whole genome.
Although DNA sequencing costs have fallen more than 50-fold over the past decade, it
still costs around US$10 million to sequence three billion base pairs. It will take years
to realize the goal of lowering the cost of sequencing a mammalian-sized genome to
US$100,000 and ultimately cutting the cost of whole-genome sequencing to US$1000 or less.

Secondly, the interpretation of the genomic sequence of eukaryotes is not straightforward,
in contrast to prokaryotes: coding regions are separated by non-coding regions; introns
and alternative splicing occur; one gene can lead to multiple mRNAs and gene products; and
a significant fraction of genomic DNA does not code for proteins (non-coding sequences).

Thirdly, cDNA sequencing helps in annotation and identification of exons and introns.
Estimates of the number of human genes vary from 30,000 to 80,000. The accuracy of the
Arabidopsis genome annotation varied from 50 to 70% in the first draft. Many Arabidopsis
genes are still not accurately annotated.

Fourthly, sequencing cDNAs helps gain information about the transcriptome.
mRNA populations are variable among cells. The transcriptome is dynamic and constantly
changing. Cells adapt to environmental, developmental and other signals by modulating
their transcriptome. mRNA populations form an important level of regulation between
signal perception and response. Genetically identical cells can exhibit distinct
phenotypes. cDNA sequencing allows direct insight into mRNA populations and allows the
dissection of the transcriptome, which genomic sequencing alone does not provide.
Sequencing of random cDNA clones prepared from different tissues also allows analyses of
mRNA abundance.

cDNA libraries

When constructing a representative cDNA library, the source of the mRNA for the cDNA
library is critical and will vary depending on the goal of the study. To estimate the
diversity of mRNAs expressed in a given plant, the mRNA should represent most plant
tissues and organs. On the other hand, to define the diversity of mRNAs represented in a
specific tissue, organ or developmental stage, the library should be prepared from the
most highly defined source feasible. As indicated by Nunberg et al. (1996), it is better
to invest the time to harvest sufficient quantities of scarce tissue for a library rather
than using materials which will contain a significant proportion of extraneous messages.

If large quantities of RNA are available, it is possible to create a plasmid library
directly. This is particularly feasible since electroporation transformation efficiencies
are so high. Plasmid libraries may or may not be directional and are easily arranged in an
ordered array. Constructing plasmid libraries directly avoids any sequence bias, including
internal deletion and trans recombination that may occur during the excision process.

The frequency of full-length cDNAs depends on the length of transcript (the longer the
transcript the lower the frequency of obtaining full-length cDNAs). Carninci and
Hayashizaki (1999) discussed the high-efficiency of full-length cDNA cloning using a cap
trapper method (biotinylated cap) and thermoactivation of reverse transcriptase (cDNA
synthesis at 60°C: RNA secondary structures are melted). Some normalization and
subtraction methods also allow enrichment of full-length cDNAs.

For a given mRNA, multiple expressed sequence tags (ESTs) can be obtained. Depending on
the extent of sampling, ESTs may or may not overlap. EST processing is needed to remove
vector sequences and linker sequences, check the quality using a sequence quality filter,
clean up the contaminants and chimaeric sequences and store the results in databases. To
construct EST contigs, there are two commonly used programs: Phrap/consed and TIGR
assembler. These programs generate a unigene set (contigs or Tentative Consensus): a
consensus sequence for all overlapping ESTs that (supposedly) correspond to a single mRNA.

Several factors affect the quality of EST contigs: contaminating sequences, bad quality
sequences, non-overlapping ESTs from the same mRNA, alternative splicing resulting in one
gene with multiple mRNAs and closely related genes (chimaeric contigs). EST annotation can
be carried out using similarity searches against GenBank and other databases, e.g. protein
motif databases, to assign a putative function or identify functional categories. This
process can be automated or manual (usually a combination of the two).

Non-random (normalized or subtracted) cDNA libraries are needed to overcome some of the
problems with redundant ESTs and to saturate EST databases when budget is limited or when
there is a specific interest in a particular stage. Hybridization-based methods are most
commonly used to decrease redundancy (reduce representation of abundant cDNAs and increase
rare cDNAs). Normalized cDNA libraries are used when gene discovery is the main objective
of the EST project.

cDNA sequencing

Strategies for cDNA sequencing include single-pass cDNA sequencing (ESTs),
be laborious to clone full-length cDNA) and simple gene identification that is limited
by sequences that are already in a database (otherwise the corresponding gene must be
cloned).

Several alternative technologies have emerged for measuring transcript abundance in a
parallel fashion. Essentially, these methods can be divided into three categories
according to their underlying principle, namely PCR-, sequencing- or hybridization-based
technologies. Therefore, strategies that are currently available for analysis of
transcriptomes include RT-PCR (qualitative and quantitative), hybridization methods
(northern blots, macroarrays, DNA microarrays, oligonucleotide microarrays), cDNA
fingerprinting (differential display, cDNA-AFLP), cDNA sequencing (full-length cDNAs,
subtracted cDNAs, normalized cDNA libraries, SAGE, massively parallel signature
sequencing (MPSS)) and combinations of the above techniques.

The most straightforward and unbiased method of analysing an RNA population is the
sequencing of cDNA libraries and quantitative analysis of the resulting ESTs.
Traditionally, ESTs with read lengths of about 200–900 nucleotides have been produced by
Sanger sequencing but the associated costs have severely limited the resolution of this
approach (Busch and Lohmann, 2007). Deep sequencing has become a viable alternative for
unbiased large-scale expression profiling because of the development of new protocols and
entirely new sequencing techniques. Non-gel-based sequencing techniques promise to deliver
greatly increased throughput and a considerable cost reduction. MPSS combines in vitro
cloning of millions of template tags on separate microbeads with ligation-mediated
sequence detection. In each reaction cycle, a four-base overhang is produced on every tag
to which a fluorescently labelled adaptor of defined sequence is ligated. The position
and fluorescence of every microbead is monitored by a high resolution camera in each of
the reaction cycles, allowing the sequences of the 17-nucleotide tags to be reconstructed
(Brenner et al., 2000). As indicated by Busch and Lohmann (2007), the limited length of
the sequenced tags precludes the use of MPSS for de novo sequencing but makes it a very
powerful tool for expression profiling of organisms with pre-existing sequence
information. By contrast, two other high-throughput sequencing techniques as described
previously, 454 and Solexa, are ideally suited for expression-profiling purposes. Short
tags are sufficient to identify a transcript unambiguously and therefore problems arising
from assembling short tags into larger contigs can be ignored.

PCR product-based arrays were heavily used in the early days of global transcriptome
analysis. However, the low level of standardization among laboratories, high levels of
noise and experimental variation and cross-hybridization between homologous transcripts
have eroded the attractiveness of these arrays. Oligonucleotide-based microarrays are now
becoming the most popular technology for large-scale expression profiling because they
allow the simultaneous detection of tens of thousands of transcripts at a reasonable
cost. The expression level of any gene represented on the array can be deduced from the
fluorescence intensity of the corresponding probe. However, microarrays only offer linear
expression measurements over a range of three orders of magnitude compared to
quantitative RT-PCR, which has a dynamic range of five orders of magnitude. Microarrays
perform with less precision and sensitivity than other techniques when used for measuring
low abundance transcripts in particular and this is manifested in their greater
inter-assay variability (Busch and Lohmann, 2007). Another major limitation of microarrays
designed for expression analysis is that they rely on current genome annotations, which
precludes the identification of novel or very small transcription units.

Microarrays and quantitative RT-PCR have dominated expression profiling to date but deep
sequencing and whole-genome tiling arrays will become increasingly important because these
techniques are not limited to the detection of known transcripts. Tiling arrays, on which
the entire
genome is represented by evenly spaced probes, provide a novel means of transcript
identification. In Arabidopsis, tiling arrays have been used to map transcriptionally
active regions by profiling four different tissues (Yamada et al., 2003).

The interaction transcriptome is the sum of all microbe and host transcripts that are
produced during the interaction. The challenges in studying interaction transcriptomes
include how to discriminate pathogen from host ESTs, similarity searches to genome/cDNA
sequences, GC analyses and determination of hexamer frequency (windows of 6 bp). Systems
genomics/transcriptomics can be used to analyse complex transcriptomes, for example the
mixtures of mRNAs from different species (e.g. infected tissue, environmental samples such
as soil or seawater, etc.). One challenge is to identify the species of origin in the
mixtures.

3.3.2 Proteomics

Proteomics is the study of the identification, function and regulation of complete sets
of proteins in a tissue, cell or subcellular compartment. Such information is crucial to
understanding how complex biological processes occur at a molecular level and how they
differ in various cell types, stages of development or environmental conditions
(Bourgualt et al., 2005). Proteomics is important as proteins are active agents in cells
and they execute the biological functions encoded by genes. Sequences of genes (or
genomes) and transcriptome analyses are not sufficient to elucidate biological functions.
Proteomics complements transcriptomics by providing information about the time and place
of protein synthesis and accumulation, as well as identifying those proteins and their
post-translational modifications. Gene expression does not necessarily indicate whether a
protein is synthesized, how fast it is turned over or which possible protein isoforms are
synthesized (Mathesius et al., 2003). In some cases, the correlation between gene
expression and protein presence is as low as 0.4. First, the level of transcription of a
gene gives only a rough estimate of its level of expression into a protein. An mRNA
produced in abundance may be degraded rapidly or translated inefficiently, resulting in a
small amount of protein. Secondly, many proteins experience post-translational
modifications that profoundly affect their activities; for example, some proteins are not
active until they become phosphorylated. Methods such as phosphoproteomics and
glycoproteomics are used to study post-translational modifications. Thirdly, many
transcripts give rise to more than one protein through alternative splicing or
post-translational modifications. It is generally supposed that if genomes contain tens of
thousands of gene sequences, the proteome comprises several hundred thousand proteins as a
result of alternative splicing and post-translational modifications. Finally, many
proteins form complexes with other proteins or RNA molecules and only function in the
presence of these molecules.

Proteomics has become an important approach for investigating cellular processes and
network functions. Significant improvements have been made in technologies for
high-throughput proteomics, both at the level of data analysis software and mass
spectrometry (MS) hardware (Baginsky and Gruissem, 2006). In this section, proteomics
will be briefly discussed. For further details, readers are referred to the following
review articles: van Wijk (2001), Molloy and Witzmann (2002), de Hoog and Mann (2004),
Saravanan et al. (2004), Baginsky and Gruissem (2006), Cravatt et al. (2007) and Zivy
et al. (2007).

Protein extraction

Obtaining high quality protein is the first step in proteomic research. Extracting
protein from plant tissue requires tissue disruption by grinding and sonication,
separation of proteins from unwanted cell materials (cell wall, water, salt, phenolics,
nucleic acids) by centrifugation after precipitation of proteins with
acetone-trichloroacetic acid, resolubilizing protein in a solution that dissolves the
maximum number of different proteins and inactivation of protease
by acetone-trichloroacetic acid treatment or specific protease inhibitors.
Pre-fractionation of tissue is optional for the analysis of proteins from different
organelles or microsomal fractions. Solubilization requires urea or, for more hydrophobic
proteins, thiourea, as a chaotrope which solubilizes, denatures and unfolds most proteins.
Non-ionic or zwitterionic detergents, e.g.
3-[3-cholamidopropyl-dimethyl-ammonio]-1-propane sulfonate (CHAPS), Triton-X, or
amidosulfobetaines are used to solubilize and separate proteins in a mixture. Sodium
dodecyl sulphate (SDS) is also a strong detergent and is used to solubilize membrane
proteins. However, it renders a negative charge to proteins and, therefore, interferes
with isoelectric focusing (Mathesius et al., 2003). Reducing agents (usually
dithiothreitol [DTT], 2-mercaptoethanol or tributyl phosphine) are needed to disrupt
disulfide bonds.

Protein identification and quantification

N- or C-terminal sequencing has made protein identification possible on a small scale
although with limitations. Improvements in MS have made it possible to identify proteins
faster, on a larger scale, using smaller amounts of protein. In addition,
post-translational modifications can be determined by MS/MS analysis and proteins can be
identified even when bound to other proteins in complexes. A standard technique for
protein identification with MALDI-TOF MS is peptide mass fingerprinting. Protein spots in
a gel can be visualized using a variety of chemical stains or fluorescent markers.
Proteins can often be quantified by the intensity with which they stain. Once proteins
have been separated and quantified, they can be identified. Individual spots are cut out
of the gel and cleaved into peptides with proteolytic enzymes. These peptides can then be
identified by MS, specifically MALDI-TOF MS. The MALDI-TOF analysis will measure very
precisely (< 0.1 Da) the mass of peptides formed by this digestion. Since the cleavage
sites are known, the digestion can be simulated by informatics, that is, the masses of
all the peptides produced by this digestion can be calculated for all the proteins of
known sequence in a given organism (Zivy et al., 2007). These masses will depend on the
length of peptides and their composition since most amino acids have different masses.
Thus, masses predicted from sequences stored in databases can simply be compared with
masses actually measured by the MALDI-TOF equipment. The greater the number of positive
mass matches, the more likely it is that the peptides originate from the same protein,
thus facilitating the rapid identification of proteins.

Protein profiling

Protein mixtures of considerable complexity can now be routinely characterized in some
detail. One measure of technical progress is the number of proteins identified in each
study. Such numbers can now reach the thousands for suitably complex samples. Large-scale
proteomic studies are needed to solve three types of biological problem (Aebersold and
Mann, 2003): (i) the generation of protein-protein linkage maps; (ii) the use of protein
identification technology to annotate and, if necessary, correct genomic DNA sequences;
and (iii) the use of quantitative methods to analyse protein expression profiles as a
function of the cellular state as an aid to inferring cellular function.

The sequences of many mature proteins in higher eukaryotes after processing and splicing
are often not directly apparent from their cognate DNA sequences. Peptide sequence data
of sufficient quality provide unambiguous evidence of translation of a particular gene
and can, in principle, differentiate between alternatively spliced or translated forms of
a protein (Aebersold and Mann, 2003). Thus, it might be tempting to systematically analyse
the proteins expressed by a cell or tissue, that is, to generate comprehensive proteome
maps.

The more common and versatile use of large-scale MS-based proteomics has been to document
the expression of proteins as a function of cell or tissue state. Aebersold and Mann
(2003) argued that to be meaningful, such data must be at least
semi-quantitative and that a simple list of proteins detected in the different states is insufficient. This is because analyses of complex mixtures are often not comprehensive and therefore the non-appearance of a particular sequence in the list of identified peptides does not indicate that the peptide or protein was not originally present in the sample. Additionally, it is often impossible to prepare a certain cell type, cell fraction or tissue in completely pure form without trace contamination from other fractions. And because the ion current of a peptide is dependent on a multitude of variables that are difficult to control, this measure is not a good indicator of peptide abundance. If stable-isotope dilution has not been used, a rough relative estimate of the quantity of a protein can be obtained by integrating the ion current of its peptide-mass peaks over their elution time and comparing these extracted ion currents between states, provided that highly accurate and reproducible methods are used. Increasingly, stable-isotope dilution and LC-MS/MS are used to accurately detect changes in quantitative protein profiles and to infer biological function from the observed patterns (Aebersold and Mann, 2003).
Protein–protein interactions
Protein–protein interactions occur among most proteins and there are six types of interfaces found in protein–protein interactions: domain–domain, intra-domain, hetero-oligomer, hetero-complex, homo-oligomer, and homo-complex. The analysis of protein–protein interactions can be either qualitative or quantitative. Traditional biochemical methods such as co-purification and co-immunoprecipitation have been used to identify the members of protein complexes. Proteomics-based strategies have been used to determine the composition of complexes and to establish interaction networks. The systematic, large-scale, high-throughput approaches now being taken to build maps of the interactions between proteins predicted by genome sequence information have become known as interactomics (Causier et al., 2005).
There are many important characteristics of a protein–protein interaction. Obviously, it is important to know which proteins are interacting. In many experiments and computational studies, the focus is on interactions between two different proteins. However, one protein can interact with other copies of itself (oligomerization) or with three or more different proteins. The stoichiometry of the interaction is also important, that is, how many of each protein involved are present in a given reaction. Some protein interactions are stronger than others because they bind together more tightly. The strength of binding is known as affinity. Proteins will only bind to each other spontaneously if it is energetically favourable. Energy changes during binding are another important aspect of protein interactions. Many of the computational tools that predict interactions are based on the energy of interactions.
Protein interaction maps represent essential components of the post-genomic tool kits needed for understanding biological processes at a systems level. Over the past decade, a wide variety of methods have been developed to detect, analyse and quantify protein interactions, including surface plasmon resonance spectroscopy, nuclear magnetic resonance (NMR), Y2H screens, peptide tagging combined with MS and fluorescence-based technologies. Lalonde et al. (2008) and Miernyk and Thelen (2008) reviewed the latest techniques and current limitations of biochemical, molecular and cellular approaches for the detection of protein–protein interactions. In vitro biochemical strategies for identifying and characterizing interacting proteins include co-immunoprecipitation, blue native gel electrophoresis, in vitro binding assays, protein cross-linking and rate-zonal centrifugation. Fluorescence techniques range from co-localization of tags, which may be limited by the optical resolution of the microscope, to fluorescence resonance energy transfer (FRET)-based methods that have molecular resolution and can also report on the dynamics and localization of the interactions within a cell. Proteins interact via highly evolved complementary surfaces with affinities that
can vary over many orders of magnitude. Some of the techniques such as surface plasmon resonance provide detailed information regarding the physical properties of these interactions. To analyse protein complexes systematically at a sub- or full-genome level, several methods have been adapted for high-throughput screens using robotics: (i) Y2H systems; (ii) the mating-based split-ubiquitin system (mbSUS); and (iii) affinity purification of protein complexes followed by identification of proteins by MS (AP-MS).
One of the first questions usually asked about a new protein, apart from where it is expressed, is to what proteins does it bind? To study this question by MS, the protein itself is used as an affinity reagent to isolate its binding partners. Compared with two-hybrid and array-based approaches, this strategy has the advantages that the fully processed and modified protein can serve as the bait, that the interactions take place in the native environment and cellular location and that multi-component complexes can be isolated and analysed in a single operation (Ashman et al., 2001). However, because many biologically relevant interactions are of low affinity, transient and generally dependent on the specific cellular environment in which they occur, MS-based methods in a straightforward affinity experiment will detect only a subset of the protein interactions that actually occur (Aebersold and Mann, 2003). Bioinformatics methods, correlation of MS data with those obtained by other methods or iterative MS measurements, possibly in conjunction with chemical crosslinking (Rappsilber et al., 2000), can often help to further elucidate direct interactions and overall topology of multi-protein complexes.
The ability of quantitative MS to detect specific complex components within a background of non-specifically associated proteins increases the tolerance for high background and allows for fewer purification steps and less stringent washing conditions, thus increasing the chance of finding transient and weak interactions. The same methods can be used to study the interaction of proteins with nucleic acids, small molecules and in fact with any other substrate. For example, drugs can be used as affinity baits in the same way as proteins to define their cellular targets and small molecules such as cofactors can be used to isolate interesting sub-proteomes (MacDonald et al., 2002).
The Y2H system has become one of the standard laboratory techniques for the detection and characterization of protein–protein interactions. It can be used to map individual amino acid residues involved in a specific protein–protein interaction. It can also be used to identify novel interactions from complex libraries of expressed proteins. The Y2H system has been widely used for determination of protein interaction networks within different organisms. In plants, the Y2H system has been successfully applied to detect interactions with phytochromes, cryptochromes, transcription factors, proteins involved in self-incompatibility mechanisms, the circadian clock and plant disease resistance (Causier et al., 2005). Taken together with the recent progress made in the development of large-scale Y2H screening procedures, the time is now ripe for large-scale Y2H screens to be applied to organisms such as Arabidopsis and rice.
Another potential method to detect protein–protein interactions involves the use of FRET between fluorescent tags on interacting proteins. FRET is a non-radiative process whereby energy from an excited donor fluorophore is transferred to an acceptor fluorophore that is within 60 Å of the excited fluorophore (Wouters et al., 2001). After excitation of the first fluorophore, FRET is detected either by emission from the second fluorophore using appropriate filters or by alteration of the fluorescence lifetime of the donor. Two fluorophores that are commonly used are variants of GFP: cyan fluorescent protein (CFP) and yellow fluorescent protein (YFP) (Tsien, 1998). The potential of FRET is considerable, for two reasons (Phizicky et al., 2003). First, it can be used to make measurements in living cells, which allows the detection of protein interactions at the location in the cell where they normally occur in the presence of the normal cellular
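The steep distance dependence that makes FRET useful as a proximity reporter can be illustrated with the standard Förster relation, E = 1 / (1 + (r / R0)^6). The short Python sketch below is only an illustration under stated assumptions: the Förster radius R0 of 50 Å is an assumed round value chosen for the example and is not taken from the text, and the function names are hypothetical. The second function shows how an efficiency could be estimated from the shortening of the donor fluorescence lifetime, the second detection route mentioned above.

```python
def fret_efficiency(r_angstrom, r0_angstrom=50.0):
    """Foerster relation: E = 1 / (1 + (r / R0)**6).

    r0_angstrom is the donor-acceptor distance at which E = 0.5.
    The default of 50 Angstrom is an assumed illustrative value.
    """
    if r_angstrom < 0 or r0_angstrom <= 0:
        raise ValueError("r must be non-negative and R0 must be positive")
    return 1.0 / (1.0 + (r_angstrom / r0_angstrom) ** 6)


def efficiency_from_lifetimes(tau_donor_alone, tau_donor_with_acceptor):
    """Estimate E from donor lifetimes: E = 1 - tau_DA / tau_D."""
    if tau_donor_alone <= 0 or tau_donor_with_acceptor < 0:
        raise ValueError("lifetimes must be positive")
    return 1.0 - tau_donor_with_acceptor / tau_donor_alone


if __name__ == "__main__":
    # Efficiency falls off very quickly once r exceeds R0.
    for r in (30, 50, 60, 100):
        print(f"r = {r:>3} A  ->  E = {fret_efficiency(r):.3f}")
```

With the assumed R0, transfer efficiency drops to about one quarter at 60 Å and to under 2% at 100 Å, which is why a FRET signal is usually read as evidence that the two tagged proteins are in very close proximity.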
change in environmental conditions on particular metabolites (Verdonk et al., 2003). Sample preparation is focused on isolating and concentrating the compound of interest to minimize detection interference from other components in the original extract. Metabolite profiling refers to a qualitative and quantitative evaluation of metabolite collections, for example those found in a particular pathway, tissue or cellular compartment (Burns et al., 2003). Finally, metabolic fingerprinting focuses on collecting and analysing data from crude extracts to classify whole samples rather than separating individual metabolites (Johnson et al., 2003; Weckwerth, 2003).
In stark contrast to transcriptomics and proteomics, metabolomics is mainly species-independent, which means that it can be applied to widely diverse species with relatively little time required for re-optimizing protocols for a new species. Metabolite profiling can monitor variation in the accumulation of metabolites in plant cells in culture which are ectopically expressing transcription factors, as a hypothesis-generating tool to establish the possible pathways regulated by particular regulatory proteins. The first step consists of generating a transgenic cell line expressing the regulator from a constitutive or inducible promoter. The second step is to subject extracts from transformed and control cells to various metabolic profiling approaches to determine the qualitative and quantitative differences in metabolite accumulation. A more practical approach than monitoring and purifying individual metabolites is to profile hundreds or thousands of small molecules biochemically and to screen for changes in the relative levels of these compounds. By comparing two conditions, a profile of the differences can be obtained that is then used as a blueprint to identify the individual compounds affected (Dias et al., 2003). The immense chemical diversity of small biomolecules makes comprehensive metabolome screens difficult. The lack of unifying principles such as genetic codes that would assist molecule identification, comparison and causal connection is another important challenge (Breitling et al., 2006).
The global study of the structure and dynamics of metabolic networks has been hindered by a lack of techniques that identify metabolites and their biochemical relationship in complex mixtures. Recent advances in ultra-high mass accuracy MS provide two advantages that can enable ab initio determination of metabolic networks: (i) the ability to identify molecular formulae based on exact masses; and (ii) the inference of biosynthetic relationships between masses directly from the mass spectrum. Mass spectrometers with the necessary performance parameters (mass accuracy around 1 ppm and resolution above 100,000 m/Δm) are now within the reach of many researchers and will change the way we think about metabolomics (Breitling et al., 2006). The recent application of Fourier transform ion cyclotron resonance MS (FTICR-MS) to metabolomic analysis suggests a way to tackle the problem. A lower-cost alternative to high-field FTICR-MS, the Orbitrap mass analyser, promises accelerated activity in this area. These two analysers are able to achieve high resolution and mass accuracy in the 1-ppm range for biomolecular samples. In both instruments, the ionized metabolite mixture is trapped in an orbital trajectory. The frequency of the ions' orbit depends on their mass-over-charge ratio and can be measured precisely, which is the basis of the exceptional accuracy. In FTICR-MS, trapping is achieved in a strong magnetic field which exerts a force on the charged particles that is perpendicular to their direction of motion and thus confines them to a circular path. The Orbitrap traps ions without a magnetic field: ions are trapped in a radial electrical field between a central and an outer cylindrical electrode. Theoretically, the resolving power of the FTICR-MS and Orbitrap is sufficiently high to resolve even the most complex metabolite mixtures using direct infusion.
Gas chromatography (GC)-MS or LC-MS is the tool of choice for generating high-throughput data for identification and quantification of small-molecular-weight metabolites (Weckwerth, 2003). Capillary electrophoresis (CE) is an alternative method which separates particular types of
compound more efficiently and can be coupled with MS or other types of detectors. NMR, infrared (IR), ultraviolet (UV) and fluorescence spectroscopy can be used as alternative means of detection, often in parallel with MS (Weckwerth, 2003). TOF MS technology has also been used in metabolite analysis and provides a means of high sample-throughput. In the end, a combination of methods enables analysis of a broad range of metabolites.
NMR is a spectroscopic technique that exploits the magnetic properties of the atomic nucleus (Macomber, 1998). In NMR, the sample is immersed in a strong external magnetic field and transitions between the nuclear magnetic energy levels are induced by a suitably oriented radiofrequency field. In theory, any molecule containing one atom with a non-zero nuclear spin (I) is potentially visible by NMR. Considering the isotopes with a non-zero nuclear spin such as ¹H, ¹³C, ¹⁴N, ¹⁵N and ³¹P, all biological molecules have at least one NMR signal. There is wide variation in the sensitivity of the experiment for different nuclei, hence ¹H NMR remains the best choice for metabolite profiling by NMR mainly due to its natural abundance (99.8%) and sensitivity (Moing et al., 2007). The NMR spectrum generally consists of a series of discrete lines (resonances) which are characterized not only by the familiar spectroscopic quantities of frequency (chemical shift), intensity and line shape, but also by relaxation times. Although less sensitive than GC or LC-MS, proton NMR spectroscopy is a powerful complementary technique for the identification and quantitative analysis of plant metabolites either in vivo or in tissue extracts (Krishnan et al., 2005). Typically, 20–40 metabolites have been identified in metabolite profiling of plant extracts and the number of metabolites quantified can be increased with higher field strength (increasing spectral resolution) and by using microprobes for small quantity samples together with cryogenic probe heads (increasing sensitivity). One of the main advantages of ¹H-NMR is that structural and quantitative information can be obtained on numerous chemical species with a wide range of concentration in a single NMR experiment with excellent reproducibility.
HPLC and GC are the most widely used analytical techniques for the separation of small metabolites. GC is used to separate compounds on the basis of their relative vapour pressure and affinities for the stationary phase in the chromatographic column. It offers very high chromatographic resolution but requires chemical derivatization for many biomolecules: only volatile chemicals can be analysed without derivatization. Some large and polar metabolites cannot be analysed by GC. GC tends to give much greater chromatographic resolution than HPLC but has the disadvantage of being limited to compounds that are volatile and heat stable. A big advantage of GC is that it can be easily combined with MS, which greatly increases its utility for multi-component profiling because of its inherent high specificity, high sensitivity and positive peak confirmation (Dias et al., 2003). HPLC is a form of column chromatography used frequently in biochemistry and analytical chemistry. It is used to separate components in a mixture by using a variety of chemical interactions between the substance being analysed (analyte) and the chromatography column. Compared to GC, HPLC has lower chromatographic resolution but it does have the advantage that a much wider range of analytes can potentially be measured.
The generation of reproducible and meaningful metabolomic data requires great care in the acquisition, storage, extraction and preparation of samples (Fiehn, 2002). The true metabolic state of samples must be maintained and additional metabolic activity or chemical modification after collection must be prevented. Depending on the type of sample and the analysis performed, this can be achieved in various ways. The most common strategies are freezing in liquid nitrogen, freeze-drying, and heat denaturation to halt enzymatic activity (Fiehn, 2002). Metabolomic experiments are typically conducted by comparing experimental plants possessing an expected metabolic modification (i.e. because of the introduction of a transgene or exposure to a particular treatment) to control plants. Statistically
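The earlier point about ultra-high mass accuracy MS, that molecular formulae can be assigned from exact masses measured to about 1 ppm, can be made concrete with a small Python sketch. It is a simplified illustration rather than a production method: it works with neutral monoisotopic masses (ignoring adducts and the electron mass), considers only C, H, N and O, applies no chemical valence rules, and the element limits and function names are hypothetical choices made for the example.

```python
from itertools import product

# Monoisotopic masses (Da) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def formula_mass(counts):
    """Neutral monoisotopic mass for a dict such as {"C": 6, "H": 12, "O": 6}."""
    return sum(MONO[element] * n for element, n in counts.items())


def ppm_error(measured, theoretical):
    """Relative mass error in parts per million."""
    return (measured - theoretical) / theoretical * 1e6


def formula_string(counts):
    """Render counts as text, e.g. C6H12O6 (elements with zero count omitted)."""
    return "".join(f"{element}{n}" for element, n in counts.items() if n > 0)


def candidate_formulae(measured, tol_ppm=1.0, max_counts=(10, 20, 3, 10)):
    """Brute-force search of C/H/N/O formulae within tol_ppm of a measured mass.

    max_counts gives assumed upper limits for C, H, N and O respectively.
    """
    hits = []
    for c, h, n, o in product(*(range(m + 1) for m in max_counts)):
        counts = {"C": c, "H": h, "N": n, "O": o}
        mass = formula_mass(counts)
        if mass == 0:
            continue  # skip the empty formula
        if abs(ppm_error(measured, mass)) <= tol_ppm:
            hits.append((formula_string(counts), mass))
    return hits


if __name__ == "__main__":
    # 180.06339 Da is close to the monoisotopic mass of a hexose, C6H12O6.
    for name, mass in candidate_formulae(180.06339):
        print(name, round(mass, 5), f"{ppm_error(180.06339, mass):+.3f} ppm")
```

Widening `tol_ppm` in this sketch quickly increases the number of candidate formulae, which is the practical reason why accuracy in the 1-ppm range matters for formula assignment.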
is accounted for. Map comparisons between closely related species are largely unaffected because most duplications pre-date them. Comparative maps lay the groundwork for asking questions about whether specific linkage blocks or gene arrangements are statistically associated with increased fitness or have a relationship between polyploidy and plant adaptation. For example, comparative linkage mapping and chromosome painting in the close relatives of Arabidopsis have inferred an ancestral karyotype of these species. In addition, comparative mapping to Brassica has identified genomic blocks that have been maintained since the divergence of the Arabidopsis and Brassica lineages (Schranz et al., 2007).
An example: Arabidopsis–tomato comparative map
DEVELOPMENT OF ARABIDOPSIS–TOMATO COMPARATIVE MAP TO DETECT MACROSYNTENY. Fulton et al. (2002) identified over 1000 conserved orthologous sequences (COS) between tomato and Arabidopsis by comparison of Arabidopsis genomic sequence with 130,000 tomato ESTs (representing 27,000 unigenes or approximately 50% of the tomato gene content). For 1025 COS markers developed, 927 were screened against tomato DNA using Southern analysis to classify them as single, low or multiple copy, among which 85% were considered to be single or low copy (> 95% hybridization signal assigned to three or fewer restriction fragments) and 50% matched a gene of unknown function (Gene Ontology classification). A total of 550 COS markers was mapped on to the tomato genome. The size of conserved segments was generally smaller than 10 cM. Results indicated that multiple polyploidization events punctuate the evolution of Arabidopsis and tomato. Distinguishing orthologues from paralogues is difficult due to reciprocal loss of genes and chromosome segments following polyploidization events.
PHYLOGENETIC ANALYSIS OF CHROMOSOMAL DUPLICATION EVENTS TO DETECT MICROSYNTENY. The Arabidopsis genome sequence was used to analyse internal duplication events based on inferred protein matches between 26,028 genes. A total of 34 non-overlapping chromosomal segment pairs were identified consisting of 23,177 (89%) Arabidopsis genes (Bowers et al., 2003b). To relate this alpha duplication to the angiosperm family tree, all duplicated syntenic Arabidopsis gene pairs were compared to individual genes from pine, rice, tomato, Medicago, cotton and Brassica. It was determined whether inferred protein sequences were from duplicated syntenic gene pairs. Arabidopsis genes were more similar to one another than to the heterologous protein in another species.
RELATIVE AGE OF CHROMOSOMAL DUPLICATION EVENTS. It was concluded that the alpha duplication event pre-dated divergence from Brassica about 14.5–20.4 million years ago but post-dated divergence from cotton about 83–86 million years ago. About 50% (49–64%) of Brassica sequences were more similar to one duplicated Arabidopsis sequence than was the other Arabidopsis sequence to its paralogue. Only 6–19% of cotton, rice, pine, etc. sequences clustered internally to the Arabidopsis syntenic duplicates (Bowers et al., 2003b).
POLYPLOID ANCESTRY OF MOST PLANT SPECIES. As more data accumulates, the history of angiosperms emerges as a history of genome-wide duplication followed by massive gene loss (and return to diploidy). Only 30% of Arabidopsis genes have retained syntenic copies in less than 86 million years since the alpha duplication. In contrast, mammals appear to harbour fewer polyploidization events and less cycling of duplicated genes; 70% of human and mouse proteins show conserved synteny after 100 million years of evolution.
3.5.2 Collinearity
Orthology and paralogy
Figure 3.9 shows the concepts of orthology and paralogy. Orthologues and paralogues are two types of homologous sequence.
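The relative-dating argument used above for the alpha duplication rests on a simple comparison among three sequences: the two Arabidopsis duplicates and one homologue from another species. If the homologue is more similar to one duplicate than the duplicates are to each other (it "clusters internally"), the duplication is inferred to pre-date the divergence of the two species; otherwise it is inferred to post-date it. The Python sketch below illustrates only that decision rule. The similarity scores are made-up values (for instance, percentage identities), the function names are hypothetical, and a real analysis would use phylogenetic trees rather than raw pairwise scores.

```python
def duplication_vs_divergence(sim_duplicates, sim_dup1_other, sim_dup2_other):
    """Classify one duplicated gene pair against one homologue from another species.

    sim_duplicates : similarity between the two duplicated genes
    sim_dup1_other : similarity between duplicate 1 and the other-species gene
    sim_dup2_other : similarity between duplicate 2 and the other-species gene
    """
    best_cross = max(sim_dup1_other, sim_dup2_other)
    if best_cross > sim_duplicates:
        # The other-species gene clusters internally to the duplicated pair.
        return "pre-dates divergence"
    return "post-dates divergence"


def fraction_clustering_internally(triples):
    """Share of (sim_duplicates, sim_dup1_other, sim_dup2_other) triples
    in which the other-species gene clusters internally."""
    if not triples:
        raise ValueError("at least one comparison is required")
    internal = sum(
        1 for t in triples
        if duplication_vs_divergence(*t) == "pre-dates divergence"
    )
    return internal / len(triples)


if __name__ == "__main__":
    # Hypothetical similarity scores for four gene families.
    comparisons = [(70, 85, 60), (90, 75, 72), (80, 82, 79), (88, 70, 71)]
    print(fraction_clustering_internally(comparisons))
```

Read against the figures quoted above, a high internally clustering fraction (as reported for Brassica) points to a duplication older than the split from that species, and a low fraction (as reported for cotton, rice and pine) points to a duplication younger than the split.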
gene families and, accordingly, it is often difficult to determine if a gene mapped in the second species is orthologous or paralogous to that in the first species. Fourthly, the collinearity of gene order and content observed at the recombinational map level is often not observed at the level of local genome structure (Bennetzen and Ramakrishna, 2002). Finally, in most early studies, no statistical analysis was used to evaluate whether the presence of a few markers in the same order on two chromosomal segments in two species occurs by chance or is truly significant.
The genome collinearity of several Camelineae and Brassicaceae species has recently been compared to that of A. thaliana by comparative genetic linkage mapping and comparative chromosome painting (Schranz et al., 2007). A comprehensive study identified 21 syntenic blocks that are shared by Brassica napus and A. thaliana genomes, corresponding to 90% of the B. napus genome (Parkin et al., 2005).
Microcollinearity
Using the rice genome sequence as the reference to compare with molecular marker information of other cereals gave a result which indicated many more rearrangements than had been expected from Gale and Devos's (1998) concentric circles model. One such comparison involved more than 2600 mapped sequenced markers in maize among which only 656 putative orthologous genes could be identified (Salse et al., 2004). The comparison of the wheat genetic map with the rice sequence also suggests numerous rearrangements between the two genomes with a high frequency of breakdowns in collinearity (Sorrells et al., 2003). Extensive comparisons have also been made between sorghum and rice (Klein et al., 2003; The Rice Chromosome 10 Sequencing Consortium, 2003). To align the sorghum physical map with the rice map, sorghum BAC clones were selected from the minimum tiling path of chromosome 3. Unique partial sequences were obtained from each BAC clone and could be directly compared with the rice sequence. This approach revealed excellent conservation between the overall structure and gene order of sorghum chromosome 3 and rice chromosome 1 but also indicated several rearrangements. Together, these studies indicate a general conservation of large syntenic blocks within cereals but with many more rearrangements and synteny breakdowns than originally anticipated.
This trend is even more obvious when synteny is analysed at the sequence level. Rearrangements may occur that involve regions smaller than a few centimorgans and would be missed by most recombinational mapping studies. Comparative sequence analysis involving large genomic segments can detect these rearrangements. Such analyses reveal the composition, organization and functional components of genomes and provide insight into regional differences in composition between related species. Recently, the sequencing of genomic segments in the cereals has enabled microcollinearity across genes or gene clusters to be investigated. Sequencing of the domestication locus Q in Triticum monococcum revealed excellent collinearity with the bread wheat genetic map (Faris et al., 2003). Following the sequencing of the leaf-rust-resistance locus Rph7 from barley, it was observed that this locus is flanked by two HGA genes. The orthologous locus in rice chromosome 1 consists of five HGA genes. In barley, only four of the five HGA genes are present, one is duplicated as a pseudogene and six additional genes have been inserted in between the HGA genes. These six genes have homologues on eight different rice chromosomes (Brunner et al., 2003). The most striking rearrangement was revealed by the comparison of 100 kb around the Bronze locus of two maize lines. Not only does the retrotransposon distribution differ between the two lines but the genes themselves could also be different (Fu and Dooner, 2002). Comparison of the low molecular weight glutenin locus between T. monococcum and Triticum durum also revealed dramatic rearrangements: more than 90% of the sequence diverged because of retro-element insertions and because different genes are present at this locus (Wicker
et al., 2003). Therefore collinearity can be lost very rapidly between two genomes from the same species.
With the sequencing of long regions, several studies in cereals have demonstrated incomplete microcollinearity at the sequence level. Song et al. (2002) identified orthologous regions from maize, sorghum and two subspecies of rice. It was found that gross macrocollinearity is maintained but microcollinearity is incomplete among these cereals. Deviations from gene collinearity are attributable to micro-rearrangement or small-scale genomic changes such as gene insertions, deletions, duplications or inversions. In the region under study, the orthologous region was found to contain six genes in rice, 15 in sorghum and 13 in maize. In maize and sorghum, gene amplification caused a local expansion of conserved genes but did not disrupt their order or orientation. As indicated by Bennetzen and Ma (2003), numerous local rearrangements differentiate the structures of different cereal genomes. On average, any comparison of a ten-gene segment between rice and a distant grass relative such as barley, maize, sorghum or wheat shows one or two rearrangements that involve genes. A simple extrapolation to the rice genome of about 40,000 genes (Goff et al., 2002), taking roughly 1.5 rearrangements per ten genes, suggests that about 6000 genic rearrangements occurred which differentiate rice from any of the other cereals. Most of these rearrangements appear to be tiny and thus would not interfere with the macrocollinearity observed by recombinational mapping. There are exceptions, however, which include chromosomal arm translocations and movements of single genes to different chromosomes (Bennetzen and Ma, 2003).
As expected, there is a high degree of gene conservation between the two shotgun-sequenced subspecies of rice, japonica and indica, which diverged more than 1 million years ago. On careful inspection, however, narrow regions of divergence can be found in these genomes (Song et al., 2002). These regions correspond to areas of increased divergence among rice, sorghum and maize, suggesting that the alignment of the two rice subspecies might be useful for identifying regions of cereal genomes that are prone to rapid evolution. Similar comparative analyses of Arabidopsis accessions have shown that both the relocation of genes and the sequence polymorphisms between accessions (in both coding and non-coding regions) are common in the Arabidopsis genome (The Arabidopsis Genome Initiative, 2000). Intraspecific violation of collinearity has also been identified in maize (Fu and Dooner, 2002). Han and Xue (2003) also discovered significant numbers of rearrangements and polymorphisms when comparing indica and japonica genomes in rice. The deviations from collinearity are frequently due to insertions or deletions. Intraspecific sequence polymorphisms commonly occur in both coding and non-coding regions. These variations often affect gene structures and may contribute to intraspecific phenotypic adaptations.
Implications of genome collinearity
Genomics would be much simpler if the order of genes were common (syntenic) across the major groups of plants. The usefulness of the collinearity between the genomes of model plants and important crops can be assessed by the number of failures or successes in its exploitation. For example, the analysis of the Arabidopsis sequence provides information that will facilitate the annotation of the rice sequence and likewise sequencing Medicago provides a resource for research on important crop legumes. Furthermore, the effort put into sequencing and annotating the rice genome has also been rewarded, as this annotation will be transferred to related sequences and used repeatedly in the future. The synteny between the monocots will help decipher the structure and function of the more complex genomes. A fully assembled rice sequence allows more accurate assessment of the macro- and microsynteny of rice with other cereals (Xu et al., 2005).
The advent of technologies for mapping genomes directly at the DNA level has made comparative genetic mapping among sexually incompatible species possible. Extensive comparative maps for marker
genes have been constructed for a number of plant taxa, including species in the Poaceae
(rice, maize, sorghum, barley and wheat), Solanaceae (tomato, potato and pepper) and
Brassicaceae (Arabidopsis, cabbages, mustard, turnip and rape). As a result, the concept of
a single genetic or ancestral map for all grasses, with species-specific modifications, is
emerging (Moore et al., 1995). The extensive collinearity of wheat, rye, barley, rice and
maize suggests that it may be possible to reconstruct a map of the ancestral cereal genome.
These conserved gene orders and the possibility of sharing DNA probes and PCR primers
across species will greatly extend the power of mapping analysis by facilitating the
molecular analysis of the corresponding chromosomal regions in different species and
allowing information, and perhaps DNA sequences and genes, to be transferred quickly and
efficiently between different species.
The challenge of finding which map, sequence and eventually functional genomic information
from one species can be accessed, compared and exploited across all plant species will
require the identification of a subset of plant genes that have remained relatively stable
in both sequence and copy number since the radiation of flowering plants from their last
common ancestor. Identification of such a set of genes would also facilitate taxonomic and
phylogenetic studies in higher plants that are presently based on a very small set of
highly conserved sequences, such as those of chloroplast and mitochondrial genes. The
conserved orthologue set of markers, identified computationally and experimentally, may
further studies on comparative genomics and phylogenetics and elucidate the nature of
genes conserved throughout plant evolution.
Completed genome sequences provide templates for the design of genome analysis tools in
orphan species lacking sequence information. For example, Feltus et al. (2006) designed
384 PCR primers to conserved exonic regions flanking introns using sorghum and millet EST
alignments to the rice genome. These conserved-intron scanning primers (CISP) amplified
single-copy loci with 37–80% success rates; i.e. sampling most of the approximately
50 million years of divergence among grass species. When evaluating 124 CISPs across rice,
sorghum, millet, Bermuda grass, teff, maize, wheat and barley, about 18.5% of them seemed
to be subject to rigid intron size constraints that were independent of per-nucleotide DNA
sequence variation. Likewise, about 487 conserved non-coding sequence motifs were
identified in 129 CISP loci. As pointed out by Feltus et al. (2006), CISP provides the
means to effectively explore poorly characterized genomes for both polymorphism and
non-coding sequence conservation on a genome-wide or candidate gene basis and also provides
anchor points for comparative genomics across a diverse range of species. After the whole
genomes of the major food crops have been sequenced, plant breeders will be able to access
new gene tools that will facilitate the selection of outstanding individuals characterized
by resistance to biotic and abiotic stresses and good seed quality, thus enabling breeders
to produce new cultivars in addition to those currently available.
As a fundamental tool in biology, comparative analysis has been extended from being
focused on a specific field to biology as a whole. With the growing availability of
phenotypic and functional genomic data, comparative paradigms are now also being extended
to the study of other functional attributes, most notably gene expression. Microarray
techniques present an alternative method of studying differences between closely related
genomes. Advances in microarray-based approaches (see Section 3.6) have enabled the main
forms of genomic variation (amplifications, deletions, insertions, rearrangements and
base-pair changes) to be detected using techniques that can easily be undertaken in
individual laboratories using simple experimental approaches (Gresham et al., 2008).
Tirosh et al. (2007) reviewed recent studies in which comparative analysis was applied to
large-scale gene expression databases and discussed the central principles and challenges
of such approaches. As different functional properties often co-evolve and complement one
another, their combined analysis reveals additional insights. Unlike sequence-based
genetic map information
however, most functional properties are condition-dependent, a property that needs to be
accounted for during interspecies comparisons. Furthermore, functional properties often
reflect the integrated function of multiple genes, calling for novel methods that allow
network-centred rather than gene-centred comparisons. Finally, one of the main challenges
in comparative analysis is the integration of different data types, which is becoming
particularly important as additional data types are being accumulated. The lack of
appropriate descriptors and metrics that succinctly represent the new information
originating from genomic data is one of the roadblocks on this path. Galperin and Koller
(2006) outlined recent trends in comparative genomic analysis and discussed some new
metrics that have been used. This issue is related to the ontology concept and is
discussed in detail in Chapter 14.

3.6 Array Technologies in Omics

It is widely believed that thousands of genes and their products (i.e. RNA and proteins)
in any given living organism function in a complicated and orchestrated manner. However,
traditional methods in molecular biology generally work on a 'one gene in one experiment'
basis, which means that the throughput is very limited and the whole picture of gene
function is difficult to obtain. In the late 1990s, a new technology, known as a biochip
or DNA microarray, attracted great interest among biologists. This technology promised to
monitor the whole genome on a single array so that researchers would have a better picture
of the interactions among thousands of genes at the same time.
Various terminologies have been used in the literature to describe this technology; for
DNA microarrays these include, but are not limited to, biochip, DNA chip, DNA microarray
and gene array. Affymetrix, Inc. owns a registered trademark, GeneChip, which refers to
its high-density, oligonucleotide-based DNA arrays. However, in some articles appearing in
professional journals, popular magazines and on the Internet, the term 'gene chip(s)' has
been used as a general terminology that refers to DNA microarray technology. Depending on
the type of molecules that are arrayed, microarrays can also be based on proteins, tissues
or carbohydrates.
An array is an orderly arrangement of samples. It provides a medium for matching known and
unknown molecular samples based on base-pairing (i.e. A-T and G-C for DNA; A-U and G-C for
RNA) or hybridization and automating the process of identifying the unknowns. From its
origin as a new technique for large-scale DNA mapping and sequencing and initial success
as a tool for transcript-level analyses, microarray technology has spread into many areas
by adapting the basic concept and combining it with other techniques. Microarray-based
processes, either mature or under development, include transcriptional profiling,
genotyping, splice-variant analysis, identification of unknown exons, DNA structure
analysis, chromatin immunoprecipitation (ChIP)-on-chip, protein binding, protein–RNA
interaction, chip-based comparative genomic hybridization, epigenetic studies, DNA
mapping, re-sequencing, large-scale sequencing, gene/genome synthesis, RNA/RNAi synthesis,
protein–DNA interaction, on-chip translation and universal microarrays (Hoheisel, 2006).
In this section, the basic procedures of arraying will be introduced and several major
microarray technologies and platforms will be briefly described. The two volumes of DNA
Microarrays (Kimmel and Oliver, 2006a, b) provide a comprehensive coverage of all the
related fields from technologies and platforms to data analysis. The reader is also
referred to Zhao and Bruce (2003), Amaratunga and Cabrera (2004), Mockler and Ecker (2004),
Subramanian et al. (2005), Allison et al. (2006), Hoheisel (2006) and Doumas et al. (2007).

3.6.1 Production of arrays

Complementary strands of DNA and nucleic acids in general can pair in a duplex via
non-covalent binding. This fundamental characteristic is used in all DNA array techniques.
Amaratunga and Cabrera (2004), Arcellana-Panilio (2005) and Doumas et al. (2007) describe
the principles of DNA microarray technology and how arrays are prepared and
used. First, two terms related to microarrays, probe and target, should be introduced. The
gene-specific DNA spotted on to the array is referred to as the probe and the sample to be
tested that will hybridize with the probe is referred to as the target. The same probe
spotted on to the array can be repeatedly hybridized with many different targets
(samples). An experiment using a single DNA chip can provide researchers with information
on thousands of genes simultaneously, resulting in a dramatic increase in throughput. In
GeneChips (http://www.affymetrix.com/) the probe array was designed using an optimal set
of oligonucleotides selected using computer algorithms and manufactured using Affymetrix
light-directed chemical synthesis. Fluorescent labels were used for hybridization and
detection and the Affymetrix software suite was used for data analysis and database
management. Figure 3.10 illustrates a flowchart showing
[Fig. 3.10 (image not reproduced). Recoverable labels: EST database or cDNA library; PCR
inserts from EST clones; spotting; RNA 1 and RNA 2 from Treatment 1 and Treatment 2;
hybridization; wash; dry; laser scanning; log-scale scatter plots of Treatment 1 against
Treatment 2 (axes from 1 to 10,000); bar charts of mean fold-increase and mean
fold-decrease compared to 0 h (scale from 1.0 to 10) for Treatment 1 and Treatment 2.]
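The probe and target terminology introduced above, together with the base-pairing rule on
which all DNA arrays rely, can be sketched in a few lines of Python. This is a deliberately
idealized illustration and not a hybridization model: it treats a target as hybridizing to
a probe only when the target contains a perfect reverse complement of the probe, and the
probe sequences and gene names (geneA, geneB) are invented for the example.

```python
# Watson-Crick pairing for DNA (A-T and G-C), as described in the text.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def hybridizes(probe: str, target: str) -> bool:
    """Idealized test: the target contains a perfect complement of the probe.

    Real hybridization tolerates mismatches and depends on temperature and
    buffer; none of that is modelled here.
    """
    return reverse_complement(probe) in target.upper()


# Hypothetical array content: spot name -> probe sequence (illustrative only).
array_probes = {
    "geneA": "ATGGCGTACGTTAGC",
    "geneB": "TTGACCGGATATCCA",
}


def detect(target: str) -> list[str]:
    """Return the names of the spots to which a single target would bind."""
    return [name for name, probe in array_probes.items()
            if hybridizes(probe, target)]


# The same probes can be reused against many different targets (samples).
sample = "CC" + reverse_complement(array_probes["geneA"]) + "TT"
print(detect(sample))
```

Because the probes stay fixed while the targets change, the same dictionary of probes can
be queried with any number of samples, which mirrors the reuse of one array design across
many hybridizations.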
numbers of identical DNA copies can be generated by growing them in bacteria.
The DNA spots on a microarray are produced either by synthesis in situ or by deposition of
the pre-synthesized product. DNA synthesis in situ methods have largely been within the
purview of commercial companies. In this method, 20–25-bp long gene-specific
oligonucleotides are generated in situ on a silicon surface by combining a standard DNA
synthesis protocol with phosphoramidite reagents modified with photolabile 5'-protecting
groups (Doumas et al., 2007). The activation for oligonucleotide elongation is achieved
using a mask (Affymetrix; http://www.affymetrix.com) or maskless (NimbleGen;
www.nimblegen.com) method. Alternatively, the reagents can be delivered to each spot using
ink-jet technology (Agilent; http://www.agilent.com). Ongoing research and development
efforts ensure the optimum design of the DNA content and continued technological
advancements enable the production of increasingly higher-density arrays.

Array content

The choice of DNA type to print is fundamental. The sequence of the cDNA could be several
hundred to a few thousand base pairs long. The DNA spotted on oligonucleotide arrays
consists of synthesized chains of oligonucleotides corresponding to part of a known gene
or putative ORF; each oligonucleotide is usually about 25–70 bp long. In an
oligonucleotide array, a gene is generally represented by several different
oligonucleotides and they are carefully chosen for maximal specificity. Longer stretches
of DNA such as those obtained from PCR of cDNA clones produce robust hybridization signals
but less specificity. Short oligonucleotides (24–30 nt) have greater discrimination and
are also suitable for assessing single-nucleotide changes. Long oligonucleotides
(50–70 nt) afford an excellent compromise between signal strength and specificity and
their use has increased among academic core facilities (Arcellana-Panilio, 2005). Choosing
oligos corresponding to the 3' untranslated regions (3'UTR) increases the likelihood of
their being specific and designing oligos close to the 3' end might also boost signal
intensity.

Slide substrates

Glass microscope slides are the solid support of choice and they should be coated with a
substrate that favours binding of the DNA. Development of substrates on atomically flat
slide surfaces and minimum background for higher signal-to-noise ratios has contributed to
the improvement of data quality (Arcellana-Panilio, 2005). Different versions of silane,
amine, epoxy and aldehyde substrates which attach DNA by either ionic interaction or
covalent bond formation are commercially available.

Arrays and spotting pins

The physical process of delivering the DNA to pre-determined coordinates on the array
involves spotting pens or pins carried on a print head that is controlled in three
dimensions by gantry robots with sub-micron precision. A total of 30,000 features of 90-µm
diameter can easily be spotted on to a 25 × 75-mm slide with a maximum spotting density of
over 100,000 features per slide. There are several DNA arraying technologies, including
high speed robotic printing of DNA fragments on glass (usually PCR amplified cDNAs), high
speed robotic printing of long oligonucleotides (70-mers; Agilent technology and many
academic facilities), synthesis of oligonucleotides (25-mers) on microchips using
photolithographic masks (Affymetrix GeneChips) and synthesis of oligonucleotides
(25–70-mers) on microchips using maskless aluminium mirrors (NimbleGen GeneChips).
Improvements in arraying systems have included shorter printing times and longer periods
of walk-away operation. Arrayers are invariably installed within controlled-humidity
cabinets to maintain an optimum environment for printing.

3.6.2 Experimental design

Careful experimental design is required to determine the type of array to run; how
many replicates to use; and which samples will be hybridized to obtain meaningful data
amenable to statistical analysis, upon which sound conclusions can be drawn. A biological
question must first be framed and a microarray platform then chosen, followed by a
decision on biological and technical replicates and the design of a series of
hybridizations.
Microarray experimental design is usually governed by the aim of the experiment. An
important aspect of experimental design is deciding how to minimize variation, which can
be thought of as occurring in three layers: biological variation, technical variation and
measurement error. Replication is the easy answer to dealing with variation. To make the
best use of available resources, it is important to know what to replicate and how many
replicates to apply. Hybridization of two samples to the same slide is made possible by
labelling each sample with chemically distinct fluorescent tags. This also provides the
opportunity to make direct comparisons between samples of primary interest
(Arcellana-Panilio, 2005). Using a common reference becomes more efficient when a large
number of samples need to be compared. When an experiment is testing the effect(s) of
multiple factors, a well-thought-out design is extremely critical so that resources are
not wasted on eventually useless comparisons.

to 10–25 µg total RNA for cDNA spots and long oligonucleotide arrays. In some
circumstances it becomes necessary to amplify the RNA in the sample to obtain adequate
amounts for labelling and hybridization to an array.
To prepare the labelled sample, the first step is to purify mRNA from total cellular
contents. There are several challenges involved: (i) mRNA accounts for only a small
fraction (less than 3%) of all RNA in a cell, so isolating mRNA in sufficient quantity for
an experiment (1–2 µg) can be a challenge. Common mRNA isolation methods take advantage of
the fact that most mRNAs have a poly-adenine, poly(A), tail. These poly(A) mRNAs can be
purified by capturing them using complementary oligodeoxythymidine (oligo(dT)) molecules
bound to a solid support such as a chromatographic column or a collection of magnetic
beads. (ii) The more heterogeneous the cells, the more difficult it is to isolate mRNA
specific to the study. (iii) Captured mRNA degrades very quickly and the mRNA has to be
immediately reverse-transcribed into more stable cDNA (for cDNA microarrays). The reverse
transcription reaction usually starts from the poly(A) tail of the mRNA and moves toward
its head; such a reaction is described as oligo(dT)-primed.
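The quantities quoted for mRNA isolation (mRNA making up less than 3% of cellular RNA, and
1–2 µg of mRNA needed per experiment) imply a minimum amount of total RNA to start from.
The short Python sketch below works through that arithmetic. The function name and the
`recovery` parameter (the fraction of mRNA actually recovered by poly(A) capture) are
assumptions introduced for illustration and do not come from the text; because 3% is an
upper bound on the mRNA fraction, the results should be read as lower bounds.

```python
def total_rna_needed_ug(mrna_target_ug: float,
                        mrna_fraction: float = 0.03,
                        recovery: float = 1.0) -> float:
    """Estimate the total RNA (in µg) needed to obtain a given mass of mRNA.

    mrna_fraction: share of total RNA that is mRNA (text: less than 3%).
    recovery: hypothetical capture efficiency of the poly(A) purification,
              between 0 and 1 (1.0 means no losses; an assumption).
    """
    if not 0 < mrna_fraction <= 1:
        raise ValueError("mrna_fraction must be in (0, 1]")
    if not 0 < recovery <= 1:
        raise ValueError("recovery must be in (0, 1]")
    return mrna_target_ug / (mrna_fraction * recovery)


# Lower bounds for the 1-2 µg range, assuming loss-free capture.
for target in (1.0, 2.0):
    print(f"{target:.0f} µg mRNA -> at least "
          f"{total_rna_needed_ug(target):.1f} µg total RNA")

# With a hypothetical 50% recovery the requirement doubles.
print(f"{total_rna_needed_ug(2.0, recovery=0.5):.1f} µg total RNA")
```

Under these assumptions, even the smaller mRNA target calls for several tens of micrograms
of total RNA, which is consistent with the remark that RNA amplification is sometimes
necessary.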
3.6.4 Labelling
coloured fluor is used for each sample so that the two samples can be differentiated on
the array.
The cDNA or mRNA can be labelled either directly or indirectly. In the direct labelling
procedure, a fluorescently labelled nucleotide is incorporated into the cDNA product as it
is being synthesized. With this method, a difference in the steric hindrance conferred by
different label moieties causes some labelled nucleotides to be more efficiently used than
others, producing a dye bias in which one sample is labelled at a higher level overall
than the other. Cyanine 3 (Cy3) and cyanine 5 (Cy5) are large molecules that reduce
reverse transcriptase efficiency of long transcripts and certain sequences. Cy3-nucleotide
tends to be incorporated at a higher frequency than Cy5 although this does not necessarily
translate into a better labelled target. To prevent the dye bias, the indirect labelling
approach was developed where RNA is reverse transcribed in the presence of an amino
allyl-modified nucleotide that enables the chemical coupling of fluorescent labels after
the cDNA is synthesized. If the coupling reaction goes to completion, the frequency of
labelling becomes independent of the fluorophore (Arcellana-Panilio, 2005).
The labelled sample is the target for the experiment. The number of fluor molecules that
label each cDNA depends on its length and also possibly its sequence composition. For an
RNA sample, either total RNA or mRNA is typically isolated and labelled using a first
strand cDNA synthesis step either by direct incorporation of a fluorescent dye or by
coupling the dyes to a modified nucleotide. For non-expression-based experiments, DNA
rather than RNA can be labelled and hybridized to the array.

3.6.5 Hybridization and post-hybridization washes

The array holds hundreds or thousands of spots, each of which contains a different DNA
sequence. If a sample contains a cDNA whose sequence is complementary to the DNA on a
given spot, that cDNA will hybridize to the spot where it will be detectable by its
fluorescence. In this way, every spot on an array is an independent assay for the presence
of a different cDNA. Hybridization is achieved by pouring the labelled sample on to the
array and allowing it to diffuse uniformly. It is then sealed in a hybridization chamber
and incubated at a specific temperature for a period of time sufficient to allow
hybridization reactions to complete. The experimental conditions should ensure that all
areas of the array are exposed to a uniform amount of labelled sample throughout the
hybridization.
Hybridizations are processed directly on the slides after target synthesis. The
hybridization step is literally where everything comes together, i.e. the labelled
molecules find their complementary sequences on the array and form double-stranded hybrids
which are strong enough to withstand stringent washes. As in the hybridization of
classical Southern and northern blots, the objective is to favour the formation of hybrids
and the retention of those which are specific. Hybridization conditions depend on the
length of probes arrayed on the slide and need to be extensively tested before analysis.
As an example, probe melting temperatures range from 42 to 70°C depending on the nature of
the buffer: the presence of formamide exerts a positive effect on buffer stringency in
Denhardt-type buffers which are used at 42°C, whereas Sarkosyl-based buffers are commonly
used around 70°C. Exogenous DNA (e.g. salmon sperm and Cot-1 DNA) reduces background by
blocking areas of the slide with a general affinity for nucleic acid or by titrating out
labelled sequences that are non-specific. Denhardt's reagent (containing equal parts of
Ficoll, polyvinylpyrrolidone and bovine serum albumin) is also used as a blocking agent.
Detergents such as SDS reduce surface tension and improve mixing while helping to lower
background at the same time. Temperature is an important factor that can be manipulated
during the hybridization and post-hybridization washes of microarrays and here much can be
learned from
what has already been established for northern or Southern blots. For microarrays to be
useful as a means of quantifying expression, the target has to be present in limiting
concentrations and the probe must be present in sufficient excess so as to remain
virtually unchanged even after hybridization (Arcellana-Panilio, 2005). One important
feature of fluorescence detection is that it allows the simultaneous hybridization of two
to several targets that have been differently labelled.
The quality of the hybridization can be assessed by spotting the array with a set of
hybridization control genes, spiking the labelled sample with a known amount of these
controls prior to exposure to the array and verifying that these control genes are indeed
showing up as having been hybridized.

3.6.6 Data acquisition and quantification

Once the wet phase (e.g. slide hybridization and washing off any excess labelled sample)
is completed, the signal from each of the hybridized targets can be captured, that is, the
array must be scanned to determine how much of each labelled sample is bound to each spot.
The signal is acquired using array scanners, either a charge-coupled device (CCD) or a
confocal microscope, typically equipped with lasers to excite the fluorophores at a
specific wavelength and photomultiplier tubes to detect the emitted light. Spots with more
bound sample will have more reporters and will therefore fluoresce more intensely.
Whatever the scanner resolution, the microarray spot diameter needs to be five to ten
times larger than the scanner resolution, which can be as little as 5 µm for the most
recent models. The end-product of a microarray experiment is a scanned grey-scale image
whose intensity measurements range from 0 to 2¹⁶. The image is usually stored in a 16-bit
tagged image file format (TIFF, for short). The most basic scanner models offer excitation
and detection of the two most commonly used fluorophores (Cy3 and Cy5) whereas higher-end
models enable excitation at several wavelengths and offer dynamic focus, linear dynamic
range over several orders of magnitude and options for high-throughput scanning. The
objective of the scanning procedure is to obtain the best image, where the best is not
necessarily the brightest (to avoid over-saturation beyond the signal range) but is the
most faithful representation of the data on the slide.
Although it is only supposed to pick up the light emitted by the target cDNAs bound to
their complementary spots, the scanner will inevitably pick up light from various other
sources, including the labelled sample hybridizing non-specifically to the glass slide,
residual (unwanted) labelled probe adhering to the slide, various chemicals used in
processing the slide and even the slide itself. This extra light creates background
signals. Once signal and background values are clearly defined, which is specific to each
experiment, data can be extracted from the image by counting the pixels within each probe
and background area and recording this in a computer-readable format.
Data extraction from the image involves several steps (Arcellana-Panilio, 2005):
(i) gridding or locating the spots on the array; (ii) segmentation or assignment of pixels
either to foreground (true signal) or background; and (iii) intensity extraction to obtain
a new value for foreground and background associated with each spot. Subtracting the
background intensity from the foreground yields the true spot intensity which can be used
as an approximation of relative gene expression.

3.6.7 Statistical analysis and data mining

Huge data sets are generated by microarray experiments. For example, 20 hybridization
experiments with the Arabidopsis GeneChip generate a set of 2,624,000 data points
(8200 genes × 16 oligonucleotides × 20 hybridizations). Such a massive amount of data
prohibits any manual treatment. Also, experimental variability is generally significant
and has to be managed in order
to exploit the data properly. Allison et al. (2006) examined five key components of
microarray analysis: (i) design (the development of an experimental plan to maximize the
quality and quantity of information obtained); (ii) pre-processing (processing of the
microarray image and normalization of the data to remove systematic variation. Other
potential pre-processing steps include transformation of data, data filtering and, in the
case of two-colour arrays, background subtraction); (iii) inference (testing statistical
hypotheses, e.g. which genes are differentially expressed); (iv) classification
(analytical approaches that attempt to divide data into classes with no prior information
or into predefined classes); and (v) validation (the process of confirming the veracity of
the inferences and conclusions drawn in the study).
Reproducible and reliable microarray results can only be achieved through quality control
starting with data generation. Good laboratory proficiency and appropriate data analysis
practices are essential (Shi et al., 2008). Numerous software packages, both free and
commercial, are available for quantifying microarray data. Typically, the interpreted
array data will highlight a relatively small number of spots that deserve further
investigation. Alternatively, the overall pattern of profiling can be used as a
fingerprint to characterize specific phenotypes.
The quantified data from the images are typically obtained as tab-delimited text files.
First, dust artefacts, comet tails and other spot anomalies should be identified and
flagged so that they will not enter the analysis. Pre-processing the quantified data
before formal analysis includes the flagging of ambiguous spots with intensities lower
than a threshold defined by the mean intensity plus two standard deviations of supposedly
negative spots (no DNA, buffer and/or non-homologous DNA controls).
Interpreting the data from a microarray experiment can be challenging. Quantification of
the intensities of each spot is subject to noise from irregular spots, dust on the slide
and non-specific hybridization. Deciding the intensity threshold between spots and
background can be difficult, especially when the spots fade gradually around their edges.
Detection efficiency might not be uniform across the slide, leading to excessive red
intensity on one side of the array and excessive green on the other.
Data normalization addresses systematic errors that can skew the search for biological
effects. One of the most common sources of systematic error is the dye bias introduced by
the use of different fluorophores to label the target. Print-tip differences can also lead
to sub-grid biases within the same array while scanner anomalies can cause one side of an
array to seem brighter than the other. Normalization across multiple slides to remove bias
can be accomplished by scaling the within-slide normalized data. In practice, examining
the box plots of the normalized data of individual arrays for consistency of width can
usually indicate whether normalization across arrays is required.
Spatial plots can locate background problems and extreme values. The shape and spread of
scatter plots and the height and width of box plots give an overall view of data quality
that can give clues about the effects of filtering and different normalization strategies.
Gene expression profiling will be taken as an example for the rest of this section.
Clustering algorithms are means of organizing microarray data according to similarities in
expression patterns. In this case, co-expressed genes are assumed to be co-regulated, and
a logical follow-up to this analysis is the search for regulatory motifs and the common
upstream or downstream factors that may tie these co-expressed genes together. Treatments
can be clustered based on similarity in gene expression profiles. Genes can be clustered
based on similarity in expression patterns across profiles. Two mathematical approaches
are often used: hierarchical or k-means clustering (Stanford) and self-organizing maps
(SOMs) (Whitehead Institute).
A strategy for identifying differentially expressed genes is to compute the t-statistic
and correct for multiple testing using adjusted P-values. The B-statistic, derived using
an empirical Bayes approach, has
been shown in simulations to be far superior to either log ratios or the t-statistic for
ranking differentially regulated genes (Lonnstedt and Speed, 2002). The twofold change
continues to be a benchmark for researchers perusing lists of microarray data in order to
validate the data by PCR, which can provide independent confirmation of the expression
patterns of specific genes. However, fold change has become more of a secondary criterion
for the selection of candidates for follow-up from a list of genes ranked according to
more reliable measures of differential expression (Arcellana-Panilio, 2005). After
preliminary data mining and statistical analysis, validation and follow-up experiments can
be designed.
There are many examples of the array technologies described in this section. In yeast,
260,000 oligonucleotides corresponding to all the genes in yeast have been synthesized on
to a 1.28 cm² chip. These chips have allowed the identification of genes expressed in
various mutants under different culture conditions or at different stages of growth.
Numerous genes of unknown function have thus been recognized, regulated in a manner
similar to or opposite to that of genes of known function; transcription of the genome is
thus incorporated into a vast combinatorial network. In plants, Affymetrix has
commercialized microchips to evaluate the expression of Arabidopsis genes, allowing the
identification of genes that are active during pathogen infection or during treatment with
herbicides, fungicides or insecticides. This also facilitates the determination of which
genes are transcribed in which tissues under which conditions or during which stages of
development. Commercial microarrays are also available from Affymetrix for several other
crop plants such as maize and tomato.

3.6.8 Protein microarrays and others

A protein chip or microarray is a piece of glass on which different molecules of protein
have been affixed at separate locations in an ordered manner to form a microscopic array.
Compared with DNA microarrays, the development of protein-based approaches poses technical
problems for several reasons (Bernot, 2004): (i) proteins consist of 20 distinct amino
acids while there are only four bases in DNA; (ii) depending on their amino-acid
composition, proteins may be hydrophilic, hydrophobic, acidic or basic (while DNA is
always hydrophilic and negatively charged); and (iii) proteins are often
post-translationally modified (by glycosylation, phosphorylation, etc.).
Although detection of protein microarrays can be carried out using general detection
methods as described above, the problem is that protein concentrations in a biological
sample may be many orders of magnitude different from that of mRNAs. Therefore, protein
array detection methods must have a much larger range of detection. The preferred method
of detection is currently fluorescence detection. Fluorescent detection is safe, sensitive
and can produce high resolution. The fluorescent detection method is compatible with
standard microarray scanners but some minor alterations to software may need to be made.
Protein microarrays have been made in the following manner (Macbeath and Schreiber, 2000;
Bernot, 2004). Proteins are deposited on to a support and subsequently fixed to it. Thus
1600 distinct proteins may be arranged per cm². These arrays are ordered so that it is
known which protein is represented by any given spot. The microarrays are then incubated
with other ligands (fluorescently labelled) and the result of the hybridization is
analysed by confocal microscopy (it is also possible to employ radioactively labelled
ligands). The protein recognized may be identified using the signal localization data
obtained. The intensity of the signal obtained is proportional to the level of
ligand–protein interaction.
Apart from the most frequently used DNA and protein microarrays discussed above, other
microarrays include those built using tissues (cells) and carbohydrates. Similar to other
microarrays, a tissue chip or microarray is a piece of glass on which different tissues
have been affixed, while
[Fig. 3.11 (image not reproduced). Recoverable labels: ddTTP, ddGTP, ddCTP, ddATP; D-form
5' gene-specific primer; L-form zip-code; genomic DNA; base discrimination by primer
extension in solution; molecule separation on zip-code array; hybridization to L-form
zip-code array; primer extension and labelling of Sample 1 and Sample 2 (AAAAAA-3').
Applications shown: genotyping; protein–dsDNA interactions; protein–ssDNA interactions;
epigenetic studies; protein selection or attachment by aptamers; CGH; transcriptional
profiling.]
Fig. 3.11. The concept of universal microarray. dsDNA, double-stranded DNA; ssDNA,
single-stranded DNA; CGH, comparative genomic hybridization.
data have led to the advent of high-density DNA oligonucleotide-based whole-genome
tiling microarrays (WGAs) which can be employed to interrogate a full genome's worth of
sequence data in a single experiment. This technology allows a more complete
understanding of an organism's genomic content and should provide a dramatic improvement
in the understanding of numerous biological processes. WGAs comprise relatively short
(< 100-mer) oligonucleotide features. Furthermore, they can be created with > 6,000,000
discrete features, each comprising millions of copies
2006). Perlegen designed SNP-discovery arrays to include all possible SNP variations
with multiple levels of redundancy.
Edwards et al. (2008) developed a microarray platform for rapid and cost-effective
genetic mapping using rice as a model. In contrast to methods employing genome tiling
microarrays for genotyping, the method is based on low-cost spotted microarray
production, focusing only on known polymorphic features. A genotyping microarray was
produced comprising 880 SFP elements derived from indels identified by aligning genomic
sequences of the japonica cultivar Nipponbare and the indica cultivar 93-11. The SFPs
were experimentally verified by hybridization with labelled genomic DNA prepared from
the two cultivars. Using the genotyping microarrays, high levels of polymorphism were
found across diverse rice accessions.
In soybean, the GoldenGate assay, which is capable of multiplexing from 96 to 1536 SNPs
in a single reaction, has been tested to determine the success rate of converting
verified SNPs into working arrays (Hyten et al., 2008). Allelic data were successfully
generated for 89% of the 384 SNP loci when it was used in three recombinant inbred line
(RIL) mapping populations. Using the same system, two panels of 1536 SNP markers have
been developed in maize through collaboration between Cornell, CIMMYT and Illumina, one
with SNPs developed from candidate genes relevant to drought tolerance and the other
with SNPs randomly distributed on the maize genome (Yan et al., 2009).
4
Populations in Genetics
and Breeding
these crosses are then genetically analysed. The mating design is as follows:

    Parent    P1    P2    P3    ...    Pn
    P1
    P2
    P3
    ...
    Pn

A full diallel analysis will include all one-way hybrids and parents while a partial or
incomplete diallel analysis may contain just half the diallel without reciprocals or
parents. Diallel crosses are usually used to estimate general combining ability for the
parents and special combining ability for specific crosses, providing information for
producing hybrids.

NORTH CAROLINA DESIGNS. There are three North Carolina designs, denoted by NCI, NCII,
and NCIII. These designs are most often used in cross-pollinated crops and to study
broad-based populations. Their use in self-pollinated crops usually involves many inbred
lines that can reasonably be considered to represent a large, reference population, e.g.
late maturing soybean adapted to a geographical belt of USA. To simplify the
description, however, inbreds are taken as an example.

NCI: two inbred lines are crossed to produce F2, and then some individuals are randomly
selected from the F2 population as males to intermate with other randomly selected
females. The offspring derived from this intermating will be used in genetic analysis.
The design can be described as below:

    Males    1    2    3    ...

females, to produce crosses of all possible combinations.

    Cultivar    1    2    3    ...    n1
    n1+1
    n1+2
    n1+3
    ...
    n1+n2

NCIII: n individuals are selected from an F2 population to backcross with two parents,
P1 and P2:

    F2 individual    1    2    3    ...    n
    P1
    P2

TRIPLE TESTCROSS (TTC) AND SIMPLIFIED TTC (sTTC). TTC is an extension of NCIII, where n
individuals (n > 20) are selected from an F2 population to backcross with both parental
lines, P1 and P2, and the F1 (P1 × P2):

    F2 individual    1    2    3    ...    n
    P1
    P2
    F1

In sTTC: n cultivars or strains (n > 20) are selected from the germplasm pool to cross
with two cultivars or strains, PH and PL, which show extreme phenotypes (with the
highest and lowest phenotypic values), respectively.

    Strain    1    2    3    ...    n
    PH
    PL
Mather and Jinks (1982) or sections discussing quantitative genetics in plant breeding
texts for details regarding the genetic information that can be derived from the study
of hybrids or families formed using each of these mating designs. Some of these designs
have also been used in genetic mapping of quantitative traits.

Inbreeding populations

This type of population includes segregating populations such as F2 and F3 populations
which are derived from selfing or sibmating an F1 hybrid, BC populations that are
derived from backcrossing the F1 to one of the parents or advanced BC populations
derived by multiple backcrossings of the F1 to one of the parents.
Populations used in genetic studies and plant breeding can be derived from any of the
mating designs discussed above. For breeding purposes, the sizes of populations that
will be maintained can be much smaller than those used in genetic studies because
breeders only need to retain the populations with desirable traits. For genetic studies,
however, geneticists need to maintain as large a population as possible and all types of
segregates including those with undesirable traits.

4.2 Doubled Haploids (DHs)

Cells or plants that contain a single complete set of chromosomes are called haploid.
Haploids derived from diploids are called monoploid, while haploids derived from
polyploids are called poly-haploid. Diploids produced from chromosome doubling of
haploids are called doubled or double haploid (DH). The DH approach has several
advantages that make it useful in genetics and plant breeding. DHs can be produced via
in vivo and in vitro systems. Haploid embryos are produced in vivo by parthenogenesis,
pseudogamy or chromosome elimination after extensive crossing. The haploid embryo is
rescued and cultured, and chromosome doubling then produces DHs. The in vitro methods
include gynogenesis (ovary and flower culture) and androgenesis (anther and microspore
culture). Forster et al. (2007) reviewed various approaches for haploid production in
plants. Forster and Thomas (2004) and Szarejko and Forster (2007) reviewed the use of
DHs in genetic studies and plant breeding. Recent reviews on specific crop species are
available for tomato (Bal and Abak, 2007) and nutraceutical species (Ferrie, 2007).

4.2.1 Haploid production

There are several approaches to haploid production. Naturally occurring haploids have
been reported in a number of species including tobacco, rice and maize. In barley, the
hap initiator gene was reported to control haploidy and spontaneous haploids were
recovered at high frequency (Hagberg and Hagberg, 1980), with up to 8% haploid offspring
being recovered when a cultivar that was homozygous for the hap allele was used as the
female parent to cross with other cultivars, but none were produced from the reciprocal
cross. In maize, the indeterminate gametophyte gene (ig) results in a monoploid embryo
either from the sperm cell or the egg cell (Kermicle, 1969). Although DHs can be
recovered from such spontaneous haploids, their frequencies are usually too low for
genetics and breeding purposes.
With the recognition of the importance of DHs in plant breeding, extensive efforts have
been made to induce haploid embryogenesis and increase the frequency at which DHs can be
recovered. The benefits of DHs have already been demonstrated in many research and
breeding programmes. This progress has led to DH cultivars for commercial production and
DH populations for genetics and breeding studies. In barley, over 100 cultivars have
been released and similar numbers of rice and rapeseed DH cultivars have been listed
(Forster and Thomas, 2004). DHs have also been used successfully in recalcitrant species
such as oat (Kiviharju et al., 2005) and rye (Tenhola-Roininen et al., 2006).
Maluszynski et al. (2003) edited a manual presenting a set of protocols for the
production of DH in 22 major crop plant species including four tree species. The manual
contains various protocols and approaches to DH production that have been success-
fully used for different germplasm resources in each species. The protocols describe in
detail all the steps in DH production, from donor plant growth conditions, through in
vitro procedures, media composition and preparation to regeneration of haploid plants
and methods for chromosome doubling. The manual enables the researcher to choose the
most suitable method for production of DH for their particular laboratory conditions and
plant materials, e.g. microspore versus anther culture, wide hybridization or
gynogenesis. The manual also contains information on the organization of a DH
laboratory, basic DH media and associated simple cytogenetic methods for ploidy level
analysis. An excellent overview of haploid induction and the application of doubled
haploids is provided for Brassicaceae, Poaceae and Solanaceae in Haploids in Crop
Improvement II (Biotechnology in Agriculture and Forestry) edited by Palmer et al.
(2005).
There are now five methods generally applicable to the production of haploids in plants
with frequencies that are useful for genetics and breeding programmes (Palmer and
Keller, 2005):

• Extensive hybridization crosses followed by chromosome elimination from one parent of
  a cross, usually the pollination parent.
• Gynogenesis: cultured unfertilized isolated ovules and ovaries of flower buds develop
  embryos from cells of the embryo sac.
• Androgenesis: cultured anthers or isolated microspores undergo embryogenesis or
  organogenesis directly or through intermediate callus.
• Parthenogenesis: development of an embryo by pseudogamy, semigamy or apogamy.
• Inducer-based approach: haploid-inducing lines are used to produce haploids.

Chromosome or genome elimination

Haploid embryos can be produced in plants after pollination by distantly related
species. In most cases, normal double fertilization takes place to form a hybrid zygote
and endosperm. Chromosome or genome preferential or uniparental elimination arises as a
result of certain crosses; fertilization occurs but soon afterwards the genome of one
parent is preferentially eliminated. Haploids can be produced by interspecific
hybridization followed by chromosome elimination. In barley, this extensive
hybridization method consists of crossing cultivated barley, Hordeum vulgare
(2n = 2x = 14) with the wild, diploid cross-pollinated perennial Hordeum bulbosum
(2n = 2x = 14). Most progeny (95%) are barley haploids, while the remainder is made up
by diploid hybrids. This technique, called the bulbosum method, has been extensively
utilized for the production of haploids in barley. Haploids can also be produced in
hexaploid wheat (var. Chinese Spring) by chromosome elimination following hybridization
of wheat with H. bulbosum (both 2x and 4x). A frequency of 13.7% grain set with 2x
bulbosum and 43.7% grain set with 4x bulbosum were obtained (Barclay, 1975). During
formation of the embryo the chromosomes of H. bulbosum are eliminated. The immature
embryos are cultured in vitro and plantlets from these monoploid embryos can be induced
via an efficient chromosome doubling technique to produce fertile flowers bearing
homozygous hexaploid seeds.
The production of embryos as a result of wheat × maize crosses was first reported by
Zenkteler and Nitzsche (1984). Laurie and Bennett (1986) cytologically examined embryos
produced via this system and found maize chromosomes to be preferentially eliminated
during the first three cell divisions, leaving a haploid complement of wheat
chromosomes. This method was used in wheat haploid production and applied with some
success in generating genetic and mapping populations (Laurie and Reymondie, 1991). Mean
frequencies of fertilization, embryo formation, embryo germination and haploid
regeneration of 83, 20, 45 and 8%, respectively have been reported (Chen et al., 1999).
Significant differences in the percentage of embryo germination and haploid regeneration
were observed among crosses suggesting that the efficacy of haploid production could be
improved by
hybrid embryo development in plants: for example, differences in timing of essential
mitotic processes due to asynchronous cell cycles or asynchrony in nucleoprotein
synthesis leading to a loss of the most retarded chromosomes. Other hypotheses propose
the formation of multipolar spindles, spatial separation of genomes during interphase
and metaphase, parent-specific inactivation of centromeres and, by analogy with the
host-restriction and modification systems of bacteria, degradation of alien chromosomes
by host-specific nuclease activity. Gernand et al. (2005) provide evidence for a novel
chromosome elimination pathway in wheat × pearl millet hybrids that involves the
formation of nuclear extrusions during interphase in addition to post-mitotically formed
micronuclei. They found that the chromatin structure of nuclei and micronuclei was
different and heterochromatinization and DNA fragmentation of micronucleated pearl
millet chromatin was the final step during haploidization.
The mechanism of chromosome elimination in Hordeum hybrids was studied by Subrahmanyam
and Kasha (1975) and Bennett et al. (1976) and the following conclusions were drawn:
(i) normal double fertilization occurs in interspecific crosses as confirmed by
cytological study; and (ii) after fertilization there is a gradual and selective
elimination of H. bulbosum chromosomes from nuclei of endosperm as well as embryo cells
so that eventually haploid embryos are produced. A sudden shortage of proteins in the
developing embryo and endosperm and the better ability of vulgare chromosomes to form
spindle attachments relative to bulbosum chromosomes may be responsible for elimination
of the bulbosum chromosomes. Other possible causes such as differences in mitotic cycle,
congression during mitosis, etc. were ruled out by the authors.
It has also been demonstrated that the elimination of bulbosum chromosomes is under
genetic control (Subrahmanyam and Kasha, 1975). The above-mentioned authors used primary
trisomics and monotelotrisomics in crosses with tetraploid H. bulbosum and concluded
that both arms of chromosome 2 and the short arm of chromosome 3 of H. vulgare are
responsible for chromosome elimination, although their effect may be neutralized or
offset if a sufficient dose of bulbosum chromosomes is available.

Ovary culture or gynogenesis

Ovary culture involves production of a haploid individual by culture of unfertilized
ovaries to obtain haploid plants from egg cells or other haploid cells of the embryo
sac; the process is known as gynogenesis. Under the appropriate culture conditions the
unfertilized cell of the embryo sac develops into an embryo by as yet unknown
mechanisms. Haploid plants generally originate from egg cells in most species (in vitro
parthenogenesis) but in some species, e.g. rice, they arise chiefly from the synergids;
in at least Allium tuberosum even antipodal cells produce haploid plants (in vitro
apogamic) (Mukhambetzhanov, 1997).
Gynogenesis may occur either via embryogenesis or plantlet regeneration from callus. In
rice 2-methyl-4-chlorophenoxyacetic acid (MCPA) generally leads to a small amount of
protocorm-like callus formation from which shoots and roots regenerate, while picloram
promotes embryo regeneration. In contrast, sugarbeet usually shows embryo development
while in sunflower embryos regenerate following a callus phase. In general, regeneration
from a callus phase appears, at least for the present, to be easier than direct
embryogenesis.
Generally, gynogenesis has two or more stages and each stage may have distinct
requirements. In rice, two stages, i.e. induction and regeneration, are recognized.
During induction, ovaries are floated on a liquid medium containing low auxin levels and
kept in the dark, while for regeneration they are transferred on to an agar medium
containing a higher auxin concentration and incubated in the light.
Depending on the species, unfertilized ovules, ovaries or flower buds can be cultured.
In some members of the Chenopodiaceae, Liliaceae and Cucurbitaceae, gynogenesis is the
main route to DH production (Palmer and Keller, 2005). Even where anther or microspore
culture is successful, gynogenetic
haploids have been produced, e.g. in barley, maize, rice and wheat.
San Noeum (1976) was the first to demonstrate that gynogenesis can be induced under in
vitro conditions. She obtained gynogenic haploids using an ovary culture of H. vulgare.
Subsequently, success has been obtained with many species, e.g. wheat, rice, maize,
tobacco, petunia, gerbera, sunflower, sugarbeets, onions, rubber, etc. About 0.2–6% of
the cultured ovaries show gynogenesis and one or two plantlets, rarely up to eight,
originate from each ovary.
Embryogenic frequency is low in many cases, but relatively high frequencies have been
reported in some cases (Alan et al., 2003; Martinez, 2003). The rate of success varies
considerably with species and is markedly influenced by explant genotype so that some
cultivars do not respond at all. In rice, japonica genotypes are far more responsive
than indica cultivars. In most cases, the optimum stage for ovule culture is the nearly
mature embryo sac, but in rice ovaries at the free nuclear embryo sac stage are the most
responsive.
The culture response is still genotype dependent (Alan et al., 2003; Bohanec et al.,
2003). Generally, for culture of whole flowers, ovary and ovules attached to placenta
respond better, but in gerbera and sunflower isolated ovules give a better response.
Cold pretreatment (24–48 h at 4°C in sunflower and 24 h at 7°C in rice) of the
inflorescence before ovary culture enhances gynogenesis.
The composition of the culture medium and stage of embryo sac development are important
considerations for successful culture (Keller and Korzun, 1996). Growth regulators are
crucial in gynogenesis and at higher levels they may induce callusing of somatic tissues
and even suppress gynogenesis. Growth regulator requirements seem to depend on species.
For example, in sunflower growth regulator-free medium is the best and even a low level
of MCPA induces somatic calli and somatic embryos. But in rice, 0.125–0.5 mg l⁻¹ MCPA is
optimal for gynogenesis. The sucrose level also appears to be critical; in sunflower 12%
sucrose leads to gynogenic embryo production while at lower levels somatic calli and
somatic embryos were also produced. Ovaries are generally cultured in the light but in
some species at least, e.g. sunflower and rice, incubation in the dark favours
gynogenesis and minimizes somatic callusing; in rice light may lead to the degeneration
of gynogenic pro-embryos.
Ovary culture has two main limitations: (i) it is not successful in all species; and
(ii) the frequency of responding ovaries and the number of plantlets per ovary is
usually low. Therefore, anther culture is preferred over ovary culture; only in those
cases where anther culture fails, e.g. sugarbeet and for male sterile lines, ovary
culture assumes significance.

Anther culture or androgenesis

Anther culture or androgenesis is a process by which a haploid individual develops from
a pollen grain. Anther culture is often the method of choice for DH production in crop
plants (Sopory and Munshi, 1996). Good aseptic techniques are required but the methods
are generally simple and applicable to a wide range of crops (Maluszynski et al., 2003).
In general, haploid plants are generated in vitro from the microspores contained in the
anther and require chromosome doubling treatments. The number of chromosomes in haploid
plants can be doubled either naturally or by colchicine treatment.
The process involved in anther culture is poorly understood. Investigations have been
hampered by the presence of the sporophytic anther wall that prevents direct access to
the microspores contained within. This has become an important issue because although
many species respond to anther culture, responsive genotypes can be a limiting factor
thus making it necessary to study, understand and manipulate the microspore
embryogenesis in order to develop genotype-independent methods (Forster et al., 2007).
Many factors influence the production of anther-culture-derived plants including the
physiological status of the donor plants, pre-treatment of anthers, developmental stage
of the pollen,
to other cereals such as wheat yielded a low frequency of green plants. Although a high
frequency of green plants is produced for most barley crosses, androgenesis still poses
some problems that need to be addressed. There are barley genotypes which are extremely
recalcitrant to microspore division and/or with a high rate of albinism. The rate of
embryogenesis is still low and poorly-developed embryos are formed very frequently. New
methods are needed that reduce the cost of DH production and are effective for all
genotypes.
Future objectives in plant androgenesis include the development of efficient
androgenesis protocols for a wide range of genotypes, a better understanding of the
biological processes involved in the stress pre-treatment, the study of the influence of
different micronutrients on the induction of gametic embryogenesis and possible
gametophytic selection. Identification of genetic loci associated with the anther
culture response process will facilitate the understanding of the mechanisms underlying
androgenesis. Identification and localization of molecular markers linked to the yield
of green plants per anther and the evaluation of their potential use for the prediction
of the anther culture response of genotypes will also help to optimize the production of
DHs.

Semigamy

Semigamy is a form of parthenogenesis and occurs when the nucleus of the egg cell and
the generative nucleus of the germinated pollen grain divide independently, resulting in
a haploid chimera (a plant whose tissues are of two different genotypes). Semigamy is a
type of facultative apomixis in which the male sperm nucleus does not fuse with the egg
nucleus after penetrating the egg in the embryo sac. Subsequent development can give
rise to an embryo containing haploid chimaeral tissues of paternal and maternal origins.
In cotton, the semigametic phenomenon was first reported by Turcotte and Feaster (1963),
who developed the Pima line 57-4 that produced haploid seeds at a high frequency.
Currently semigamy is the only feasible means for production of haploids in cotton
(Zhang and Stewart, 2004).
There are many examples of DH lines developed from cultivars and intra- and
interspecific hybrids between upland cotton (Gossypium hirsutum L.) and American Pima
cotton (Gossypium barbadense L.) using semigamy. The semigametic trait has also been
transferred into different cotton cytoplasms to facilitate rapid replacement of nuclei.
Stelly et al. (1988) proposed a scheme called hybrid elimination and haploid production
system using a cotton strain with semigamy (Se), lethal gene (Le2dav), virescent (v7)
and male sterility or glandless (gl2gl3).
Semigametic lines can produce 30–60% haploids when self-pollinated and about 0.7–1.0%
androgenic haploids when used as female parents in crosses with normal non-semigametic
cottons (Turcotte and Feaster, 1967). A unique feature of semigamy is that the
inheritance of the gene is conveyed by both male and female gametes but expression of
the trait in terms of haploid production occurs only in the female parent. As a
consequence, for example, in reciprocal crosses between SeSe and sese parents, haploids
will be produced only when SeSe or Sese is the female parent.
The results reported by Zhang and Stewart (2004) verified that semigamy in cotton is
controlled by one gene, previously designated Se. The gene functions sporophytically and
gametophytically resulting in an incomplete dominance mode of action. Consistent with
the difference between the two parental isogenic lines, semigametic F2.3 lines had
significantly lower chlorophyll content than non-semigametic F2.3 lines, an observation
that was confirmed by a significant association between haploid production and
chlorophyll content. The Se gene and the gene for reduced chlorophyll content could be
either the same or closely linked.

Inducer-based approach

Haploid inducing lines have been used in maize to produce haploids by development of the
unfertilized egg cells (Eder and
Chalyk, 2002). A haploid induction rate of up to 2.3% was detected by Coe (1959) in
crosses with the inbred line Stock 6. A higher rate (about 6%) was later obtained by
Sarker et al. (1994) and Shatskaya et al. (1994) in progenies of crosses between Stock 6
and Indian and Russian germplasm, respectively. Inducer lines are now available with
haploid seed induction rates of 8–12% in temperate maize germplasm (Melchinger et al.,
2005; Röber et al., 2005).
Segregation studies (Lashermes and Beckert, 1988; Deimling et al., 1997) and
quantitative trait loci (QTL) analysis (Röber, 1999) demonstrated that in vivo haploid
induction in maize is a quantitative trait under the control of an unknown large number
of loci. Individual QTL explained only small parts of the genetic variation.
Compared with other methods of DH production such as anther culture, the inducer-based
approach is rather efficient, less dependent on the genotype and can be practised in
almost every maize breeding programme without access to expensive laboratory facilities
(Röber et al., 2005; http://www.uni-hohenheim.de/ipspwww/350a/linien/indexl.html).
Requirements for in vivo DH production in practical breeding include: (i) availability
of inducer genetic stocks; (ii) high induction rate; (iii) the inducer is a good
pollinator; (iv) reproducibility with reasonable seed quantities; (v) availability of a
marker system that is independent of the genetic background of the female and of
environmental effects and can be used for effective and unambiguous identification of
haploid kernels; and (vi) availability of an artificial chromosome doubling system with
high doubling rates that is safe, simple and cost-effective.
Since the late 1990s, these requirements have been partially met in maize with:
(i) inducer lines (e.g. RWS and UH400 developed at the University of Hohenheim) with
improved induction rates of 10% or higher; (ii) a combination of two dominant markers
(anthocyanin colour of endosperm and embryo for identification of haploids and
anthocyanin coloration of stalk for identification of false positives in the field); and
(iii) improved chromosome doubling systems using colchicine that gave a doubling rate of
greater than 10%.
A scheme to show in vivo haploid induction includes the following steps:

• Creating new variation by intercrossing with selected lines.
• In-vivo haploid induction in generation F1.
• Chromosome doubling of haploid seedlings:
    • selection of haploid kernels;
    • germination of kernels;
    • cutting of coleoptile;
    • doubling procedure: treatment of seedlings with colchicine;
    • planting of treated seedlings in greenhouse;
    • transplanting DH plants at the three-leaf stage to the field and selfing
      (generation D0); and
    • formation of testcross hybrids.
• Evaluation of testcrosses in multi-environment yield trials (two stages).

4.2.2 Diploidization of haploid plants

As described above, haploids can be produced through various approaches. Haploid plants
may grow normally under in vitro or greenhouse conditions up to the flowering stage, but
viable gametes are not formed due to the absence of one set of homologous chromosomes
and consequently, there is no seed set.
The only mechanism for perpetuating the haploids is by duplicating the chromosome
complement in order to obtain homozygous diploids. In pollen-derived plants duplication
of chromosomes may occur spontaneously in cultures. However, the spontaneous chromosome
doubling rate of haploids is usually low. In maize, for example, the rate ranges from 0
to 10% (Chase, 1969; Beckert, 1994; Deimling et al., 1997; Kato, 2002). Therefore, it is
necessary to diploidize the haploids by chemical means. Thus, artificial chromosome
doubling (diploidization) is necessary for the efficient large-scale use of haploid
plants.
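To give a rough sense of scale for the maize figures quoted above (induction rates of
about 10% and colchicine doubling rates above 10%), the sketch below multiplies the two
rates to estimate how many DH lines a batch of induction-cross kernels might yield. It
assumes that the two steps are independent and that there are no further losses (for
example misclassified kernels or seedling mortality); the function name and the batch
size of 10,000 kernels are illustrative only.

```python
def expected_dh_lines(kernels: int, induction_rate: float, doubling_rate: float) -> float:
    """Expected DH lines = kernels x haploid induction rate x doubling rate."""
    haploid_kernels = kernels * induction_rate
    return haploid_kernels * doubling_rate


if __name__ == "__main__":
    kernels = 10_000  # illustrative batch of kernels from induction crosses
    # Induction ~10% and doubling ~10%, the order of magnitude quoted for maize
    print(f"{expected_dh_lines(kernels, 0.10, 0.10):.0f}")  # 100
    # Early Stock 6 induction rate (up to 2.3%) with the same doubling rate
    print(f"{expected_dh_lines(kernels, 0.023, 0.10):.0f}")  # 23
```

Because the rates multiply, even modest gains in either induction or doubling translate
directly into the number of kernels a programme has to handle per finished DH line.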
in the latter case can be either genetic or epigenetic in origin. Typical genetic
alterations are: changes in chromosome numbers (polyploidy and aneuploidy), chromosome
structure (translocations, deletions and duplications) and DNA sequence (base
mutations).

GENETIC VARIATION ARISING FROM SOURCE PLANTS. The source plants used to initiate
cultures are likely to be heterogeneous with respect to the state of differentiation,
ploidy level and age. These explant-related factors will affect the genetic make-up of
the cells produced in the culture and thus the callus arising from such a group of cells
with diverse genetic make-up will inevitably lead to a mixed population of cells.
Depending on the cell types from which the plants originate, those regenerated from such
a genetically mosaic callus will undoubtedly be of different genetic make-up. Taji et
al. (2002) indicated that such genetic mosaicness seems to occur commonly in polyploid
plants rather than in diploids or haploids.

GENETIC VARIATION ARISING DURING CULTURE. Although a significant degree of genetic
variability can be traced to the genetically heterogeneous cell types of explant at
least in polyploid species, there is substantial evidence to indicate that much of the
variability observed in generated plants stems from the culture process itself.
Aneuploids, polyploids or cells with structurally altered chromosomes may arise in
culture. Many differentiated cells, when induced to divide in culture, undergo
endoduplication of chromosomes resulting in the production of tetraploid or octaploid
cells with distinct phenotypes.
Various phenomena have been observed in tissue culture of various plant species which
explain the production of cells with unusual ploidy levels (Bhojwani, 1990). Occurrence
of multi-polar spindles due to failure of spindle formation during cell division is one
of the contributing factors. Absence of spindle formation during mitosis results in the
appearance of cells with doubled chromosome number while the formation of multi-polar
spindles or chromosomes lagging at anaphase cause the development of cell lines with
haploid, triploid or other uneven ploidy status.
Many studies have indicated that cryptic structural modification of individual
chromosomes is more likely to cause somaclonal variation than modification induced by
ploidy changes in many tissue-cultured plants. Chromosomal changes occurring during
tissue culture include transposition of mobile genetic elements (transposons),
chromosome breakage and repositioning of chromosome segments.
As summarized by Taji et al. (2002), several mechanisms have been proposed to explain
the genetic variability that occurs in tissue culture. The most likely causes are:
1. Reduced regulatory control of mitotic events in culture: the ploidy status of plants
generated from callus, cell suspension or protoplast cultures of certain species differs
significantly despite the fact that the cultures originate from a highly homogeneous
genetic background. This indicates a lack of tight regulation of cell-cycle-related
controls during cell proliferation in culture.
2. Use of growth regulators: plant growth regulators, particularly synthetic auxins such
as 2,4-D, are considered to be the major cause of genetic variability in culture. For
example, cytokinins at low concentrations have been shown to reduce the range of ploidy
in culture while low levels of both auxins and cytokinins appear to preferentially
activate the division of cytologically stable meristematic cells enabling the
regeneration of genetically uniform plantlets.
3. Medium components: some of the mineral nutrients influence the establishment of
genetic variability in culture. For instance, by altering the levels of phosphate and
nitrogen as well as the form of nitrogen in the medium, the genetic composition (ploidy
level) of the cultured cells can be controlled to a considerable extent. A marked
increase in chromosome breakage has been observed in plant cell cultures grown with
different levels of magnesium or manganese.
4. Culture conditions: some culture conditions, such as incubation temperatures above 35°C and long duration of culture, have been implicated in inducing genetic variability in regenerated plants.
5. Inherited genomic instability: molecular studies indicate the existence of certain regions of the genome that are more susceptible to tissue-culture-induced structural alterations, although the reason for the increased susceptibility of these genomic loci, known as 'hot spots', is not fully understood.

CAUSES OF EPIGENETIC VARIATION IN TISSUE CULTURE. Any culture-induced changes which are stable but not heritable have frequently been considered as epigenetic variation. However, a greater understanding of genetic and epigenetic alterations in tissue culture in the recent past has led to a clear distinction between these two types of variation. For instance, genetic mutations occur randomly and at a much lower rate than epigenetic variations. Genetic changes are usually stable and heritable. Epigenetic variation may also lead to stable traits; however, reversal can occur at high rates under non-selective conditions. Epigenetic traits are often transmitted through mitosis in a stable manner but rarely through meiosis, and the level of induction of epigenetic traits is directly related to the selection pressure experienced by the cells. Epigenetic changes are generally assumed to reflect alteration in expression rather than in the information content of genes.

As Taji et al. (2002) summarized, the epigenetic variation observed in cultured cells or regenerated plants is mainly due to three cellular events: (i) gene amplification; (ii) DNA methylation; and (iii) increased activity of transposable elements. In plants, nearly 25% of the genome can be methylated at cytosine residues but the significance of this cytosine methylation is not apparent. It has been suggested that methylation (and demethylation) of DNA is one of the ways of controlling transcriptional activity and that this process can be affected by the tissue culture process. The non-heritable genetic variability observed in many tissue culture systems could thus be attributed to tissue-culture-induced methylation or demethylation of DNA. The activity of transposons and retrotransposons induced by tissue culture could also be responsible for some of the genetic and epigenetic variability observed in culture.

4.2.4 Quantitative genetics of DHs

DH lines that are derived randomly from an array of gametes produced by F1 plants are very useful in quantitative genetics. Compared with diploid genetic models for populations such as F2, F3 or BC, there are no dominance or dominance-related epistasis effects involved in the genetic model of DH populations. As a result, additive, additive-related epistasis and linkage effects can be investigated properly. As a permanent population, DH lines can be replicated as many times as desired across different environments, seasons and laboratories, providing endless genetic material for phenotyping and genotyping, particularly for understanding the genotype-by-environment interaction. In DH populations, the additive component of genetic variance is larger than that of diploid populations such as F2 and BC. Choo et al. (1985) discussed in detail the quantitative genetics associated with DH populations, including detection of epistasis, estimation of genetic variance components, linkage test, estimation of gene numbers, genetic mapping of polygenes and tests of genetic models and hypotheses. Röber et al. (2005) compared the expected gain from selection for DH lines and other populations and the implications of epistatic effects, which are briefly described here.

Expected gain from selection

As is well known from quantitative genetics (see e.g. Falconer and Mackay (1996) and also Chapter 1), the expected gain from selection can be described by $G = i\,h_x\,r_G\,\sigma_y$, where $i$ is the selection intensity, $h_x$ the square root of the heritability of the selection
criterion, $r_G$ the genetic correlation between selection criterion and gain criterion and $\sigma_y$ the standard deviation of the gain criterion. In long-term breeding programmes, the decisive gain criterion for evaluating selection progress in hybrid breeding is the general combining ability (GCA) of the improved lines. At the beginning of a breeding cycle the test units are the DH lines per se and later on in the cycle their testcrosses.

Strong selection (large $i$) leads to a small effective population size and consequently to a loss of genetic variance due to random drift. To keep this loss within certain limits, a minimum number of lines should be recombined after each breeding cycle. This number depends on the inbreeding coefficient (F) of the candidate lines. The number should be (2F) times larger for inbred lines than for non-inbred genotypes. Assuming that S2 lines (F = 0.75) are recombined in conventional breeding, the number of DH lines (F = 1) would have to be increased 1/0.75 = 1.33-fold to preserve an equivalent level of genetic variation. This means that the selection intensity must be reduced accordingly when using DH lines.

In contrast to the selection intensity, $h_x$ and $r_G$ increase when using DH lines. This increase is particularly large in the first testcross stage. Neglecting epistasis, the GCA variance of inbred lines is equal to $\tfrac{1}{2} F \sigma_A^2$ (Falconer and Mackay, 1996), where $\sigma_A^2$ is the additive variance of the base population. Thus the GCA variance of DH lines is 1/0.75 = 1.33 times larger than that of S2 lines. This leads to better differentiation among the testcrosses and consequently to higher heritability. Seitz (2005) compared three sets of S2 and S3 lines each with DH lines derived from the same crosses and evaluated with the same testers in the same environments. On average, the estimated genetic testcross variances for grain yield (bu acre$^{-1}$) amounted to 50, 94 and 124 for S2, S3 and DH lines, respectively.

The genetic correlation between selection and gain criterion ($r_G$) also increases with the degree to which the tested lines have been inbred. For example, the correlation between $S_t$ lines and their homozygous progenies for GCA is equal to $\sqrt{F_t}$, whereas for DH lines this correlation is 1. Thus, compared with S2 lines, the correlation of DH lines is $1/\sqrt{0.75} = 1.15$ times stronger.

Implications of epistatic effects

Epistatic gene action may positively or negatively affect hybrid performance (Lamkey and Edwards, 1999). In most cases, epistatic effects have been reported to cause a decrease in the testcross performance of segregating generations (Lamkey et al., 1995) or to penalize three-way and double crosses compared to their non-parental single crosses (Sprague et al., 1962; Melchinger et al., 1986). These effects are commonly referred to as 'recombinational loss' and may be explained by a disruption during meiosis of co-adapted gene arrangements assorted by prior natural and artificial selection. Marker-based analyses of QTL partially corroborate this hypothesis (Stuber, 1999). To avoid recombinational loss and still offer a chance to select for new positive interactions, a balance between recombination and fixation of gene arrangements is needed. The DH-line approach might offer the method for achieving this goal, as homozygosity can be reached in one cycle of recombination when F1 is used for DH development or in different cycles when segregating populations of different generations are used.

4.2.5 Applications of DH populations in genomics

In genetics, DHs may serve to recover recessives. Using DHs, linkage data can be obtained directly by sampling gametes as monoploids. DHs are ideal for the study of mutation frequencies and spectra. As DHs represent homozygous, immortal and true breeding lines, they can be repeatedly phenotyped and genotyped so that phenotypic and genotypic information can be accumulated over years and across laboratories. In genomics, DHs are therefore ideal for studying complex traits that are quantitatively inherited, which may require replicated trials over many years and locations for accurate phenotyping.
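Returning to the expected gain from selection in Section 4.2.4, the comparison of S2 and DH lines can be checked numerically. The sketch below is illustrative only: the function names and the example inputs in the last call are assumptions, not values from the text. Only F = 0.75 for S2 lines and F = 1 for DH lines are taken from the text, and the genetic correlation is taken as the square root of F, which reproduces the factor of 1.15 quoted there.

```python
import math

def expected_gain(i, h2_x, r_g, sigma_y):
    """Expected gain G = i * h_x * r_G * sigma_y, where h_x = sqrt(heritability)."""
    return i * math.sqrt(h2_x) * r_g * sigma_y

def gca_variance(f, sigma_a2):
    """GCA variance of inbred lines, neglecting epistasis: (1/2) * F * sigma_A^2."""
    return 0.5 * f * sigma_a2

F_S2, F_DH = 0.75, 1.0  # inbreeding coefficients given in the text

# Ratio of GCA variances (DH relative to S2); sigma_A^2 cancels out.
gca_ratio = gca_variance(F_DH, 1.0) / gca_variance(F_S2, 1.0)

# Ratio of genetic correlations, assuming r_G = sqrt(F).
rg_ratio = math.sqrt(F_DH) / math.sqrt(F_S2)

print(round(gca_ratio, 2))  # 1.33
print(round(rg_ratio, 2))   # 1.15

# Hypothetical inputs, for illustration only.
print(expected_gain(i=2.0, h2_x=0.25, r_g=1.0, sigma_y=4.0))
```

Because the four factors enter as a product, a lower selection intensity for DH lines can be offset by the higher heritability and genetic correlation; which effect dominates depends on the actual values in a given programme.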
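[Figure: derivation of doubled haploids from the cross P1 (AB/AB) × P2 (ab/ab). The F1 (AB/ab) produces the gametes AB, Ab, aB and ab; the corresponding haploids (AB, Ab, aB, ab) are chromosome-doubled to give the doubled haploids AB/AB, Ab/Ab, aB/aB and ab/ab.]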
1991). A genetic map was constructed using 55 RFLP markers and two known genes and is the first complete molecular map to be constructed using DH populations in crops. Since then, many DH populations have been developed using the different approaches described above and have been used for map construction and genetic mapping.

4.2.6 Application of DHs in plant breeding

The benefits of DHs in plant breeding have been widely reviewed; readers should refer to Forster and Thomas (2004), Forster et al. (2007), and the five volumes on In Vitro Haploid Production in Higher Plants edited by Jain et al. (1996–1997).

Application of DHs in plant breeding can be described by comparison of the time required to obtain fixed inbreds relative to inbreeding, starting from a heterozygote:

Selfing of a heterozygote          Haploids of a heterozygote
Gametes: 1/2 A + 1/2 a             Gametes: 1/2 A + 1/2 a
F2: 1/4 AA, 1/2 Aa, 1/4 aa         chromosome doubling
F3: 1/4 Aa                         1/2 AA + 1/2 aa
F4: 1/8 Aa
F5: 1/16 Aa
F6: 1/32 Aa
1/2 AA + 1/2 aa

Evidently, the DH approach has a time reduction of three to four generations compared to inbreeding-based breeding. The DH approach features many logistical advantages, simplifying breeding to a large extent and enabling evaluation of genetically fixed hybrid components from the very beginning of the selection process. Depending on the material, the costs and the breeding scheme adopted, the DH approach can reduce the time for development and commercialization of new inbred lines and lead to a higher expected genetic gain per unit of time.

As outlined above, DH lines extracted from a heterozygote or a segregating population represent immortalized, reproducible gametes that can be immediately evaluated to assess their true breeding potential for target traits. They have the following advantages and clear beneficial applications (Melchinger et al., 2005; Röber et al., 2005; Longin et al., 2006; W. Schipprack, University of Hohenheim, personal communication):

• providing the quickest possible route to complete homozygosity;
• giving an immediate product of stable recombinants from species crosses;
• no masking effects because of the high homogeneity attained in the first generation of DH populations;
• increased performance per se due to selection pressure in the haploid phase and/or during the first generation of DHs;
• complete genetic variance accessible from the very beginning of the selection process;
• easy integration of line/hybrid development with recurrent selection;
• reduced efforts in the nursery after the first multiplication of DH lines compared to a conventional breeding nursery;
• maximum genetic variance in line per se and testcross trials;
• high reproducibility of early-selection results;
• high efficiency in stacking specific targeted genes in homozygous lines; and
• simplified logistics for seed exchange between main and off-season programmes since each line is fixed and can be represented by a single plant.

DHs have been used in plant breeding programmes to produce homozygous genotypes in a number of important species, e.g. tobacco (Nicotiana tabacum L.), wheat, barley, canola (Brassica napus L.), rice and maize (Maluszynski et al., 2003), but only rarely in triticale, oat, rye and others. Research in crops such as rice, wheat and maize has shown that significant progress in haploid technology is attainable given an intensive research effort. Well-established methods in these crops have allowed major parts or whole breeding programmes to be based on DH production. Oat, triticale, wild barley, potato and cabbage are examples of crops where DH technologies are less advanced but in which hundreds of
DHs may still be obtained (Tuvesson et al., 2007). In other crops, including some vegetable species and forage and turf grass species, DH methods are being developed, but applications in crop improvement are rare. The DH approach has yet to be exploited in leguminous species, predominantly due to their cultivation in developing countries and consequent paucity of research funding. Difficulties have also been posed by the small anther size and relatively low number of microspores per anther in legume crops (Croser et al., 2006).

The DH technique offers an efficient tool for extracting individual gametes from heterozygous materials and transforming them into homozygous lines that can be reproduced ad libitum by selfing. DHs extracted from a heterogeneous population, e.g. landraces, represent immortalized, reproducible gametes that can be immediately evaluated to assess their true breeding potential for target traits. They can also serve as source material for breeding programmes of hybrids and synthetics. Furthermore, DH lines may be used for long-term conservation of heterogeneous germplasm resources such as landraces without the risk of genetic drift and other changes in gene frequencies, as well as for in-depth characterization of the breeding potential of each heterogeneous germplasm collection, because each of the extracted DH lines can be evaluated in replicated trials in diverse environments.

With some DH methods, only a tiny fraction of the haploid seedlings will germinate and survive to the adult stage due to the uncovered genetic load and the stress in plant development exerted by colchicine treatment for chromosome doubling. Nevertheless, because the DH technique is rather simple, it is feasible to generate and identify large numbers of haploid seeds, treat them with colchicine and transplant them to the field. Hence, by starting with a sufficiently large number of haploid seeds it is possible to generate hundreds of viable DH lines with acceptable agronomic performance.

DHs are especially important in the evaluation of diversity, because they fix rare alleles and aid the efficient selection for quantitative traits in breeding. In outcrossing species, DHs enable undesirable recessive genes to be eliminated from lines at any breeding stage (Forster and Thomas, 2004).

Development of DHs through anther culture has been very successful, with many cultivars released in barley breeding worldwide and in rice breeding in China since the 1970s. The production of DHs has become the preferred tool in many advanced plant breeding institutes and commercial companies for breeding many crop species. Due to the obvious advantages of DH lines and the enhancements made in in vivo haploid induction in recent years, many commercial breeding companies such as Agreliant, Monsanto and Pioneer are presently adopting or are already routinely using this technology in their maize breeding programmes (Seitz, 2005). Recurrent selection for testcross performance using DHs has reduced the cycle length and improved genetic advance (Gallais and Bordes, 2007). In some companies in vivo haploid induction has more or less replaced conventional line development, with up to 15,000 DH lines per year per breeding programme and over 100,000 DH lines per year across all programmes at costs of US$10 or less per DH line. The first maize hybrids produced using DH lines have been commercialized in the USA and Europe (W. Schipprack, University of Hohenheim, personal communication). However, the development of new, more efficient and cheaper large-scale production protocols has meant that DHs have also recently been applied in less advanced breeding programmes.

4.2.7 Limitations and future prospects

Genetics and breeding in DHs have not given the desired and expected dividends, despite the substantial investments made in haploid research since the late 1980s. Some of the widely recognized limitations of DH breeding are as follows: (i) haploids cannot be obtained in the high frequency
required for selection in most important crop species; (ii) the cost–benefit ratio in DH breeding is often not favourable, thus discouraging its use despite the obvious advantages; (iii) haploids and DHs will express recessive deleterious traits and deleterious mutations may arise during the DH development process including anther culture, particularly for open-pollinated species; (iv) different ploidy levels may be available so that haploid status may need to be confirmed cytologically; alternatively, pollen culture may be necessary, which is expensive, has a relatively low success rate and is also genotype-dependent in many species; (v) doubled haploidy may also decrease genetic diversity, which is better maintained in heterozygous lines; (vi) the success of the DH method is highly genotype-dependent, so it is not yet suitable for all breeding programmes; (vii) some techniques, e.g. inducers in maize (especially the good ones), are proprietary and not available to all interested breeders; and (viii) health and legal concerns related to handling the chemical doubling agents.

The Third International Conference on Haploids in Higher Plants (12–15 February 2006, Vienna, Austria) highlighted the following issues that are important to future studies on DHs: (i) new methods of haploid and DH plant formation; (ii) mechanism of initiation of haploids; (iii) application of haploid cells, gametes, haploid and DH plants in fundamental and applied science; (iv) genes controlling haploid formation from female and male gametes; and (v) methods of diploidization of haploids.

4.3 Recombinant Inbred Lines (RILs)

Recombinant inbred lines or random inbred lines (RILs) are usually a part of the ultimate products of many breeding programmes and are also used as genetic materials. They can be produced by various inbreeding procedures. To help understand the whole process of development and applications of RILs, the inbreeding procedure and its effects will be discussed first.

4.3.1 Inbreeding and its genetic effects

RILs result from continuous inbreeding such as selfing or sibmating starting from an F2 population until homozygosity is reached. There are two genetic responses to inbreeding: gene recombination and genotype homogenization. Starting from a heterozygote at a locus A-a, for example, selfing will produce three genotypes, AA, Aa and aa. With continuous selfing, the two homozygotes, AA and aa, will not segregate, while the heterozygote Aa will continue to segregate, producing the three genotypes. However, the proportion of heterozygotes in the population will decrease with continuous selfing and will approach zero. This process can be described as below.

Consider one locus with two alleles, A and a, undergoing continuous selfing. With each generation of selfing, the proportion of heterozygotes is halved and the proportion of homozygotes increases by the same amount. At generation $t$, the proportion of heterozygotes in the population will be $(1/2)^t$, while the proportion of homozygotes will be $1 - (1/2)^t$; the homozygotes AA and aa each account for $[1 - (1/2)^t]/2 = (2^t - 1)/2^{t+1}$ (Table 4.1).

When two or more loci, for example $k$ loci, are involved, successive selfing from F1 hybrids will produce $(1/2)^{tk}$ heterozygotes and $[1 - (1/2)^t]^k = [(2^t - 1)/2^t]^k$ homozygotes at generation $t$. The more loci are involved, the longer it takes to reach homozygosity (Fig. 4.2). In the seventh generation of selfing starting from a heterozygous hybrid, for example, the proportion of homozygotes will be 99% for the population with one heterozygous locus involved, 96% for the population with five heterozygous loci involved, 89% for 15 loci, 79% for 30 loci and 46% for 100 loci.

If heterozygous loci are linked, successive inbreeding can still produce a homozygous population. However, the rate of approach to homozygosity depends on the recombination frequencies between the linked loci. The lower the recombination frequency, the higher the proportion of homozygotes in the population and the more rapidly the population becomes homogenized. If the recombination frequency, r,
Table 4.1. Genotypes derived from a single-locus heterozygote and their frequencies in selfing generations.

                         Genotype
Generation   AA          Aa        aa          Frequency of     Frequency of
                                               heterozygotes    homozygotes (%)
0            -           1         -           1                0
1            1/4         2/4       1/4         1/2              50.0
2            3/8         2/8       3/8         1/4              75.0
3            7/16        2/16      7/16        1/8              87.5
4            15/32       2/32      15/32       1/16             93.8
5            31/64       2/64      31/64       1/32             96.9
10           1023/2048   2/2048    1023/2048   1/1024           99.9
[Figure: homozygotes (%) (y-axis, 0 to 100) plotted against generations of selfing (x-axis, 1 to 12), with one curve for each number of loci: 1, 5, 10, 20, 40 and 100.]
Fig. 4.2. Effects of generations and genetic loci on the proportion of homozygotes in self-pollinated populations (numbers of loci are 1, 5, 10, 20, 40, 100).
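The single-locus frequencies of Table 4.1 and the multi-locus percentages quoted in Section 4.3.1 follow directly from the expressions (1/2)^t and [1 - (1/2)^t]^k. The following is a minimal sketch, assuming that generation 0 is the heterozygous F1 and that the k loci are unlinked and segregate independently; the function names are illustrative.

```python
from fractions import Fraction

def heterozygote_freq(t):
    """Proportion of heterozygotes at one locus after t generations of selfing."""
    return Fraction(1, 2) ** t

def homozygote_freq(t, k=1):
    """Proportion of plants homozygous at all k unlinked loci after t generations."""
    return (1 - heterozygote_freq(t)) ** k

# Single-locus rows, as in Table 4.1.
for t in (1, 2, 3, 4, 5, 10):
    each_homozygote = homozygote_freq(t) / 2  # AA (and likewise aa): (2^t - 1) / 2^(t+1)
    percent = float(homozygote_freq(t)) * 100
    print(t, each_homozygote, heterozygote_freq(t), f"{percent:.1f}%")

# Seventh generation of selfing for different numbers of loci.
for k in (1, 5, 15, 30, 100):
    print(k, f"{float(homozygote_freq(7, k)) * 100:.0f}%")
```

Exact fractions are used so that the genotype frequencies print in the same form as the table entries; the second loop gives 99%, 96%, 89%, 79% and 46% for 1, 5, 15, 30 and 100 loci.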
is close to zero or two loci are completely linked, the rate of homogenization will be close or equal to the rate for the population with one heterozygous locus. If r is about 50%, the rate of homogenization will be about the same as that for the population with two heterozygous loci. It can be estimated that for two linked loci and after one generation of selfing, the proportion of homozygotes will be 41% for r = 10%, 34% for r = 20%, 26% for r = 40% and 25.25% for r = 45%.

Continuous inbreeding (e.g. selfing) results in the fixation of segregation so that the genetic combinations of two parental genomes represented in individual F2 plants are each represented by an RIL (Fig. 4.3). The genetic combinations of two parental genomes are fixed in a group of RILs.

For quantitative traits that are controlled by polygenes or multiple QTL, the mean value of the population will return to the average value of the parental lines because dominance and dominance-related epistasis will dissipate with increasing homogenization. The variance will also change with increasing homogenization but the direction of change will depend
[Figure: two panels plotted against generation. (A) Mean (y-axis, 22 to 38) for the parents P1 and P2 and for F1 to F6; (B) variance (y-axis, 2 to 10) for F2 to F6. Curves I to IV correspond to the genetic models listed in the caption.]
Fig. 4.4. Change of mean (A) and variance (B) in RIL populations derived by SSD. (I) Additive increasing alleles are completely dominant. (II) Additive without dominance effect. (III) Additive increasing alleles are completely dominant with complementary interaction. (IV) Additive increasing alleles are completely dominant with duplicate interaction.
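For the two linked loci considered in Section 4.3.1, the proportion of plants homozygous at both loci after one generation of selfing can be derived from gamete frequencies. The sketch below assumes selfing of a double heterozygote with recombination frequency r and no selection; this model is an assumption (the text does not state how its figures were estimated), but it reproduces the quoted 41%, 34% and 26%. The function name is illustrative.

```python
def double_homozygotes_after_one_selfing(r):
    """Proportion homozygous at both linked loci after selfing a double heterozygote once.

    Each parental gamete type has frequency (1 - r) / 2 and each recombinant
    type r / 2; a double homozygote arises when two identical gametes unite.
    """
    parental = (1 - r) / 2
    recombinant = r / 2
    return 2 * parental ** 2 + 2 * recombinant ** 2

for r in (0.0, 0.10, 0.20, 0.40, 0.50):
    print(f"r = {r:.2f}: {double_homozygotes_after_one_selfing(r):.2%}")
```

The two limiting cases agree with the text: at r = 0 the result is 50%, the single-locus value after one generation of selfing, and at r = 0.5 it is 25%, the value for two independent loci.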
because of lack of seed germination or failure of plant establishment to produce seed. It is necessary to decide on the number of inbred plants that are desired in the last generation and begin with an appropriate population size in the F2 generation. The single-seed procedure ensures that each individual in the final population traces to a different F2 individual. However, the procedure cannot ensure that a particular F2 will be represented in the final population because failure of any seed to germinate or generate a productive plant automatically eliminates that seed's F2 family.

SINGLE-HILL PROCEDURE. The single-hill procedure can be used to ensure that each F2 plant will have progeny represented in each generation of inbreeding. Progeny from individual plants are maintained as separate lines during each generation of inbreeding by planting a few seeds in a hill or row, harvesting self-pollinated seeds from the hill and planting them in another hill the following generation. An individual plant is harvested from each line when the population has reached the desired level of homozygosity.

With the single-hill procedure the identity of each F2 plant and its progeny can be maintained during self-pollination. When the identity of an F2 is maintained, the seed packet and hill must be properly identified with a line designation for planting and harvest.

MULTIPLE-SEED PROCEDURE. Use of the single-seed procedure requires that the size of the populations in F2 be larger than in later generations, due to lack of seed germination or plant establishment for seed set. Usually, two samples are harvested, one for planting in the next generation and one for a reserve. Researchers sometimes bulk two or three seeds from each plant during harvest. Part of the sample is planted and the remainder is reserved. The procedure is referred to as modified SSD. The number of seeds planted
and harvested each season depends on the number of lines desired from the population and the anticipated percentages of seed germination, seedling establishment and seed set.

Advantages and disadvantages of SSD procedures

Fehr (1987) summarized the merits of the SSD procedures and indicated the following advantages:

• They are an easy way to maintain populations during inbreeding.
• Natural selection does not influence the population unless genotypes differ in their ability to produce at least one viable seed each generation.
• The procedures are well suited to greenhouse and off-season nurseries where the performance of genotypes may not be representative of their performance in the area in which they are normally grown.

The disadvantages are: (i) artificial selection is based on the phenotype of individual plants, not on progeny performance, when SSD is used for cultivar development rather than genetic population development; and (ii) natural selection cannot influence the populations in a positive manner unless undesirable genotypes do not germinate or set any seed.

[Fig. 4.5: recombinant frequency r (%) plotted against map distance R (cM); for RILs derived from selfing, $r = 2R/(1 + 2R)$.]

opportunities to recombine in RIL populations. This property was discovered by Haldane and Waddington (1931) by studying inbreeding populations. For tightly linked loci, the number of recombinants observed in RILs is twice that observed in the populations with only one cycle of meiosis. At the beginning stage of genetic mapping, this multiple recombination in RILs makes it difficult to detect linkage. Once linkage relationships are roughly established among loci, the greater frequency of recombination makes it easy to detect non-allelism among loci. It also makes the estimation of genetic distances more accurate because the confidence interval for an estimated genetic distance is a function of recombination frequency. With the increased number of meiosis events, there are more opportunities to find recombinants between two tightly linked loci (Fig. 4.5).

In populations that have undergone only one cycle of meiosis, recombinant frequency r (%) is linearly related to map distance R (cM), as indicated by the dashed line in Fig. 4.5. In RIL populations derived from selfing, r is almost equal to 2R when the map distance is small, which is indicated by the solid line and formula $R = r/(2 - 2r)$ (Fig. 4.5). For the RIL populations derived
from sibmating, the skew becomes more significant with r nearly equal to 4R when the map distance is small.

4.3.4 Construction of genetic maps using RILs

As each RIL, like a DH line, is inbred and thus can be propagated indefinitely, a panel of RILs has a number of advantages for genomic studies: (i) each line needs to be genotyped only once; (ii) multiple individuals can be phenotyped from each line to reduce individual, environmental and measurement variability; (iii) multiple invasive (destructive) phenotypes can be obtained on the same set of genomes; and (iv) as recombinations are more frequent in RILs than in populations with only one meiosis, greater resolution can be achieved in genetic mapping.

In genetic mapping with RIL populations, recombinant frequency should be converted into map distance using the formula $R = r/(2 - 2r)$ proposed by Haldane and Waddington (1931). There are no mapping functions available for RIL populations to adjust for double crossover events as there are for populations with one cycle of meiosis as discussed in Chapter 2. When the map distance is within the range that allows confidence about linkage detection, recombinant frequency has a linear relationship with map distance (Fig. 4.5; Silver, 1985).

Non-linked loci may appear to be linked simply due to chance. These false linkages can often be identified by checking whether a linkage detected with one marker is also supported by other markers in the same linkage group and whether the suspected linkage found in one population can also be detected in other RIL populations. Mouse geneticists discussed the case when a linkage could not be certain because of small population sizes, and Silver (1985) provided a table for the 95 and 99% confidence intervals for estimated map distances based on RILs derived from sibmating. At low rates of recombination, these intervals are relatively small when compared with those obtained from the binomial distribution for F2 and BC1 populations of comparable population size. According to Taylor (1978), RILs derived from sibmating were more powerful in the estimation of map distances than populations undergoing single meiosis when R < 12.5 cM. Based on Taylor's method, it can be inferred that RILs derived from self-pollination have greater influence on the estimation of map distance when R < 23 cM.

Because of the advantages of RILs, they have been receiving great attention in genomics research. Numerous RIL populations have been developed in plant species, especially in maize and rice. Burr et al. (1988) reported RFLP maps constructed using two maize RIL populations, T232 × CM37 and CO159 × Tx303. Among 334 mapped genetic loci, 220 were polymorphic in both populations. By comparing the map distances obtained from these two populations with each other and with publicly accepted map distances, they found that the differences could be twice as large in some cases. Although these differences were still within the range of confidence intervals, they might be due to the genetic difference in recombinant frequencies at specific chromosomal regions. In maize there is no significant polymorphism caused by chromosome rearrangement, except for chromosome 10. Therefore, it is not surprising that there was no significant difference in map distance between the two maize RIL populations. Table 4.2 provides some examples of RIL populations developed in maize (Burr et al., 1988) and in rice (Xu, Y., 2002) that have been widely used for linkage mapping and gene tagging.

4.3.5 Intermated RILs and nested RIL populations

Intermated RILs

The production of RILs allows for the accumulation of recombination breakpoints during the inbreeding phase. However, the accumulation in RILs is limited by the fact that each generation of inbreeding makes the recombining chromosomes more similar to one another so that meiosis ceases to
Table 4.2. Some examples of RIL populations developed in maize (Burr et al., 1988) and rice (Xu, 2002).
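The Haldane and Waddington (1931) conversion used in Section 4.3.4 can be sketched as a pair of functions. Here R is the recombination fraction for a single meiosis and r the recombinant frequency observed among RILs, both expressed as fractions between 0 and 0.5 rather than as percentages or cM. The selfing relation is the one given in the text; the sibmating relation, r = 4R/(1 + 6R), is not stated in the text and is included as the standard Haldane and Waddington result, consistent with r being nearly 4R at small distances. The function names are illustrative.

```python
def ril_r_from_R_selfing(R):
    """Recombinant frequency in selfing-derived RILs: r = 2R / (1 + 2R)."""
    return 2 * R / (1 + 2 * R)

def R_from_ril_r_selfing(r):
    """Inverse relation for selfing-derived RILs: R = r / (2 - 2r)."""
    return r / (2 - 2 * r)

def ril_r_from_R_sibmating(R):
    """Recombinant frequency in sibmating-derived RILs: r = 4R / (1 + 6R)."""
    return 4 * R / (1 + 6 * R)

def R_from_ril_r_sibmating(r):
    """Inverse relation for sibmating-derived RILs: R = r / (4 - 6r)."""
    return r / (4 - 6 * r)

for R in (0.01, 0.05, 0.10, 0.25, 0.50):
    print(f"R = {R:.2f}  selfing r = {ril_r_from_R_selfing(R):.4f}  "
          f"sibmating r = {ril_r_from_R_sibmating(R):.4f}")
```

At small R the selfing value is close to 2R and the sibmating value close to 4R, while both converge to 0.5 for unlinked loci (R = 0.5).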
The Complex Trait Consortium for mouse proposed the development of a large panel of
eight-way RILs (Complex Trait Consortium, 2004). An eight-way RIL, also known as
Collaborative Cross, is formed by intermating eight parental inbred strains followed by
repeated sibling mating to produce a new set of inbred lines whose genome is a mosaic of
the eight parental strains (Broman, 2005). Such a panel of RILs would serve as a valuable
resource for mapping the loci that contribute to complex phenotypes in mouse and would
support studies that incorporate multiple genetic, environmental and developmental
variables into comprehensive statistical models of complex traits (Complex Trait
Consortium, 2004). The genomes of eight founder strains are rapidly combined and are then
inbred to produce finished RIL strains. Eight-way RIL strains achieve 99% inbreeding by
generation 23. Each strain captures 135 unique recombinant events. With genetic
contributions from multiple parental strains including several wild derivatives, the
eight-way RILs will capture an abundance of genetic diversity and will retain segregating
polymorphisms every 100–200 bp. This level of genetic diversity will be sufficient to
drive phenotypic diversity in almost any trait of interest. An estimated 1000 strains will
be required to guarantee high mapping resolution and detect extended networks of epistatic
and gene–environment interactions.

ing variation in maize, 25 RIL mapping populations were created. Twenty-five diverse
lines were selected to capture 80% of the nucleotide polymorphism in maize. In order to
provide a uniform evaluation background, each line was crossed to a common parent, B73
(the standard US inbred), to form 25 RIL populations. Each of these RIL populations has at
least 200 RILs, each descended from a unique F2 plant, resulting in a total of 5000 RILs.
Using SSD and low density planting, 88% success in advancing lines per generation was
achieved. This has developed as an integrated mapping approach, called Nested Association
Mapping (NAM), which exploits simultaneously the advantages of linkage analysis and
association (or linkage disequilibrium, LD) mapping as discussed in Chapter 6. The power
of NAM for genome-wide QTL mapping has been demonstrated by computer simulation with
varied numbers of QTL and trait heritabilities (Yu et al., 2008). With a dense coverage
(2.6 cM) of common-parent-specific (CPS) markers, the genome information for 5000 RILs
can be inferred based on the parental genome information. Essentially, the linkage
information captured by the CPS markers and the LD information among loci residing between
CPS markers was then projected to RIL based on parental information, ultimately allowing
for genome-wide high-resolution mapping. The power of NAM with 5000

This estimate is based on the statisti-