Measuring Human Capital
Abstract
Students around the world are going to school but many of them are not learning—an emerging gap in human capital formation. To understand this gap, this paper introduces a new global data set measuring learning in 164 countries. The data cover 98 percent of the world's population from 2000 to 2017. The data set will be publicly available and updated at regular intervals by the World Bank and is designed to serve as a public good to accelerate global policy and research agendas focused on quality education and human capital formation. The paper presents several motivating facts in a first application of the data: (a) although enrollment has increased worldwide, learning progress is more limited; (b) girls outperform boys on learning—a positive gender gap—in contrast to a negative gender gap observed for schooling; (c) human capital when measured by both schooling and learning accounts for between a fifth and half of cross-country income differences—a middle ground in the recent development accounting literature; and (d) average estimates mask important underlying heterogeneity by country income status and region. These stylized facts demonstrate the potential of this new global dataset to reveal insights into the relationship between human capital and economic development.
* Angrist: University of Oxford, and World Bank, 1818 H Street NW, Washington, DC 20433 (e-mail:
[email protected]); Djankov: London School of Economics and the Peterson Institute for
International Economics (e-mail: [email protected]); Goldberg: Yale University, and the Peterson
Institute for International Economics (e-mail: [email protected]); Patrinos: World Bank 1818 H
Street NW, Washington, DC 20433 (e-mail: [email protected]). We thank the referees for
revisions which improved the paper. We are grateful to Syedah Aroob Iqbal and Husein Abdul-Hamid for
research support. This work builds on co-authored work with Nadir Altinok. Particular thanks to Aart
Kraay for detailed comments and contributions to the methodology. Valuable comments were provided
by Eva L. Baker, Felipe Barrera-Osorio, Eduardo Cascallar, Paul Collier, Stefan Dercon, Deon
Filmer, Roberta Gatti, Rachel Glennerster, Daniel Koretz, Julien Labonne, Silvia Montoya, George
Psacharopoulos, Simon Quinn, Heiner Rindermann, Halsey Rogers, Jaime Saavedra, Shwetlena
Sabarwal, and Eduardo Velez. This paper benefited from seminars held at the World Bank, World
Congress of Cliometrics, American Economic Association, IDB, Oxford, USAID and FHI360. A great
number of individuals and organizations supplied us with data. A special thanks to Rebecca Rhodes for
access to the microdata for many EGRA learning assessments. The views expressed here are those of the
authors and should not be attributed to the World Bank.
I. INTRODUCTION
The notion of human capital – resources embedded in people – was alluded to as early as 1776
by Adam Smith and formalized two centuries later by Becker (1962). Ever since, the literature has
explored the role of human capital in economic development. For decades, this literature proxied
human capital and education with measures of schooling.2 This applies even to the most prominent
index of human capital to date, the United Nations' Human Development Index (HDI).
However, proxying human capital with schooling assumes that being in school translates into
learning. Evidence suggests that this is often not the case (Pritchett 2013). A recent analysis reveals
that six out of ten adolescents worldwide cannot meet basic proficiency levels in math and reading
(UNESCO 2017). The gap between schooling and learning is acute in developing countries. In
Kenya, Tanzania, and Uganda, three-quarters of grade 3 students cannot read a basic sentence such
as “the name of the dog is Puppy.” In rural India, half of grade 3 students cannot solve a two-digit
subtraction problem such as 46 minus 17 (World Bank, World Development Report 2018).
These stylized facts demonstrate a gap in human capital formation: students are in school but
learning is limited.3 Closing this gap presents a significant margin to drive economic development.
Several papers have argued that when human capital is measured by schooling it fails to deliver
the returns predicted by growth models. However, when measured by learning, human capital is
more strongly associated with growth.4
To date, much of the analysis of human capital when measured by learning has focused on
advanced economies. This is due to the absence of comparable measures of learning in developing
countries. This excludes a significant portion of the global distribution, in particular countries with
the most potential to gain from human capital accumulation.
In this paper, we bridge this gap. We introduce a database of globally comparable learning
outcomes for 164 countries covering 98 percent of the global population from 2000 to 2017.5 This
is the largest and most current global learning database, one of the first to disaggregate learning results by gender, and it introduces methodological improvements, such as the inclusion of standard errors to quantify uncertainty. The database will be updated at regular intervals and made available
2 Examples include Mincer (1984), Mankiw, Romer, and Weil (1992), and Lutz and Samir (2011).
3 We refer to 'schooling' as measured by enrollment or average years of school and 'learning' as measured by the stock of cognitive skills in basic proficiencies including mathematics, reading, and science.
4 See related papers by Krueger and Lindahl (2001); Pritchett (2006); Hanushek and Woessmann (2012a).
5 There are two administrative regions, Hong Kong and Macao, which we refer to as countries for simplicity.
for public use. We hope this database is a public good that will enable tracking of human capital
formation and deepen understanding of the factors driving human capital formation and economic
development. A large-scale effort using this database to track and understand human capital
formation is the World Bank’s new Human Capital Index.6
The database includes 164 countries from 2000 to 2017 and was produced through a large-scale
effort by the World Bank to identify, collect and collate student assessment data worldwide. We
include seven assessment regimes total, including three international tests and three regional
standardized achievement tests. We also include the Early Grade Reading Assessment (EGRA),
which adds 48 countries to the database with at least one data point in the past 10 years, including
large developing economies such as Bangladesh, Nigeria and Pakistan. Each test covers between
10 and 72 countries. By combining these assessments and making them comparable we include
countries which represent 98 percent of the global population. The supplement includes a detailed
description of the methodology we use to develop harmonized learning measures and all data
included in the database.
The final database includes mean scores as well as standard errors for each measure to quantify
uncertainty. Scores are disaggregated by schooling level (primary and secondary), subject
(reading, math and science) and gender (male and female). We include year-by-year data. The
database will be made publicly available by the World Bank and will be updated regularly. We do
not extend the time series prior to 2000 since data quality is low and extending it would add just two territories, leaving country coverage essentially unchanged.
Several statistics demonstrate the coverage and detail of the database. Table I presents coverage
for country-year observations by region. The database includes 2,134 observations across all countries from 2000-2017. Disaggregation by gender is available for nearly all the data, with 2,105 country-year observations. Most data come from math scores, with 768 country-year observations, followed by science scores, with 690, and lastly reading scores, with 676. A third of scores are primary school scores, and two-thirds of observations are secondary school scores. Latin America
6 The World Bank Human Capital Index includes additional measures of human capital, such as combining measures of school enrollment with learning, as well as measures of survival and health (Kraay 2019).
and the Caribbean and Sub-Saharan Africa make up nearly a quarter of all available data. This
provides the largest representation of developing countries to date in a learning database.
Our methodology leverages the growth of international assessments to construct globally
comparable learning outcomes. These tests are derived from assessments conducted in the United
States since the 1960s such as the SAT and the National Assessment of Educational Progress
(NAEP). The tests are psychometrically designed, standardized assessments of cognitive skills.
Since the 1990s, international assessments have been conducted by organizations such as the
OECD. Two high-profile examples are PISA and TIMSS, which covered 71 and 65 countries, respectively, in 2015. These assessments enable credible global comparison of learning across countries and over
time. However, to date most analyses of these assessments focus mainly on OECD countries and
cover few developing countries.7 This limits the distribution of countries represented and has
implications for our understanding of the link between human capital and economic development.
We include 164 countries, two-thirds of which are developing countries, by linking international
assessments to their regional counterparts. Regional assessments cover much of Sub-Saharan
Africa and Latin America, but have often been excluded from international comparisons. We
employ methods to convert a regional test score to an international test score within subjects and
schooling levels (primary and secondary) and within adjacent years. By linking tests within the same testing round and at the disaggregated schooling and subject level, we minimize the likelihood that test differences are a function of time, proficiency, schooling level, or data availability, and maximize the likelihood that they reflect test difficulty. We then apply this conversion to a country that participates in a regional test but not an international test to produce a comparable score
(referred to as a Harmonized Learning Outcome (HLO) in the database). Means are also calculated
for disaggregated groups such as by gender.
The success of this approach hinges on three key assumptions. First, linked tests must capture
the same underlying population. This assumption is satisfied by using sample-based assessments
representative at the national level where a country participated in both a regional and international
assessment. This ensures that the underlying population tested is the same on average and we
capture differences between tests. Second, tests should measure similar proficiencies. To this end,
we link within subjects (math, reading and science) and schooling levels (primary and secondary)
7 Earlier studies include Barro and Lee (2001); Hanushek and Kimko (2000); Hanushek and Woessmann (2012a); Altinok, Angrist and Patrinos (2018).
to ensure overlap. Third, the linking function should capture differences between tests rather than
country-specific effects. This assumption is most likely to hold the larger the number of countries
which participate in a given pair of tests being linked. To maximize the likelihood this assumption
holds, we construct the linking function over the entire interval. This increases the sample size
used to link tests, increasing the likelihood that we capture test-specific rather than country-
specific differences. In fixing the linking function, we assume that the relationship between tests
stays constant across rounds. This assumption is reasonable since the mid-1990s when assessments
started to use a standardized approach and to link testing rounds with overlapping test items. A
related advantage of a linking function over a fixed interval is that it guarantees that any changes
in test scores over this interval are due to realized progress in learning rather than changing linking
functions between tests. Of note, every update of the database increases the number of countries
participating in a given pair of assessments. Thus, each update expands coverage and enhances the
reliability of all estimates by enabling construction of a more robust linking procedure.
We use multiple methods to link regional to international assessments. Our primary approach
uses regression when multiple countries participate in assessments being compared. When only
one country participates, we use linear linking. Both methods adjust test scores by a constant as
well as relative standard deviations across tests. These approaches build on a literature comparing
scores across different tests (Kolen and Brennan 2014) as well as more recent work linking
aggregate level scores across states in the United States (Reardon, Kalogrides and Ho 2019). In
the supplement we conduct a series of sensitivity tests, including conducting the conversion using
country-fixed effects or random draws of countries and time periods. We further explore additional
methods, such as mean linking and ratio conversions, highlighting the tradeoffs of each approach
and examining robustness across them. We find a .99 and above correlation for scores and relative
ranks across all robustness tests. We also compare our data to a smaller database using Item
Response Theory (IRT) where tests share common test items and find a .98 correlation.
The tests used are conducted at school. As a result, learning data might be affected by enrollment
patterns, and we advise users of the data to analyze learning outcomes alongside enrollment trends.
For example, average test scores might be driven by lower-performing students entering the system
rather than learning progress for those who were already in school. While this is a potential concern
when analyzing average scores, it is mitigated for a few reasons. First, primary enrollment rates
are relatively high, reaching 90 percent on average. Second, the direction of the bias is likely to
yield a conservative upper bound of learning in a given country. Since most countries at the bottom
of the distribution of learning are also those with relatively lower enrollments, it is unlikely new
school entrants will alter substantive conclusions – the lowest performing countries will be
revealed to be even lower performing. In addition, data at the primary level should be largely
unaffected, since at this level students are being taught basic skills, such as reading “the name of
the dog is Puppy.” Even if new students enter the system, these students should still be expected
to attain basic skills by the time they are tested in later primary school grades.
Through construction of a cross-test conversion between international and regional assessments
we quantify the difference between them, adjust for this difference, and then place learning
outcomes from regional assessments on a global scale. For a high-performance benchmark on the
upper end of the distribution, we use the TIMSS benchmark of 625. For the low-performance
benchmark on the lower end of the distribution, we use 300, which is the equivalent on the HLO
scale of the minimum benchmarks on regional assessments such as LLECE and PASEC. This
approach enables us to capture performance across the distribution of both international and
regional benchmarks. The detailed methodology is described in the supplement.
Figure I presents learning outcomes for 164 countries from 2000-2017. The supplement provides
a corresponding table for all countries and includes gender disaggregation. Figure I makes the
global coverage of the database immediately apparent, with regions typically excluded from international tests such as PISA and TIMSS now included. This includes the vast
majority of Sub-Saharan Africa, Latin America and the Caribbean, and South Asia – economies
with significant potential to close learning gaps for economic development.
A few trends emerge: advanced economies far outpace developing economies; Sub-Saharan Africa lags behind all regions; within Sub-Saharan Africa, a few countries such as Kenya and Tanzania lead, on par with many countries in Latin America; within Latin America, a few countries such as Chile are on par with European counterparts; the Middle East performs similarly to or worse than Latin America; many Asian countries outperform North American and European
counterparts, while a few South Asian countries such as India perform on par with Sub-Saharan
African countries.
[World map of average learning scores; color scale from 245 to 579; countries with no data shaded separately.]
FIGURE I
Average Learning (2000-2017)
Notes: Learning levels are calculated as an average across subjects and levels over the given period of time.
Source: Our HLO learning database.
We present a series of stylized facts in a first application of the database. Figure II contrasts years of schooling with learning for the most recent year in which data are available. A few trends emerge
from this graph. First, the exponential shape of the line of best fit indicates that there might be a
tipping point where countries realize more learning after about ten years of schooling. Second,
there is high variance in learning conditional on years of schooling, with some countries reaching
around 8 years of expected schooling yet less than 300 on learning, such as Nigeria, while others
with similar years of schooling reach almost 400 on learning, such as Tanzania. Moreover, this
graph reveals that many developing countries have successfully achieved high schooling, but have
not yet realized high rates of learning. A few notable examples with high schooling but low
learning include India, Brazil and Ghana. Brazil has 11.7 expected years of schooling, yet a
learning score of just 411. India has 10.2 expected years of schooling, yet a learning score of just
366. Ghana has 11.6 years of expected schooling yet only a 229 score on learning. In contrast,
some countries outperform the trend, with Vietnam achieving learning on the upper end of the
distribution with a score of 514 on learning even without achieving the upper end of expected
schooling. These trends reveal that schooling does not translate one-to-one into learning, with
significant margins to improve learning both by increasing schooling as well conditional on
schooling.
[Scatter plot of learning (200 to 600) against expected years of school (4 to 14), with labeled countries including Vietnam, Brazil, Tanzania, India, Nigeria, and Ghana.]
FIGURE II
Notes: We compare the average country learning in the latest available year from our database to expected years of schooling.
Source: Expected years of schooling are from the World Bank Human Capital Index based on data compiled by UNESCO; average learning
outcomes are from our database. Both measures take the latest data available.
Figure III explores the contrast between changes in schooling and learning over time. We
measure schooling using primary school enrollment rates. We compare this to our measure of
learning in primary school for the years 2000-2015. We use data for this period since it has the most
overlap of schooling and learning measures. We restrict our comparison to countries with at least
two data points over this time period for both enrollment and learning data in primary school to
maximize comparability and minimize bias due to changing country coverage over the time period.
We see a clear trend towards increased schooling, while learning progress is inconsistent and has in some cases stagnated. We observe limited learning progress even in regions where enrollments are
relatively constant, such as Latin America and the Caribbean (LAC). We explicitly condition
average learning on enrollment across countries and over time using multivariate regression and
show results in Figure IV. We find learning has stagnated on average, even when conditioned on
enrollment and when including country-fixed effects. Together, these data reveal a striking human
capital gap: students are increasingly in school but learning progress is inconsistent and limited.
This trend has been referred to as ‘the learning crisis’ (UNESCO 2017; World Bank 2018).
[Two-panel figure by region over time: primary enrollment rate (85 to 100 percent) and learning in primary school (scale 400 to 600).]
FIGURE III
Enrollment versus Learning (2000-2010), by Region
Notes: Primary enrollment and learning estimates are averaged within regions. LAC refers to Latin America and the Caribbean; MENA refers to
the Middle East and North Africa; and SSA refers to Sub-Saharan Africa. We have a total of 73 countries which have both learning and enrollment
data over this period and at least two data points for each.
Source: Primary enrollment rates are from Lee and Lee (2016) and are available until 2010. Learning estimates are taken from our database.
[Line plot of average learning conditional on enrollment over 2000-2010; learning scale 200 to 500.]
FIGURE IV
Learning Conditional on Enrollment (2000-2010)
Notes: Primary enrollment and learning estimates are averaged.
Source: Primary enrollment rates are from Lee and Lee (2016). Learning estimates are taken from our database.
Next, we explore gender gaps. We find gender gaps in learning are positive on average, with girls outperforming boys across nearly all regions, as shown by region in Figure V. This points in the opposite direction of the gender gap for years of schooling, which is negative on average.
[Two-panel bar chart of gender gaps (female minus male): enrollment gaps, roughly -10 to 5 percentage points, and learning gaps, roughly -50 to 70 points, for East Asia & Pacific, LAC, MENA, North America, and SSA.]
FIGURE V
Gender Gap – Enrollment versus Learning (2000-2010), By Region
Notes: LAC refers to Latin America and the Caribbean; MENA refers to the Middle East and North Africa; and SSA refers to Sub-Saharan Africa.
The gender gap takes the difference of female and male enrollment or learning. A positive gender gap indicates females do better and vice-versa.
Source: Primary enrollment rates are from Lee and Lee (2016). Learning estimates are taken from our database.
This might suggest that, as women increasingly join the labor market worldwide, girls who have attained schooling could realize large returns if they obtain skilled work; it might also partially explain why women have higher returns to schooling in cross-country Mincerian estimates (Psacharopoulos and Patrinos 2018). Of note, the flip in the gender gap might be due to selection. In regions where enrollment is low, only high-achievers might be taking assessments. This explanation is consistent with trends observed for the Middle East and North Africa. However, in Sub-Saharan Africa, where enrollment is second lowest, the learning gender gap is negative, as is the enrollment gap, indicating selection is unlikely to be the only driver. We present the contrast in gender gaps in schooling versus learning not as definitive, but rather to motivate further in-depth exploration, which we hope this database can enable.
We next examine the relationship between human capital and economic development on a global
scale. A development accounting literature studies the relative contribution of human capital to cross-country income differences. However, this question remains unsettled, in part due to
difficulties in measuring human capital. While direct measures of years of schooling exist, the
quality of schooling has often been inferred or measurement of quality has covered only a limited
sample of countries.
Several approaches have emerged to estimate the quality of schooling, including cross-country
differences in Mincerian wage returns (Hall and Jones 1999; Caselli 2005), immigrant returns
(Schoellman 2011), and cross-country skill premia (Caselli and Coleman 2006). However, these
approaches have encountered challenges such as the substitutability between skilled and unskilled
workers (Jones 2019). The challenges in measuring quality have contributed to substantial
variation in estimates of the role of human capital in accounting for cross-country income
differences, ranging from nearly all to potentially none (Jones 2014; Caselli and Ciccone 2019).
In this paper, we provide a more direct and reliable measure of the quality of schooling on a
global scale. We construct human capital stocks using our learning outcome data and produce
development accounting results in a motivating application. We follow the literature (see Caselli
(2005) for a review) and begin with a standard aggregate production function in its per-capita form
following Klenow and Rodriguez-Clare (1997):
y = A h (k/y)^(α/(1−α))
where y represents output per worker, k denotes the capital-labor ratio, h denotes the level of
human capital per capita, and A captures the residual, usually attributed to Total Factor
Productivity (TFP). Taking the log on both sides decomposes cross-country income differences
into three proximate sources: capital-output ratios, total factor productivity, and average human
capital. Since we are only interested in the share of income differences that can be explained by
variation in human capital, we provide decompositions of our baseline accounting results for the
human capital share. In Table II, we provide decompositions as direct analogies to Schoellman
(2011) who used inferred measures of quality as well as measures based on education quantity
(Hall and Jones 1999; Hendricks 2002). In Table III we include an additional decomposition:
[ln(h90) − ln(h10)] / [ln(y90) − ln(y10)]
which provides direct comparisons with a literature using various quality measures.
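For reference, taking logs of the production function above gives the three-way decomposition described in the text (a standard step; see Caselli 2005):

ln(y) = ln(A) + [α/(1−α)]·ln(k/y) + ln(h)

so that the human capital term ln(h) can be compared directly with the other proximate sources of cross-country income variation.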
To measure human capital, we extend the standard Mincer specification that weights education
by its micro labor market returns to consider learning as well as schooling:
h = e^(rS + wL)
where S is the quantity of schooling and L is a measure of learning, and r and w are their
respective returns. For years of schooling, we use Barro-Lee (2013) data. For learning measures,
we use the data presented in this paper. We assume rates of return based on the microeconomic
literature. We take the value r = .10 for the rate of return per school year, and w = .20 per standard
deviation increase in learning.8 The .20 value is based on U.S. data. However, we might expect
that returns to skills will be higher in developing countries, where the supply of skills is lower, as
is the case in the returns to schooling literature. Significant work has been done to identify this
parameter value. For the purpose of this paper, our intention is not to provide a final result, but
rather to motivate future use of the data in the development accounting literature. To
this end, we take parameter values as given and conduct sensitivity analyses with values w = .15
and w = .25. We include 131 countries in this development accounting exercise.9
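To make this construction concrete, the following minimal Python sketch computes the human capital stock under these parameter values. The country inputs are made up for illustration, the helper name human_capital is ours, and the 68-point cross-country standard deviation used to express HLO scores in standard deviation units follows the notes to Table II; how scores are centered before scaling is a detail we gloss over here.

import math

R_SCHOOL = 0.10  # return per year of schooling (Psacharopoulos and Patrinos 2004)
W_LEARN = 0.20   # return per SD of learning (Hanushek and Zhang 2009; U.S.-based)
HLO_SD = 68.0    # cross-country SD used to convert HLO points into SD units

def human_capital(years_schooling, hlo_score, r=R_SCHOOL, w=W_LEARN):
    """h = exp(r*S + w*L), with learning L expressed in standard deviation units."""
    return math.exp(r * years_schooling + w * hlo_score / HLO_SD)

# Illustrative (made-up) inputs: 10 years of schooling and an HLO score of 400,
# with sensitivity over the learning return as in the paper's robustness checks.
for w in (0.15, 0.20, 0.25):
    print(f"w = {w}: h = {human_capital(10.0, 400.0, w=w):.2f}")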
Table II shows our results in comparison to Schoellman (2011), Hall and Jones (1999), and Hendricks (2002). We find that when our human capital measure only captures quantity (w = 0), human capital accounts for roughly 9-26 percent of output per worker differences.
8 These values are based on Psacharopoulos and Patrinos (2004) and Hanushek and Zhang (2009), respectively.
9 This includes all countries which have both Barro-Lee data as well as learning data.
However, when we include quality, we find that this share goes up to 20–44 percent. These results
suggest that measuring human capital with quality substantially increases the role of human capital
in explaining cross-country output per worker differences.
In Table III, we show results using the following decomposition: [ln(h90) − ln(h10)] / [ln(y90) − ln(y10)]. We compare
results to the recent literature, which varies from nearly all (Jones 2014) to potentially none (Caselli and Ciccone 2019). We find that when including our measure of quality, the share of human capital varies between 46 and 58 percent. Together with Table II, our results suggest human capital accounts for between a fifth and around half of cross-country income differences – a middle ground in a
literature which ranges from zero to nearly all. These results are consistent with models of human
capital capturing the role of educated entrepreneurs and more comprehensive measures of human
capital including schooling, learning, and health (Gennaioli, La Porta, Lopez-de-Silanes, and
Shleifer 2013; Campbell and Üngör 2020). In this development accounting exercise, our central
contribution is not to provide a conclusive result, but rather to motivate the use of a direct measure
of schooling quality and thus a better measure of human capital.
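For readers who want to reproduce the accounting statistics with their own data, a minimal sketch of the two shares used in Tables III and IV follows. The inputs are hypothetical placeholders, and the variance-based share follows our reading of the notes to Table IV.

import numpy as np

# Hypothetical per-worker output and human capital stocks (placeholders).
y = np.array([2000.0, 5000.0, 12000.0, 30000.0, 60000.0])
h = np.array([1.8, 2.4, 3.1, 4.0, 4.9])

ln_y, ln_h = np.log(y), np.log(h)

# 90/10 decomposition: share of the log-income gap accounted for by human capital.
h90, h10 = np.percentile(ln_h, [90, 10])
y90, y10 = np.percentile(ln_y, [90, 10])
share_9010 = (h90 - h10) / (y90 - y10)

# Variance-based share, as defined in the notes to Table IV.
share_var = np.var(ln_h) / np.var(ln_y)

print(round(share_9010, 2), round(share_var, 2))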
In Table IV we further find that the average relationship between learning and income masks significant heterogeneity across countries. First, we find human capital explains between a tenth and a fifth of cross-country income differences among low-income countries but up to two-thirds among high-income countries. This suggests human capital plays a more central role as economies develop. Second, we find the income gradient is often as steep as or steeper than the quantity-to-quality gradient, more than tripling the contribution of human capital. We find even steeper gradients by region. For example, when measured by schooling, human capital accounts for 31 percent of
cross-country income differences in Europe and just 5 percent in sub-Saharan Africa. When we
include learning, this gap widens to 67 percent in Europe but just 8 percent in sub-Saharan Africa.
This substantial heterogeneity reveals the importance of including a global distribution of countries covering multiple stages of economic development to account for the role of human capital.
Finally, we compare our measure of human capital to alternatives in Table V. We find that our
measure of human capital has a stronger and more statistically significant association with growth
than human capital measures in prominent global databases such as the Penn World Tables (PWT)
and the Human Development Index (HDI). This is the case when comparing measures on their own in columns (1)-(4). Each variable is transformed to a log scale to compare percent changes or
elasticities in comparable units. We observe that a one percent change in learning is associated
with a 6.5 percent change in annual growth. In contrast, a one percent change in the other human
capital measures is associated with between a 1.6 and 3.3 percent change in annual growth. Moreover, the R-squared for the learning measure is highest at .275, relative to non-learning human capital measures, which range from .240 to .261. We further observe, when we include variables in the same multivariate regression, that the relationship between learning and growth remains statistically significant, at between 4.7 and 5.5 percent, whereas other human capital variables have a reduced and statistically insignificant association with growth. However, we observe the overall model fit improves when all measures are included, with an R-squared that increases, although only slightly, from .275 to between .281 and .298.
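As a hedged sketch of the specification behind these comparisons, the snippet below regresses average annual growth on log-transformed human capital measures, controlling for initial GDP per capita. All variable names and values are hypothetical placeholders, and we gloss over the sample exclusions described in the notes to Table V.

import numpy as np
import statsmodels.api as sm

# Hypothetical country-level inputs (placeholders, one element per country).
growth = np.array([0.020, 0.035, 0.015, 0.040, 0.025, 0.030])    # avg annual growth, 2000-2010
learning = np.array([360.0, 410.0, 480.0, 390.0, 450.0, 420.0])  # HLO learning measure
schooling = np.array([6.0, 8.0, 11.0, 7.0, 10.0, 9.0])           # years of schooling in 2000
gdp0 = np.array([2000.0, 1500.0, 8000.0, 1200.0, 5000.0, 3000.0])  # initial GDP per capita

# Log-transform regressors so coefficients read as comparable elasticities;
# initial GDP per capita enters as a control, as in the growth literature.
X = sm.add_constant(np.column_stack([np.log(learning), np.log(schooling), np.log(gdp0)]))
fit = sm.OLS(growth, X).fit()
print(fit.summary())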
In sum, we observe that our measure of human capital, individually and jointly, appears to have a stronger relationship with economic growth. This is likely because alternative human capital measures rely largely on years of schooling and might underestimate the role of human capital in economic development by omitting learning.10 However, their use remains standard practice, in
part since these data have the broadest coverage. By constructing learning data across 164
countries we fill a key gap: broad coverage over nearly two decades and a measure of human
capital with strong links to economic development.
IV. CONCLUSION
To understand and track human capital formation, a critical ingredient for development, there is
need for globally comparable measurement of learning. The growth of international standardized
achievement tests is a significant step in this direction. However, the countries that participate in these tests are often already rich. This limits the ability to track, compare, or
understand education patterns in developing countries – the countries that might have the most
potential to gain from human capital formation.
We bridge this gap, constructing a globally comparable database of 164 countries from 2000-2017, representing more than 98 percent of the global population; over two-thirds of the countries included are developing countries. We document a series of motivating stylized facts in a first
application of the data. First, we show that global learning progress has been relatively limited to
10 This does not mean that schooling is not useful, but rather that schooling might lead to growth largely through the channel of learning.
date and that there is a female learning premium. We also conduct a development
accounting exercise, providing a direct measure of school quality. We estimate that the role of
human capital in explaining cross-country income differences ranges from a fifth to half – a middle
ground in a wide-ranging literature. Moreover, we find that average estimates mask significant
heterogeneity by country income status and region. This reveals the importance of including
countries at all stages of economic development for understanding the role of human capital.
Finally, we show that our learning database provides a measure of human capital that is more
closely associated with economic growth than current education and human capital measures
included in the Penn World Tables 9.0 and the Human Development Index.
This database comes at a moment when a series of global efforts have been launched to measure
and track learning on a global scale. While recent modelling suggests the world is on track to
achieve universal primary enrollment by 2030 (Friedman et al. 2020), if learning continues to
stagnate this achievement will be partial. In recognition of this, the Sustainable Development Goals
(SDGs) include a focus on learning whereas the Millennium Development Goals focused largely
on schooling. In addition to the SDGs, another notable effort to measure and track learning on a
global scale is the World Bank’s Human Capital Index which compares countries’ levels of human
capital around the world (Kim 2018; Kraay 2019; World Bank 2019). This effort aims to
disseminate a measure of human capital that will encourage countries to invest in the education of
their people. The Human Capital Index includes learning outcomes from this database as one of
its core ingredients. The database in this paper will be updated regularly and made public to enable
these large-scale efforts, among others, and to advance our understanding and ability to track
human capital formation and potential links to economic development.
REFERENCES
Altinok, Nadir, Noam Angrist, and Harry A. Patrinos. Global Dataset on Education Quality 1965-
2015. World Bank Policy Research Working Paper No. 8314, 2018.
Barro, Robert J., and Jong-Wha Lee. “International data on educational attainment: updates and
implications.” Oxford Economic Papers 53, no. 3 (2001): 541-563.
Barro, Robert J., and Jong-Wha Lee. “A new data set of educational attainment in the world, 1950–2010.” Journal of Development Economics 104 (2013): 184-198.
Becker, Gary S. “Investment in human capital: A theoretical analysis.” Journal of Political
Economy 70, no. 5, Part 2 (1962): 9-49.
Campbell, Susanna G., and Murat Üngör. 2020. “Revisiting human capital and aggregate income
differences.” Economic Modelling.
Caselli, Francesco, and Antonio Ciccone. “The Human Capital Stock: A Generalized Approach. Comment.” American Economic Review 109, no. 3 (2019): 1155-1174.
Friedman, Joseph, Hunter York, Nicholas Graetz, Lauren Woyczynski, Joanna Whisnant, Simon I. Hay, and Emmanuela Gakidou. “Measuring and forecasting progress towards the education-related SDG targets.” Nature (2020): 1-4.
Gennaioli, Nicola, Rafael La Porta, Florencio Lopez-de-Silanes, and Andrei Shleifer. “Human capital and regional development.” The Quarterly Journal of Economics 128, no. 1 (2013): 105-164.
Hanushek, Eric A. and Dennis D. Kimko. “Schooling, Labor-force Quality, and the Growth of
Nations.” American Economic Review 90, no.5 (2000): 1184-1208.
Hanushek, Eric A., and Ludger Woessmann. “Do Better Schools Lead to More Growth? Cognitive
Skills, Economic Outcomes, and Causation.” Journal of Economic Growth 17, no.4 (2012a):
267-321.
Jones, Benjamin F. “The Human Capital Stock: A Generalized Approach.” American Economic
Review 104, no.11 (2014): 3752-77.
Jones, Benjamin F. “The Human Capital Stock: A Generalized Approach: Reply.” American
Economic Review 109, no. 3 (2019): 1175-95.
Kim, Jim Yong. “The Human Capital Gap: Getting Governments to Invest in People.” Foreign Affairs 97 (2018): 92.
Kolen, Michael J., and Robert L. Brennan. Nonequivalent groups: Linear methods. Test equating,
scaling, and linking. (2014): 103-142.
Kraay, Aart. “The World Bank Human Capital Index: A Guide.” The World Bank Research Observer 34, no. 1 (2019): 1-33.
Krueger, Alan B. and Mikael Lindahl. “Education for Growth: Why and For Whom?” Journal of
Economic Literature 39, no.4 (2001): 1101-1136.
Lange, Glenn-Marie, Quentin Wodon and Kevin Carey. The Changing Wealth of Nations 2018:
Building a Sustainable Future. The World Bank, 2018.
Lee, Jong-Wha, and Hanol Lee. “Human Capital in the Long Run.” Journal of Development
Economics 122 (2016): 147-169.
Lutz, Wolfgang, and K. C. Samir. “Global human capital: Integrating education and
population.” Science 333, no. 6042 (2011): 587-592.
Mankiw, N. Gregory, David Romer and David N. Weil. “A Contribution to the Empirics of
Economic Growth.” Quarterly Journal of Economics 107, no. 2 (1992): 407-437.
Mincer, Jacob. “Human Capital and Economic Growth.” Economics of Education Review 3, no.3
(1984): 195-205.
OECD. PISA 2015 Technical Report. OECD Publishing, 2015.
Psacharopoulos, George, and Harry Anthony Patrinos. “Returns to investment in education: a decennial review of the global literature.” Education Economics 26, no. 5 (2018): 445-458.
Pritchett, Lant. “Does Learning to Add Up Add Up? The Returns to Schooling in Aggregate
Data.” Handbook of the Economics of Education 1 (2006): 635-695.
Pritchett, Lant. The rebirth of education: Schooling ain't learning. CGD Books, 2013.
Reardon, Sean F., Demetra Kalogrides, and Andrew D. Ho. “Validation methods for aggregate-
level test scale linking: A case study mapping school district test score distributions to a common
scale.” Journal of Educational and Behavioral Statistics (2019).
Smith, Adam. An Inquiry into the Nature and Causes of the Wealth of Nations (Volume One).
London: printed for W. Strahan; and T. Cadell, 1776.
UNESCO. More Than One-half of Children and Adolescents are not Learning Worldwide. UIS
Fact Sheet No. 46, 2017.
World Bank. World Development Report 2018: Learning to Realize Education’s Promise.
Washington, DC, 2018.
World Bank. World Development Report 2019: The Changing Nature of Work. Washington, DC,
2019.
TABLE I
COUNTRY-YEAR OBSERVATIONS BY DISAGGREGATION AND REGION
Region                       Total  Female  Male  Math  Reading  Science  Primary  Secondary
East Asia & Pacific            360     358   358   130      100      130       99        261
Europe & Central Asia          984     984   984   343      298      343      255        729
Latin America & Caribbean      249     249   249    84       88       77      111        138
Middle East & North Africa     271     271   271   106       60      105       90        181
North America                   62      62    62    22       18       22       17         45
South Asia                      10       9     9     1        8        1        7          3
Sub-Saharan Africa             198     172   172    82      104       12      175         23
Total                         2134    2105  2105   768      676      690      754       1380
Notes: This table presents coverage for country-year observations by region. The database includes 2,134 observations across all countries from 2000-2017. Columns report total observations and observations disaggregated by gender (female, male), subject (math, reading, science), and level (primary, secondary).
TABLE II
BASELINE ACCOUNTING RESULTS AND COMPARISON TO QUANTITY LITERATURE
Notes: y: real output per worker (2000-2010); h: human capital based on school and learning estimates (2000-2010). We convert HLO units into
standard deviations based on a cross-country standard deviation of 68 to correspond to the returns to schooling parameter value which is given per
standard deviation.
Source: Schooling data are from Barro-Lee (2013). GDP data are from PWT 9.0. Learning estimates are from our database. Literature estimates
are derived from the referenced papers.
TABLE III
BASELINE ACCOUNTING RESULTS AND COMPARISON TO QUALITY LITERATURE
[ln(h90) − ln(h10)] / [ln(y90) − ln(y10)]:   0.31   0.46   0.51   0.58   Nearly All   0.51   0.62   Potentially None
Notes: y: real output per worker (2000-2010); h: human capital based on school and learning estimates (2000-2010). We convert HLO units into
standard deviations based on a cross-country standard deviation of 68 to correspond to the returns to schooling parameter value which is given per
standard deviation. We assume rates of return based on the microeconomic literature. We take the value r = .10 for the rate of return per school
year, and w = .20 per standard deviation increase in learning. The .20 value is based on U.S. data. However, we might expect that returns to skills
will be higher in developing countries, where the supply of skills is lower, as is the case in the returns to schooling literature. Significant work has
been done to identify this parameter value. For the purpose of this paper, our intention is not to provide a final result, but rather to motivate future use of the data in the development accounting literature. To this end, we take parameter values as given and conduct sensitivity analyses
with values w = .15 and w = .25. When w = 0, our accounting de facto only includes schooling; for any value w > 0, we include learning as well as
schooling. We include 131 countries in this development accounting exercise.
Source: Schooling data are from Barro-Lee (2013). GDP data are from PWT 9.0. Learning estimates are from our database. Literature estimates
are derived from the referenced papers.
TABLE IV
HUMAN CAPITAL SHARE BY INCOME STATUS AND REGION
Notes: y: real output per worker (2000-2010); h: human capital based on school and learning estimates (2000-2010). We convert HLO units into
standard deviations based on a cross-country standard deviation of 68 to correspond to the returns to schooling parameter value which is given per
standard deviation. We assume rates of return based on the microeconomic literature. We take the value r = .10 for the rate of return per school
year, and w = .20 per standard deviation increase in learning. The .20 value is based on U.S. data. However, we might expect that returns to skills
will be higher in developing countries, where the supply of skills is lower, as is the case in the returns to schooling literature. Significant work has
been done to identify this parameter value. For the purpose of this paper, our intention is not to provide a final result, but rather to motivate future use of the data in the development accounting literature. To this end, we take parameter values as given and conduct sensitivity analyses
with values w = .15 and w = .25. When w = 0, our accounting de facto only includes schooling; for any value w > 0, we include learning as well as
!"#[%&'())]
schooling. We include 131 countries in this development accounting exercise. We define the human capital share as .
!"#[%&'(,)]
Source: Schooling data are from Barro-Lee (2013). GDP data are from PWT 9.0. Learning estimates are from our database. Literature estimates
are derived from the referenced papers.
TABLE V
HUMAN CAPITAL AND ECONOMIC DEVELOPMENT – COMPARING MEASURES
Notes: Dependent variable: annual growth rates averaged across 2000-2010. Human Capital - Schooling refers to estimates in 2000, the beginning
of the time period. Human Capital – Penn World Tables refers to the measure of human capital in the Penn World Tables. Human Capital – Learning
refers to the measure of human capital in this database from 2000 onwards. Human Capital - HDI Education Index refers to the measure of education
included in the HDI in the year 2000. Results exclude countries in civil war, inflation crises and with rents from natural resources above 15 percent.
All independent variables are transformed to log units to derive comparable elasticities. We control for initial GDP per capita, measured at the beginning of the period in the year 2000, in all specifications, following standard practice in the growth literature.
Source: Schooling data are from Barro-Lee (2013). HDI data are from the Human Development Index Education Index. GDP and human capital
data are from PWT 9.0. Learning estimates are from our database.
SUPPLEMENT
We include 164 countries, two-thirds of which are developing countries, by linking international
assessments to their regional counterparts. Regional assessments cover much of Sub-Saharan
Africa and Latin America. Thus, through construction of a linking procedure between international
and regional assessments we quantify the difference between them, adjust for this difference, and
then place learning outcomes from regional assessments on a global scale.
The central intuition behind the construction of globally comparable learning outcomes is the
production of a linking function between international and regional assessments. This function can
be produced for countries that participate in a given pair of assessments and captures the difference
in difficulty between the two assessments. This linking function can then be used to place scores
for countries that only participate in regional assessments on the international scale. This enables
construction of globally comparable learning outcomes.
We use multiple methods to produce globally comparable scores. Our primary approach uses
regression when multiple countries participate in assessments being compared. When only one
country participates, we use linear linking. Both methods adjust test scores by a constant as well
as relative standard deviations across tests. These approaches build on a literature comparing
scores across different tests (Kolen and Brennan 2014) as well as more recent work linking
aggregate level scores across states in the United States (Reardon, Kalogrides and Ho 2019).
The conversion can be implemented by regressing mean scores from countries that partake in both a regional and an international assessment to derive α and β and produce a linking function between
assessments:
μ_Yi = α + β μ_Xi + ε_i
where 𝜇 denotes the mean scores, 𝑋 is a regional assessment, 𝑌 is an international assessment and
i denotes countries that have scores on both assessments. We can then convert scores from
countries that only participate in regional assessment X onto an international scale Y using 𝛼 and
𝛽.
The success of this approach hinges on three key assumptions. First, linked tests must capture the
same underlying population. This assumption is satisfied by using sample-based assessments
representative at the national level where a country participated in both a regional and international
assessment. This ensures that the underlying population tested is the same on average and we
capture differences between tests.
Second, tests should measure similar proficiencies. To this end, we link within subjects (math,
reading and science) and schooling levels (primary and secondary) to ensure overlap.
Third, the linking function should capture differences between tests rather than country-specific
effects. This assumption is most likely to hold the larger the number of countries which participate
in a given pair of tests being linked. To ensure this last assumption holds, we use the same linking
parameters over the entire interval. This increases the sample size used to link tests, increasing the
likelihood that we capture test-specific rather than country-specific differences. In fixing the
linking function over time, we assume that the relationship between tests stays constant across
rounds. This assumption is reasonable since the mid-1990s when assessments started to use a
standardized approach and to link testing rounds with overlapping test items. A related advantage
of fixing the linking function is that it guarantees that any changes in test scores over this interval
are due to realized progress in learning rather than a changing relationship between tests. Of note,
every update of the database increases the number of countries participating in a given pair of
assessments. Thus, each update both expands coverage and enhances the reliability of all
estimates by enabling construction of a more robust linking procedure.
Below we formalize the level of precision needed to satisfy the above assumptions. We produce a linking function within subjects and schooling levels (primary and secondary) from test X to test Y:

μ_Yi^(s,l) = α^(s,l) + β^(s,l) μ_Xi^(s,l) + ε_i^(s,l)

where i is a country in the set of countries that participate in both tests X and Y in a given subject s,
and schooling level l. Scores from test X and Y are further matched by testing round. We consider
tests to be in the same round if they are at most five years apart, and we optimize to have the rounds as tight as possible. Most often the time window is one to two years. In some cases, this extends to three to five years apart. In a few exceptions, we average adjacent years across one another. This minimizes the likelihood that test differences are a function of time, proficiency, schooling level, or data availability, and maximizes the likelihood that they reflect test difficulty.
We present a simplified and illustrative example. In 2006 Colombia and El Salvador participated
in the regional test in Latin America and the Caribbean called LLECE as well as an international
test, TIMSS. Thus, they have primary science scores on both assessments representative at the
national level. In 2013, Chile and Honduras participated in both assessments and have primary
science scores on both assessments representative at the national level. A regression of TIMSS on LLECE scores for this set of countries at the primary level in science yields an estimate β of .816 and a constant adjustment α of 15.824. We can then use this estimated relationship to convert
scores from countries which only took part in regional assessments to an international scale. For
example, Argentina has a score of 501.32 in primary science in 2013 on LLECE and would thus
have an equivalent international score of around 425.
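To make the conversion step concrete, a minimal sketch in Python follows. The linked-country scores used to fit the regression are placeholders rather than actual LLECE and TIMSS means, and the helper name hlo_convert is ours; the final line applies the α = 15.824 and β = .816 reported above to Argentina's LLECE score.

import numpy as np

# Mean scores for countries observed on both tests (placeholders for illustration):
llece = np.array([480.0, 495.0, 505.0, 520.0])  # regional test X
timss = np.array([407.0, 419.0, 428.0, 441.0])  # international test Y, same countries

# Regression linking: fit Y = alpha + beta * X by least squares.
beta, alpha = np.polyfit(llece, timss, deg=1)  # returns [slope, intercept]

def hlo_convert(regional_score, alpha, beta):
    """Place a regional-only country's score on the international (HLO) scale."""
    return alpha + beta * regional_score

# Using the coefficients reported in the example (alpha = 15.824, beta = .816),
# Argentina's 2013 LLECE primary science score of 501.32 maps to about 425.
print(hlo_convert(501.32, alpha=15.824, beta=0.816))  # ~424.9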
We can also use an alternative approach called linear linking when only one country participates in pairwise assessments. This approach uses information on within-country standard deviations and mean scores to estimate α and β as follows:

Y = α + βX, with β = σ_Y / σ_X and α = μ_Y − β μ_X

so that linked scores match the means and standard deviations of the two tests.
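A corresponding sketch of the linear (mean-sigma) linking computation, with made-up moments for the single linked country:

# Linear linking from a single linked country: match means and standard deviations.
# The moments below are illustrative placeholders, not actual assessment values.
mu_x, sd_x = 500.0, 100.0  # regional test: country mean and within-country SD
mu_y, sd_y = 430.0, 80.0   # international test: country mean and within-country SD

beta = sd_y / sd_x           # scale adjustment from relative standard deviations
alpha = mu_y - beta * mu_x   # constant (location) adjustment

# Convert a regional-only score of 520 onto the international scale.
print(alpha + beta * 520.0)  # 446.0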
By producing a linking function and placing regional scores on an international scale, we are able
to compare learning outcomes on a global scale. On this scale, 625 represents advanced attainment
and 300 represents minimum attainment. This interpretation is derived by taking established
benchmarks already used on international and regional assessments. For the high-performance
benchmark on the upper end of the distribution, we use the TIMSS benchmark of 625. For the low-
performance benchmark on the lower end of the distribution, we use 300, which is the equivalent
on the HLO scale of the minimum benchmarks on regional assessments such as LLECE and
PASEC. This approach enables us to capture performance across the distribution and accounts for
floor and ceiling effects that would be introduced by taking either international or regional
benchmarks on both ends of the distribution. Supplement Table VI includes descriptions of each
assessment to enable derivation of linking functions.
B. Sensitivity Tests
We conduct a series of sensitivity tests. First, we examine the degree to which linking functions
are stable across countries using two approaches. For tests where we have multiple participating
countries and for which we use the regression method we can also produce linking functions using
country-fixed effects.
SUPPLEMENT TABLE I
SCORES USING REGRESSION WITH AND WITHOUT COUNTRY-FIXED EFFECTS
EGRA reading LLECE math LLECE reading LLECE science PISA math PISA science
Notes: HLO references Harmonized Learning Outcomes produced with a linking function without country-fixed effects. HLO – Country Fixed Effects refers to HLO scores produced with a linking function derived from a regression which includes country-fixed effects. We only compute scores using the regression method for LLECE, EGRA and PISA since SACMEQ and PASEC only have a single country used to make score comparisons.
Supplement Table I compares scores with and without country-fixed effects linking methods. We
observe differences in scores ranging from 10 to 20 points. These differences are smaller for tests
with more overlapping countries, with the smallest differences for PISA, followed by LLECE and
then EGRA. This supports the approach of fixing the linking function across countries and over
time, since more overlapping countries and time periods maximize the likelihood that linking
parameters are a function of relative difficulty rather than country factors. Overall, while we
observe differences in scores, they are relatively small, and we find a perfect correlation among
scores within test and subject.
We further test the robustness of linking by conducting a random draw of half of all available
countries and time periods per test-subject-level to produce the linking function using linear
linking for consistency of method. Supplement Figure I shows a scatter plot of scores with all
countries and time periods relative to linking functions using a random sample. Supplement Table
II quantifies these differences.
We find average point differences of less than 1 point for PISA, followed by 7 points for EGRA
and LLECE, and 20-25 points for SACMEQ and PASEC. This variation is consistent with country-
fixed effects results suggesting smaller differences where there is more country overlap and data
availability. EGRA and LLECE converge similarly to PISA with the difference in scores falling
within standard error margins of 2 to 7. PASEC and SACMEQ score differences vary more widely,
necessitating caution when interpreting precise scores. Overall, we find consistently high correlations above .95, indicating that while scores are not identical, they change in consistent directions. This indicates relative rankings and country groupings are preserved.
[Scatter plots by test-subject-level (EGRA reading primary; LLECE math, reading, and science primary; PASEC math and reading primary; PISA math and science secondary): HLO scores computed from all countries and time periods (y-axis, 200 to 600) against HLO scores from a random subset (x-axis, 200 to 600).]
SUPPLEMENT FIGURE I:
Learning Scores with All Countries and Time Periods vs. Random Subset
Notes: We compare Harmonized Learning Outcomes (HLO) using all country and time periods over the fixed linking period with HLO scores computed using a random subset of half of available countries for each test-subject-level.
SUPPLEMENT TABLE II:
LEARNING SCORES WITH ALL COUNTRY AND TIME PERIODS VS. RANDOM SUBSET
Notes: We compare Harmonized Learning Outcomes (HLO) using all country and time periods over the fixed linking function period with HLO scores computed using a random subset of half of available countries for each test-subject-level.
Next, we explicitly account for linking errors by including measures of uncertainty to quantify the
degree of confidence around our estimates by test. We capture two sources of uncertainty: scores
on the original test and uncertainty in the estimation of linking parameters across tests. We
calculate the variance by bootstrapping. We consider each average country score on a given
subject, test, and schooling level as a random variable with a mean – the score itself – and a
standard deviation which captures the sampling variation across students. This distribution of
scores is asymptotically normal by virtue of the central limit theorem. We take 1,000 draws from
the distribution of subject-level average test scores for each testing regime. We do this as a
computational shortcut, rather than bootstrapping subsamples of students from each test. We
derive the linking function and scores from each bootstrapped sample. We take the 2.5th and 97.5th
percentiles of the distribution and use this to construct lower and upper bounds of uncertainty.
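The following sketch illustrates this bootstrap, assuming linear linking and hypothetical country means and standard errors standing in for the real test data:

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_bounds(src_mean, src_se, ref_mean, ref_se, draws=1000):
    """Draw country means from their (asymptotically normal) sampling
    distributions, re-derive the linear linking function on each draw,
    and return 2.5th/97.5th percentile bounds on the linked scores."""
    linked = np.empty((draws, len(src_mean)))
    for i in range(draws):
        s = rng.normal(src_mean, src_se)     # redraw source-test means
        r = rng.normal(ref_mean, ref_se)     # redraw reference-test means
        slope = r.std(ddof=1) / s.std(ddof=1)
        intercept = r.mean() - slope * s.mean()
        linked[i] = intercept + slope * src_mean
    return np.percentile(linked, [2.5, 97.5], axis=0)

# synthetic inputs: 12 overlapping countries with standard errors of ~5 points
src_mean = rng.normal(470, 85, 12)
ref_mean = 40 + 0.95 * src_mean + rng.normal(0, 15, 12)
lower, upper = bootstrap_bounds(src_mean, np.full(12, 5.0),
                                ref_mean, np.full(12, 5.0))
```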
We find small uncertainty intervals overall, as shown in Supplement Figure II, with an average of 3.5 points, ranging from 1 to 11 points. This is consistent with original standard errors from
each respective testing regime. Consistent with sensitivity tests, we find larger uncertainty for our
estimates relative to original scores when testing regimes have fewer countries participating in a
given pair of tests. Supplement Figure II decomposes standard errors due to within-test sampling
variation as well as variance in the linking function. This figure shows that for tests where there is
no need to produce a linking function, or many pair-wise countries which we can use to produce
this linking, the final standard errors remain similar to standard errors from the original test. For
tests with fewer pair-wise countries, the linking has more uncertainty, such as PASEC, where the
average standard error increases from 5.3 on the original test to 7.5 for the HLO. By quantifying
the degree of uncertainty, we can more reliably bound our estimates.
[Bar chart: standard errors by test (EGRA, LLECE, PASEC, PIRLS, PISA, SACMEQ, TIMSS); x-axis 0 to 15 points.]
SUPPLEMENT FIGURE II
Standard Errors by Test
Notes: We decompose standard errors on the overall HLO score versus the original test. This reveals how much uncertainty derives from sampling variation in the original test relative to variation introduced by the production of the linking function.
Finally, we supplement the primary approaches to link regional to international assessments with
alternative linking methods. We compare results with learning scores using Item-Response
Theory, often considered one of the most reliable methods in the psychometric literature (Kolen
and Brennan 2014). IRT models the probability a given pupil answers a given test item correctly
as a function of pupil and item-specific characteristics (Mislevy, Beaton, Kaplan, and Sheehan
1992; Holland and Dorans 2006). This methodology is used to construct the underlying tests we
use. However, to use it to compare learning across assessments would require enough overlap in
the test items across each assessment. This is not true for a large enough set of tests and time
periods to create a globally comparable panel data set. For example, TIMSS 1995 and SACMEQ
2000 included overlapping math items, but only had three items to make this comparison. When
this overlap is small, standard maximum likelihood estimates will reflect both true variance and
measurement error, overstating the variance in the test score distribution. Das and Zajonc (2010)
elaborate on the various challenges of estimating IRT parameters with limited item-specific
overlap.
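For intuition, the sketch below implements the item-response function described above under the common two-parameter logistic specification; this is an illustrative modeling choice on our part, and the parameter values are hypothetical rather than estimated from any of the underlying assessments.

```python
import numpy as np

def p_correct(theta, a, b):
    """Two-parameter logistic IRT model: probability that a pupil with
    ability theta answers correctly an item with discrimination a and
    difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

print(p_correct(theta=1.0, a=1.2, b=0.0))   # abler pupil: ~0.77
print(p_correct(theta=-1.0, a=1.2, b=0.0))  # weaker pupil: ~0.23
```

Estimating the item parameters a and b across two assessments is what requires the item-level overlap discussed above; with only a handful of common items, those estimates are too noisy to support a global link.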
While IRT might not be a reliable approach when there is limited item-by-item overlap, we conduct
a few comparisons where overlap is larger. We compare our results to the Linking International
Comparative Student Assessment (LINCS) project which uses IRT methods and has significant
overlap in items for a subset of international studies focused on reading at primary school
(Steinmann, Strietholt, and Bos 2014). We find that our database can produce similar results to
IRT methods for average scores where there is overlap with a correlation coefficient above 0.98
for primary reading scores.
SUPPLEMENT FIGURE III
HLO vs. LINCS Scores, Primary Reading
[Scatter plot; axis residue removed.]
Notes: We compare our data to the Linking International Comparative Student Assessment (LINCS) project which uses Item-Response Theory
(IRT) methods to link test score data. IRT methods are used to create scores for the underlying international and regional assessments used.
However, to compare across assessments, IRT would require enough overlap in the test items. This is not true for a significant enough set of tests
and time intervals to create a globally comparable panel data set. However, for a subset of tests, this is the case, such as a series of international
studies focused on reading at primary school. The LINCS project produces scores on a globally comparable scale using this subset of data
(Steinmann, Strietholt, and Bos 2014). We compare results on the HLO as well as the LINCS data for primary school reading on average between
2000-2010.
Supplement Figure III compares scores for the same subject (reading), schooling level (primary)
and time period (2000-2010), with a correlation of .984. This comparison indicates that as we
expand coverage to 164 countries, we maintain high consistency with alternative measures where
there is overlap.
In a final check, we compare our primary approach, using regression and linear linking, with two alternative approaches and assess robustness across them. First, we use simple mean linking, which introduces a constant additive adjustment between tests matched by round, averaged across testing rounds. This approach assumes constant standard deviations across tests. Second,
we use a ratio between test means and also take an average across rounds. This approach assumes
a constant scalar adjustment 𝜆 between means and standard deviations across tests. The ratio
approach is salient and intuitive for policymakers. However, a potential challenge in applying
ratios is that they are in principle sensitive to the scale of the test. For example, given that score scales have no absolute zero, one could in theory add 300 points to the mean of each test and preserve the interval properties of the scale, yet this would alter the conversion ratios (i.e., the exchange rates). We address
this potential issue by having strict inclusion criteria for the underlying tests: they have a uniform
scale with a mean of 500 and standard deviation of 100. “Exchange rates” are derived using the
same scale and applied on the same scale. Thus, while in theory changing score scales might bias
results, by design this is not the case. This increases the likelihood we capture differences in test
difficulty rather than arbitrary scaling variation.
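A small sketch with synthetic country means illustrates both alternatives and why a uniform scale matters for the ratio approach; all values here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
src = rng.normal(490, 95, 20)      # synthetic source-test country means
ref = rng.normal(505, 100, 20)     # synthetic reference-test country means

offset = ref.mean() - src.mean()   # mean linking: constant additive adjustment
rate = ref.mean() / src.mean()     # ratio linking: the "exchange rate" lambda

# shifting both scales by +300 preserves interval properties and the additive
# offset, but changes the ratio -- hence the need for a uniform 500/100 scale
offset_shift = (ref + 300).mean() - (src + 300).mean()
rate_shift = (ref + 300).mean() / (src + 300).mean()
print(np.isclose(offset, offset_shift))   # True: offset is scale-shift invariant
print(rate, rate_shift)                   # different values
```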
[Density plot of HLO scores (300 to 600) under mean, ratio, regression and linear linking; correlation across methods: .995 to 1.]
SUPPLEMENT FIGURE IV
Comparison of Scores Across Linking Methods
Notes: We compute Harmonized Learning Outcomes scores using multiple methods including regression, linear,
mean and ratio linking. This figure compares scores using a density plot.
Supplement Figure IV shows how scores compare across methods. Overall, we find a correlation
of .995 and above, indicating high levels of robustness. Supplement Figure V below breaks down score distributions by test and method. This reveals similar patterns, with testing regimes with more overlapping countries showing more consistent scores across methods. Taken together, these results reveal overall robustness. A caveat is that scores from the regional assessments, PASEC and SACMEQ in particular, should be interpreted carefully, with less focus on precise scores and more on relative ranks and country groupings.
Over time, as more countries participate in more assessments, we anticipate the linking functions
used to produce harmonized scores will become increasingly robust. The approach outlined here
produces a first set of global comparisons, demonstrates aggregate reliability, quantifies uncertainty to bound estimates, and provides a foundation for ever more robust data and comparisons as more countries participate in regional and international assessments.
[Density plots by test; legend: mean, ratio, regression, linear; score range 300 to 600.]
SUPPLEMENT FIGURE V
Comparison of Scores Across Linking Methods by Test
Notes: We compute Harmonized Learning Outcomes scores using multiple methods including regression, linear, mean
and ratio linking functions. This figure compares scores using a density plot. We only compute scores using the
regression method for LLECE, EGRA and PISA since SACMEQ and PASEC only have a single country used to make
score comparisons.
C. Potential Limitations
A potential limitation is how representative the data are of the total stock of cognitive skills in a given country. While the tests used are nationally representative, they are conducted in schools. Learning data might therefore be affected by enrollment patterns, and we advise users of the data to analyze learning outcomes alongside enrollment trends. For example, as marginal students enter
the schooling system, average test scores might be driven by selection rather than true learning
progress. While this is a potential concern, it is mitigated for a few reasons. First, primary
enrollment rates are relatively high, reaching 90 percent on average, and above 75 percent even in
the furthest behind regions, such as Sub-Saharan Africa. Second, the direction of the bias is likely
to yield a conservative upper bound of learning in a given country. If all students enrolled, the
average test score would be even lower, since the marginal students would pull the average down.
Since most countries at the bottom of the distribution of learning are also those with relatively lower enrollments, it is unlikely this will alter substantive conclusions: the lowest performing countries will be revealed to be even lower performing. In addition, data at the primary level should
be largely unaffected, since at this level students are being taught basic skills, such as reading “the
name of the dog is Puppy.” Thus, even if marginal students enter the system, these students should
still be expected to attain basic skills by the time they are tested in later primary school grades. Of
note, in future work, we aim to include household-based learning data to sign and quantify the
degree of selection present in school-based testing. However, current household-based data are limited and not yet comparable across a significant number of countries.
A second limitation regards data availability. While this is the largest learning outcomes database
to date, data are still sparse for some countries. This introduces bias if data availability is correlated
with education quality or progress. For example, if countries that perform worse have data only in
later years (because they were later to introduce assessments), their average score will likely be biased upwards, as the test scores reflect more recent testing, not stronger performance. Since we provide year-by-year scores, this can be accounted for.
Relatedly, when averaging data across subjects, levels and over time, there is a possibility that
averages reflect the availability of data rather than learning gains. For example, consider a country with a math score of 500 in 2000 that jumps to 550 in 2005. If this country added reading in 2005 and scored 450, the average score across subjects in 2005 would be 500, suggesting no learning progress, since average scores would be 500 in both years. However, an apples-to-apples comparison in math shows learning gains from 500 to 550. To address this issue, we construct disaggregated measures by subject and schooling level as well as aggregated ones. This enables analyses at each level, considering the trade-offs.
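The worked example from the text, in code form, to make the composition effect concrete:

```python
import pandas as pd

# the example from the text: math tested in 2000 and 2005, reading added in 2005
df = pd.DataFrame({
    "year": [2000, 2005, 2005],
    "subject": ["math", "math", "reading"],
    "score": [500, 550, 450],
})

# pooled average masks progress: 500 in both years
print(df.groupby("year")["score"].mean())

# the subject-level series recovers the 50-point math gain
print(df[df["subject"] == "math"].set_index("year")["score"])
```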
A point of emphasis is that while learning measures human capital better than prior proxies, such
as enrollment, learning does not capture the concept of human capital in its totality. Moreover,
assessments do not capture only cognitive skills. For example, recent evidence suggests test scores pick up differential effort as well as cognitive ability (Gneezy, List, Livingston, Sadoff, Qin, and Xu 2017). We use learning outcomes in this paper with these caveats in mind.
Several statistics demonstrate the coverage and detail of the database. Supplement Table III
presents country-subject-level observations by year. The data are spread over time, slightly
weighted towards recent years since countries are increasingly participating in assessments. A related feature of the data is a large influx of observations in particular testing years. This is more prevalent for developing regions, which participate in assessments sporadically.
SUPPLEMENT TABLE III
COUNTRY-SUBJECT-LEVEL OBSERVATIONS BY YEAR
Supplement Figure VI summarizes learning for selected countries for the last decade. We observe
a few interesting case studies. Russia outperforms the United States. Chile outperforms Eastern
European countries such as Georgia. Saudi Arabia places near the bottom, outperforming only
African countries. The gap between Morocco and Singapore is substantial. Singapore and Finland
have low variation due to a potential plateau on the upper end of performance. Rwanda has low
variation due to limited data. Russia has high variation due to improving learning, whereas South
Africa has high variation due to declining learning.
[Bar chart of average learning scores for: Rwanda, Morocco, Egypt, South Africa, Saudi Arabia, Georgia, Chile, Israel, France, United States of America, Germany, Russian Federation, Finland, Hong Kong SAR China, Singapore.]
SUPPLEMENT FIGURE VI
Average Learning for Selected Countries
Notes: Average learning is calculated across subjects and schooling levels over the given period of time.
Source: Our learning outcomes database.
Supplement Figure VII plots average learning for each country by the log of their GDP per capita.
This graph illuminates cases where countries have managed to improve learning despite a lack of resources, as well as cases where countries have the resources to invest in as-yet-unrealized learning potential. Former or current centrally planned economies display better learning outcomes than
their income would suggest, such as Singapore, Poland, Bulgaria, Cuba and Vietnam. Countries
in the Middle East and Africa reach lower learning levels than predicted by income, such as Qatar,
Kuwait, United Arab Emirates, South Africa, Nigeria and Ghana. We also highlight large
developing countries: India, China, Mexico, and Brazil. China outperforms its counterparts; Mexico, India and Brazil perform slightly below where their income would predict; and South Africa trails far behind.
SUPPLEMENT FIGURE VII
Average Learning by Log GDP per Capita
[Scatter plot; y-axis: average learning score, 200 to 600; x-axis: log GDP per capita, 6 to 12. Labeled points include Singapore, Vietnam, Cuba, Poland, Bulgaria, Qatar, Kuwait, India, South Africa, Ghana and Nigeria; legend distinguishes large economies, high-performers and low-performers.]
Standardized, psychometrically robust and relatively consistent ISATs emerged in the mid-1990s. Below we describe the major ISATs used in this database.
TIMSS. The Trends in International Mathematics and Science Study (TIMSS) is conducted by the
IEA. Six TIMSS rounds have been held to date in math and science, covering grades 4 and 8. The first, conducted in 1995, covered 45 national educational systems and three groups of
students.11 The second round covered 38 educational systems in 1999, examining pupils from
secondary education (grade 8). The third round covered 50 educational systems in 2003, focusing
11 IEA assessments define populations relative to specific grades, while PISA assessments focus on the age of pupils.
In IEA studies, three different groups of pupils were generally assessed: pupils from grade 4, grade 8 and from the
last grade of secondary education. In 1995, two adjacent grades were tested in both primary (3-4) and secondary
schools (7-8). To obtain comparable trends, we restricted the sample to grades 4 and 8. Some Canadian provinces and
states in the United States of America have occasionally taken part in the IEA surveys.
on both primary and secondary education (grades 4 and 8). In 2007, the fourth survey covered
grades 4 and 8 and more than 66 educational systems. In 2011, the survey covered 77 educational
systems across grades 4 and 8. The last round was performed in 2015 and covered 63
countries/areas. The precise content of the questionnaires varies but remains systematic across
countries.
PIRLS. The Progress in International Reading Literacy Study (PIRLS) survey is also conducted by
the IEA. PIRLS tests grade 4 pupils in primary school on reading proficiency. Four rounds
of PIRLS have been held to date in 2001, 2006, 2011 and 2016.
In 2006, PIRLS included 41 countries/areas, two of which were African countries (Morocco and South Africa), four of which were lower middle-income countries (Georgia, Indonesia, Moldova, Morocco) and eight of which were upper middle-income countries (Bulgaria, Islamic Republic of Iran, Lithuania, the Former Yugoslav Republic of Macedonia, Romania, Russian Federation, South Africa). The 2011 round of PIRLS
was carried out alongside TIMSS and included 60 countries/areas. The newest round of PIRLS in
2016 includes 50 countries.
PISA. The Organization for Economic Co-operation and Development (OECD) launched the
Programme for International Student Assessment (PISA) in 1997 to provide comparable data on
student performance. Since 2000, PISA has assessed the skills of 15-year-old pupils every three
years. PISA concentrates on three subjects: mathematics, science and reading. The framework for
evaluation remains the same across time to ensure comparability. In 2009, 75 countries/areas
participated; in 2012, 65 countries/areas participated and in 2015, 72 countries/areas participated.
An important distinction between PISA and IEA surveys is that PISA assesses 15-year-old pupils, regardless of grade level, while IEA surveys assess pupils in grades 4 and 8.
In addition to the above international assessments, a series of regional assessments have been
conducted in Africa and Latin America and the Caribbean.
SACMEQ. The Southern and Eastern Africa Consortium for Monitoring Educational Quality (SACMEQ) administers a psychometrically designed, standardized test which generally assesses math, reading and English among grade 6 pupils. The first SACMEQ round took place between 1995
and 1999. SACMEQ I covered seven different countries and assessed performance only in reading.
The participating countries were Kenya, Malawi, Mauritius, Namibia, United Republic of
Tanzania (Zanzibar), Zambia and Zimbabwe. The studies shared common features (instruments,
target populations, sampling and analytical procedures). SACMEQ II surveyed pupils from 2000-
2004 in 14 countries: Botswana, Kenya, Lesotho, Mauritius, Malawi, Mozambique, Namibia,
Seychelles, South Africa, Swaziland, Tanzania (Mainland), Tanzania (Zanzibar), Uganda, and
Zambia. Notably, SACMEQ II also collected information on pupils’ socioeconomic status as well
as educational inputs, the educational environment and issues relating to equitable allocation of
human and material resources. SACMEQ II also included overlapping items with a series of other
surveys for international comparison, namely the Indicators of the Quality of Education
(Zimbabwe) study, TIMSS and the 1985-94 IEA Reading Literacy Study. The third SACMEQ
round (SACMEQ III) spans 2006-2011 and covers the same countries as SACMEQ II plus
Zimbabwe. SACMEQ collected its latest round of data in 14 countries in East and Southern Africa
from 2012-2014. These include Botswana, Kenya, Lesotho, Mauritius, Malawi, Mozambique,
Namibia, Seychelles, South Africa, Tanzania, Uganda, Zambia, Zanzibar and Zimbabwe.
SACMEQ was designed and scaled to be comparable to past rounds. We include microdata from
prior rounds, and estimates from reports for the latest round of SACMEQ since the microdata are
pending.
LLECE. The Latin American Laboratory for Assessment of the Quality of Education (LLECE)
was formed in 1994 and is coordinated by the UNESCO Regional Bureau for Education in Latin
America and the Caribbean. Assessments conducted by the LLECE focus on achievement in
reading and mathematics in primary school. The first round was conducted in 1998 across grades 3
and 4 in 13 countries. These countries include: Argentina, Bolivia, Brazil, Chile, Colombia, Costa
Rica, Cuba, Dominican Republic, Honduras, Mexico, Paraguay, Peru and Venezuela. The second
round of the LLECE survey was initiated in 2006 in the same countries as LLECE I. In round two,
called the Second Regional Comparative and Explanatory Study (SERCE), pupils were tested in
grade 3 and grade 6. The Third Regional Comparative and Explanatory Study (TERCE) was conducted in 2013 across grades 3 and 6 and included 15 Latin American and Caribbean countries. We only
include SERCE and TERCE data in this database, since these assessments are most similar and
cover comparable grades.
12 The following is a list of PASEC participating countries in chronological order: Djibouti (1994), Congo (1994), Mali (1995), Central
African Republic (1995), Senegal (1996), Burkina Faso (1996), Cameroon (1996), Côte d'Ivoire (1996), Madagascar (1997),
Guinea (2000), Togo (2001), Mali (2001), Niger (2001), Chad (2004), Mauritania (2004), Guinea (2004), Benin (2005), Cameroon
(2005), Madagascar (2006), Mauritius (2006), Republic of Congo (2007), Senegal (2007), Burkina Faso (2007), Burundi (2009),
Côte d’Ivoire (2009), Comoros (2009), Lebanon (2009), Togo (2010), Democratic Republic of Congo (2010), and Chad (2010).
Additional countries took a slightly different test between 2010 and 2011 (Lao PDR, Mali, Cambodia and Vietnam).
EGRA. The Early Grade Reading Assessment (EGRA) is a basic literacy assessment conducted in early
grades. The assessment is conducted most often in grades 2-4. Since 2006, EGRA has been
conducted in over 65 countries. EGRA was developed by RTI and is typically implemented by
USAID, RTI and local partners (Gove 2009).
The assessment is a short oral assessment conducted with a child one-on-one. EGRA is designed
to be flexible and adapted across countries and contexts, while maintaining core modules and
similarities. EGRA is a timed test, enabling uniformity in how it is conducted. The tests often
represent the most common features of the local language and align with the expectations of the
grade level. EGRA includes up to thirteen subtasks, such as ‘oral reading fluency’, ‘vocabulary’,
‘dictation’, and ‘reading comprehension’. Multiple questions are included in each subtask to test
proficiency. Of the thirteen subtasks, there are a few subtasks encouraged to be delivered across
all countries and contexts (Dubeck and Gove 2015).
We compile and include data from the ‘reading comprehension’ indicator in EGRA from 48
countries. This indicator is available in nearly all EGRA data sets and is less sensitive to
differences in context, implementation and language. It also has a strong conceptual link to RSATs and ISATs (Abadzi 2008; Dubeck and Gove 2015), which also measure reading
comprehension. To ensure robustness to language effects, we only include data when students took
the test in their language of instruction. We use data for grades 2-4, for which EGRA is designed; although certain countries administer the test outside this range, we restrict the data in our database to grades 2-4. We scale the EGRA microdata to a mean
of 500 and standard deviation of 100. This scale corresponds to the scale used by RSATs and
ISATs. We include all EGRA data from 2007-2017 as one round. This ensures our scaling is not
biased by changing distributions of countries. In the future, we will consider new EGRA data as
part of a future round and will conduct intertemporal comparisons using a similar approach to
PISA (OECD 2015). Patrinos and Angrist (2018) provide additional detailed analysis and
robustness checks on the inclusion of EGRA data.
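A minimal sketch of this rescaling, assuming a hypothetical vector of raw EGRA percent-correct scores; the distribution used is illustrative only.

```python
import numpy as np

def rescale(scores, target_mean=500.0, target_sd=100.0):
    """Standardize raw scores and map them onto the 500/100 scale used by
    the underlying RSATs and ISATs."""
    z = (scores - scores.mean()) / scores.std(ddof=1)
    return target_mean + target_sd * z

raw = np.random.default_rng(5).beta(2, 5, 1000) * 100  # synthetic raw scores
hlo_scale = rescale(raw)
print(hlo_scale.mean().round(1), hlo_scale.std(ddof=1).round(1))  # 500.0 100.0
```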
The inclusion of EGRA adds 48 countries to the database with at least one data point in the past
10 years, nearly all of which are developing economies. Of the 48 countries, nearly two-thirds (31
countries) have data that is nationally representative. Linking functions for EGRA are derived
using countries with nationally representative data only, to ensure the assumptions underlying the
construction of the linking function hold. We include countries with non-representative data only
when the alternative is no data. We include a dummy variable indicating when the data is not
nationally representative to enable users of the database to analyze the data accordingly.
We include eight learning assessments in our database. Supplement Table IV summarizes the
assessments included. Supplement Table V further describes the distribution of source assessments
included in our database by country-level-year observations. Most regional assessments are done
at the primary level. Moreover, regional assessments comprise nearly 40 percent of primary
country-level-year observations, marking substantial representation of developing countries.
SUPPLEMENT TABLE IV
REVIEW OF STUDENT ACHIEVEMENT TESTS
Organization | Test | Rounds | Subjects | Countries per round | Grades
IEA | TIMSS | Every four years since 2003 (latest round is 2015) | M, S | 38, 26, 48, 66, 65 | 4, 8
UNESCO | LLECE | 2006, 2013 | M, S, R | 13, 16 (only 6 for science) | 3, 6
UNESCO | SACMEQ | 2000, 2003, 2007, 2013 | M, R | 7, 15, 16 | 6
CONFEMEN | PASEC | 2006, 2014 | M, R | 22 (before 2014), 10 | Until 2014: 2, 5; after 2014: 3, 6
IEA | PIRLS | Every five years since 2001 (latest round is 2016) | R | 35, 41, 55 | 4
OECD | PISA | Every three years since 2000 (latest round is 2015) | M, S, R | 43, 41, 57, 74, 65, 71 | Age 15
Notes: M = math; S = science; R = reading.
SUPPLEMENT TABLE V
DISTRIBUTION OF SOURCE TEST FOR HLO
Over-time Comparability.— ISATs and RSATs have been designed to be comparable since the
late 1990s and early 2000s. Thus, the use of these modern assessments enables comparability over
time from this time period onwards.
Time Intervals.— While this is one of the largest and most comprehensive comparable learning
outcomes databases produced to date, it is still sparse given limited test frequency. In other
databases, the data are often aggregated into 5-year periods. This produces continuously spaced
intervals, is designed to reduce noise by averaging results within these intervals, and is comparable
to the Barro-Lee approach for years of schooling. In this database, we have moved away from this
approach. We now provide the year of test as documented in official reports. This enables greater
granularity and precision of the data and enables the users of the database to make trade-offs at
their discretion.
Schooling Levels.— We construct a score for each grade and pool across grades within each
schooling level to produce primary and secondary school scores. We distinguish primary from
secondary schooling since enrollment rates drop off between levels in many developing countries.
This introduces a potential selection term in secondary school scores, with the highest performing
students progressing in the system, biasing scores up due to selection rather than actual learning.
Conceptually, the broader categories of ‘primary’ and ‘secondary’ scores enable us to categorize
learning at schooling levels across assessments which span multiple grades and age groups. If the
test is designed for an age group (for example, PISA) we code it at the relevant schooling level
(for example, secondary for PISA). We specify an approach to including specific grade levels to
ensure we have a tight grade interval within one to two years to minimize scope for grade-fixed
effects. While the interval is relatively small, it still leaves room for grade-fixed effects rather than
test-fixed effects when linking tests. For example, linking PIRLS 2001 grade 4 with SACMEQ
2000 grade 6 might capture a grade difference in PIRLS in addition to test difficulty. However, to enable greater country coverage, we accept the need to expand beyond single-grade intervals. Moreover, these differences are often small, and since linking functions are applied to all
tests being linked, original ranks will be preserved. An analysis of EGRA in Supplement Figure
VIII demonstrates sensitivity to grade. We run a regression with and without grade-fixed effects
comparing mean scores relative to a country which participates across all three grade levels 2-4.
We find small differences, with near complete overlap in the confidence intervals on the grade and
non-grade-fixed estimates. This sensitivity analysis increases our confidence that the EGRA data,
and other regional assessment data, is robust to data availability by grade. We also include a
variable with grade information in the database to make this transparent.
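As a sketch of this sensitivity check, assuming synthetic EGRA country-year means; the country-year labels, values, and use of statsmodels are our illustrative choices, not specified by the paper.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)

# synthetic EGRA means observed at grades 2-4 for four country-years
df = pd.DataFrame({
    "country_year": np.repeat(["KEN2013", "TZA2014", "SEN2014", "NGA2015"], 3),
    "grade": [2, 3, 4] * 4,
})
df["score"] = 450 + 10 * (df["grade"] - 3) + rng.normal(0, 15, len(df))

# country-year estimates with and without grade-fixed effects
no_fe = smf.ols("score ~ C(country_year)", data=df).fit()
with_fe = smf.ols("score ~ C(country_year) + C(grade)", data=df).fit()
print(no_fe.params.filter(like="country_year"))
print(with_fe.params.filter(like="country_year"))
```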
SUPPLEMENT FIGURE VIII
Sensitivity of EGRA Scores to Grade-Fixed Effects
[Chart residue removed; x-axis: country-year dummies.]
Notes: We run a regression with and without grade-fixed effects comparing mean scores relative to a given country which participates across all
three grade levels 2-4 using EGRA data. We only include nationally representative data. The darker “All Grades” bar represents the confidence
interval without grade-fixed effects. The lighter “grade-fixed effects” represents the confidence interval with grade-fixed effects.
Subjects.— We construct linking functions specific to reading, math, and science. While proficiency is not matched at the test-item level, this ensures significant overlap in the proficiencies being compared when tests are put on a global scale.
Subsamples.—When calculating the HLO by gender we apply the average linking function to each
subsample, rather than constructing subsample specific linking functions. While performance is
likely to vary across subsamples in a given test, the relationship between the pair-wise tests being linked is unlikely to vary across subsamples or relative to the full sample.
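As a minimal illustration, with hypothetical linking parameters and subsample means:

```python
# hypothetical average linking function estimated on the full sample
intercept, slope = 35.0, 0.92

# apply the same function to each gender subsample's mean score
hlo_female = intercept + slope * 455.0   # hypothetical female mean, source test
hlo_male = intercept + slope * 441.0     # hypothetical male mean, source test
print(hlo_female, hlo_male)
```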
Metadata versus Aggregate Data.— Our database is disaggregated by subject, schooling level,
grade, year and source test. We call this version the ‘metadata.’ The final data series used in the
Human Capital Index (HCI) aggregates the metadata presented in this paper. The aggregation used
in the HCI is described in depth in Kraay (2019). There are multiple ways to aggregate the data.
For example, the HCI averages data across schooling levels and subjects and uses the most recent
year available. The HCI further combines data differently depending on the testing source, for
example, including EGRA data in the final time series only when no other data is available. This
implicitly weights the importance of testing source over schooling level or subject. Alternative
aggregations of the metadata are possible. We present the metadata in this database to enable users
to make judgements based on the purpose of their analysis and for maximum transparency.
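A sketch of one such aggregation, in the spirit of the HCI rule described above; the metadata rows below are hypothetical, and the exact HCI procedure is as documented in Kraay (2019), not this code.

```python
import pandas as pd

meta = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "subject": ["math", "math", "reading", "math", "reading"],
    "level": ["pri", "pri", "sec", "pri", "pri"],
    "year": [2006, 2013, 2013, 2011, 2015],
    "hlo": [480, 510, 470, 430, 445],
})

# keep the most recent observation per country-subject-level cell...
latest = (meta.sort_values("year")
              .groupby(["country", "subject", "level"])
              .tail(1))

# ...then average across subjects and schooling levels within country
print(latest.groupby("country")["hlo"].mean())
```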
Exceptions.— In unusual cases, the procedures practiced for a given international or regional test
are adapted for the country context. Sri Lanka took a national assessment with items linked to the
PISA test to provide comparable scores. In Botswana, sixth-grade students took TIMSS in 2011 rather than fourth-grade students. Another example includes India and China, where only certain states and
cities participated in PISA. These variations are acknowledged by the underlying tests and the data
is caveated with an asterisk in published reports. We preserve this information in our data, and
include notes in the metadata for each case. We describe each case in detail in Patrinos and Angrist
(2018). In the case of India, we verify that the state data is likely to be nationally representative
using national assessment data.
We make an adjustment beyond the underlying tests in the case of China given the likelihood that
China’s current PISA data is biased. The China HLO based on 2015 PISA data comes from four jurisdictions (Beijing, Shanghai, Jiangsu, and Guangdong) and is 532. However, this figure is likely biased upwards since the participating areas are richer and more urban than the national average. We adjust
the score based on socioeconomic information by city and across the nation and produce an
average HLO of 462 at the secondary level, which is plausibly representative at the national level.
The detailed procedure is described in Patrinos and Angrist (2018).
SUPPLEMENT REFERENCES
Abadzi, Helen. “Efficient learning for the poor: New insights into literacy acquisition for
children.” International Review of Education 54, no. 5-6 (2008): 581-604.
Altinok, Nadir, Noam Angrist, and Harry A. Patrinos. Global Dataset on Education Quality 1965-
2015. World Bank Policy Research Working Paper No. 8314, 2018.
Barro, Robert J., and Jong Wha Lee. “A new data set of educational attainment in the world, 1950–
2010.” Journal of Development Economics 104 (2013): 184-198.
Caselli, Francesco. “Accounting for Cross-country Income Differences.” Handbook of Economic Growth 1 (2005): 679-741.
Caselli, Francesco and Antonio Ciccone. “The Human Capital Stock: A Generalized Approach. Comment.” American Economic Review 109, no. 3 (2019): 1155-74.
Caselli, Francesco and John Coleman. “The World Technology Frontier.” American Economic
Review 96.3 (2006): 499-522.
Das, Jishnu and Tristan Zajonc. “India Shining and Bharat Drowning: Comparing Two Indian States to the Worldwide Distribution in Mathematics Achievement.” Journal of Development Economics 92, no. 2 (2010): 175-187.
Dubeck, Margaret M. and Amber Gove. “The Early Grade Reading Assessment (EGRA): Its
Theoretical Foundation, Purpose, and Limitations.” International Journal of Educational
Development 40 (2015): 315-322.
Gneezy, Uri, John A. List, Jeffrey A. Livingston, Sally Sadoff, Xiangdong Qin, and Yang Xu.
Measuring success in education: the role of effort on the test itself. No. w24004. National Bureau
of Economic Research, 2017.
Gove, Amber. Early Grade Reading Assessment Toolkit. RTI International, USAID and the World
Bank, 2009.
Hall, Robert E. and Charles I. Jones. “Why Do Some Countries Produce So Much More Output Per
Worker Than Others?” Quarterly Journal of Economics, 114, no.1 (1999): 83-116.
Hanushek, Eric A., and Ludger Woessmann. “Do Better Schools Lead to More Growth? Cognitive
Skills, Economic Outcomes, and Causation.” Journal of Economic Growth 17, no.4 (2012a): 267-
321.
Hanushek, Eric A., and Lei Zhang. “Quality-consistent estimates of international schooling and skill
gradients.” Journal of Human Capital 3, no. 2 (2009): 107-143.
Hendricks, Lutz. “How Important is Human Capital for Development? Evidence from Immigrant
Earnings.” American Economic Review 92, no.1 (2002), 198-219.
Hendricks, Lutz. and Todd Schoellman. “Human Capital and Development Accounting: New
Evidence from Wage Gains at Migration.” Quarterly Journal of Economics, 133 no. 2 (2017):
665-700.
Holland, Paul W. and Neil J. Dorans. “Linking and Equating.” Educational Measurement 4 (2006):
187-220.
Jones, Benjamin F. “The Human Capital Stock: A Generalized Approach.” American Economic
Review 104, no.11 (2014): 3752-77.
Jones, Benjamin F. “The Human Capital Stock: A Generalized Approach: Reply.” American
Economic Review 109, no. 3 (2019): 1175-95.
Klenow, Peter J., and Andres Rodriguez-Clare. “The Neoclassical Revival in Growth Economics: Has It Gone Too Far?” NBER Macroeconomics Annual. MIT Press, Cambridge, MA (1997): 83-103.
Kolen, Michael J., and Robert L. Brennan. Nonequivalent groups: Linear methods. Test equating,
scaling, and linking. (2014): 103-142.
Kraay, Aart. “The World Bank Human Capital Index: A Guide.” The World Bank Research Observer 34, no. 1 (2019): 1-33.
Mankiw, N. Gregory, David Romer and David N. Weil. “A Contribution to the Empirics of Economic
Growth.” Quarterly Journal of Economics 107, no. 2 (1992): 407-437.
Mislevy, Robert J., Albert E. Beaton, Bruce Kaplan and Kathleen M. Sheehan. “Estimating
Population Characteristics from Sparse Matrix Samples of Item Responses.” Journal of
Educational Measurement 29, no.2 (1992): 133-161.
OECD. PISA 2015 Technical Report. OECD Publishing, 2015.
Patrinos, Harry Anthony, and Noam Angrist. Global Dataset on Education Quality: A Review and
Update (2000–2017). The World Bank, 2018.
Psacharopoulos, George, and Harry Anthony Patrinos. “Returns to investment in education: a further
update.” Education Economics 12, no. 2 (2004): 111-134.
Reardon, Sean F., Demetra Kalogrides, and Andrew D. Ho. “Validation Methods for Aggregate-level Test Scale Linking: A Case Study Mapping School District Test Score Distributions to a Common Scale.” Journal of Educational and Behavioral Statistics (2019).
Schoellman, Todd. “Education Quality and Development Accounting.” The Review of Economic
Studies 79, no. 1 (2011): 388-417.
Steinmann, Isa, Rolf Strietholt and Wilfried Bos. Linking International Comparative Student
Assessment. LINCS Technical Report, 2014.
SUPPLEMENTAL TABLES
SUPPLEMENT TABLE VI
TEST LINKING ARCHITECTURE
Notes: For ease of representation, we include countries used at any point in time for each test linking procedure. In some rounds, some countries
are not included, since we specify that for a given round to be linked, tests should be administered in adjacent years. A more detailed architecture
by year is available on request.
SUPPLEMENTAL FIGURES
SUPPLEMENT FIGURE IX
Primary Learning Score
[World map; legend range 268 to 597; countries without data marked “No data”.]
SUPPLEMENT FIGURE X
Secondary Learning Score
[World map; legend range 297 to 577; countries without data marked “No data”.]
SUPPLEMENT FIGURE XI
Math Learning Score
[World map; legend range 305 to 601; countries without data marked “No data”.]
[Supplement Figures XII–XIV: world maps; legend ranges 243 to 548, 288 to 578, and 278 to 583.]
SUPPLEMENT FIGURE XV
Male Learning Score
[World map; legend range 283 to 575; countries without data marked “No data”.]