The Effects of Label Errors in Training Data on Model Performance and Overfitting
Nicholas Pellegrino*1 Nolen Zhao*1,2 Paul Fieguth1
1 Vision and Image Processing Group, Systems Design Engineering, University of Waterloo
2 Mechanical & Mechatronics Engineering, University of Waterloo
{npellegr,n37zhao,pfieguth}@uwaterloo.ca
Abstract

Training data used in machine learning applications are often assumed to be perfect, i.e., do not contain any errors; however, this is almost never the case and may lead to limitations in the resulting model performance. In this paper, the effects of the presence of label errors in training data are studied quantitatively and in relation to model overfitting. By artificially creating label errors, it is observed that a constrained (small) CNN model exhibits remarkable generalizability — retaining high accuracy even when most data are mislabelled! Test accuracy catastrophically falls only for unrealistically high label error rates, at a point related to the number of classes present in the data. These preliminary experiments pave the road towards further studies of model robustness, possibly offering a quantitative method through which to compare models.
1 Introduction
In supervised learning problems, a set of labelled data, known as training data, are required to optimize / train the model [1, 2]. Deep neural networks, including convolutional neural networks (CNNs) [3, 4], consist of layers of interconnected artificial neurons with associated weights which must be optimized in order to train the model. Machine learning engineers normally assume that the “ground truth” training data are labelled correctly; however, this is not necessarily the case, and in fact is often not the case! Indeed, in many benchmark datasets, label errors are present at rates on the order of 5% [5], for example, in ImageNet [6]. In biological data, for example the recently introduced BIOSCAN-1M Insect Dataset [7], where images of insects are labelled according to their taxonomy, the presence of labelling errors is nearly inevitable given the difficulty of the taxonomic assessment problem [7, 8] and human error. In cases where training data label errors exist, one must ask how model performance ought to be evaluated, and what it means to achieve a particular percentage accuracy when some (likely unknown) fraction of labels are incorrect.

Fig. 1: Training data may contain both outliers and label errors. The two columns include versions of a 2-class dataset: one with outliers and the other with label errors. The first row shows the data points, while the subsequent rows show nearest-neighbour (1-NN) and 5-nearest-neighbour (5-NN) classification regions. Data point shape (circle vs. triangle) indicates true class and colour (red vs. blue) indicates ground truth label. Outliers and mislabelled data may appear to be similar, but arise from completely different causes.
For illustration purposes, two versions of labelled data from a simple 2-class problem are pictured in the top row of Figure 1. In the first column, the data contain outliers, and in the second column, there are label errors. Data point shape (circle vs. triangle) indicates the true class and colour (red vs. blue) indicates the ground truth (training) label. Note that while outliers and mislabelled data may appear to be similar, the two arise from completely different causes and will impact classification models differently. Assuming data are clustered with high density, surrounding some prototypical center point, outlier points are those that are far from their true class’s center, whereas mislabelled data may appear anywhere but are often (due to the assumption of high density) near their true class’s center. The presence of mislabelled data may lead to local classification error, especially in cases of overfitting. Indeed, a more local classifier, such as a nearest-neighbour scheme (shown in the middle row of Figure 1), would be highly susceptible to overfitting and local errors, whereas a more global classifier, for example a 5-nearest-neighbour scheme (shown in the bottom row of Figure 1), would be less susceptible to overfitting and local classification errors due to its reliance on the consensus of multiple training data points.
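To make this contrast concrete, the following is a minimal sketch (not part of the original experiments) of the effect illustrated in Figure 1, using scikit-learn's KNeighborsClassifier on a synthetic 2-class dataset with a fraction of labels flipped; the cluster parameters and flip rate are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_per_class = 200

# Two well-separated Gaussian clusters (true classes 0 and 1).
def sample():
    return np.vstack([
        rng.normal(loc=[-2.0, 0.0], scale=1.0, size=(n_per_class, 2)),
        rng.normal(loc=[+2.0, 0.0], scale=1.0, size=(n_per_class, 2)),
    ]), np.repeat([0, 1], n_per_class)

X_train, y_true = sample()
X_test, y_test = sample()  # clean test set from the same distribution

# Corrupt 20% of the training labels (flip to the other class).
y_train = y_true.copy()
flip = rng.random(y_train.size) < 0.20
y_train[flip] = 1 - y_train[flip]

# 1-NN memorizes the mislabelled points; 5-NN relies on local consensus.
for k in (1, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(f"{k}-NN test accuracy: {clf.score(X_test, y_test):.3f}")
```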
While simple nearest-neighbour classification schemes may be easy to envision and intuitively understood for simple problems such as that of Figure 1, the behaviours of deep-neural-network-based classifiers on real-world problems are not. This paper studies the impact of having mislabelled training data by artificially corrupting the training data from a familiar benchmark dataset, MNIST [9], and then training and evaluating a simple CNN model. By setting the corruption rate, evaluations of model overfitting are made in a very controlled environment. Techniques shown here may also lend themselves toward determining whether a particular model type may be more or less robust to the presence of training label errors.

* Indicates equal contribution, joint first-authorship.

2 Background

As introduced in Section 1, biological data are especially prone to mislabelling due to their complex nature. In particular, the BIOSCAN project [10] is an ecologically important and relevant research effort in which the presence of label errors must be considered. In the BIOSCAN project, insects are hand-labelled by taxonomic experts who make their assessments based on captured images. The main difficulty here, ignoring the requirement for a high level of expertise, is the lack of consensus and certainty about the taxonomy of life itself (i.e., the locations and numbers of branches / subcategories within the tree-like hierarchy). Fundamentally, the taxonomic categorization of life is based on theory more so than an observable underlying structure. Indeed, much controversy may be found within the community of taxonomists! Nonetheless, it is accepted that a hierarchical structure does exist and may eventually be largely uncovered. Therefore, the notion of what should be considered an error is somewhat vague. Errors may arise as a result of human error (e.g., labelling two examples of the same species as being of different taxa), or as a result of simply not knowing in which category a given example belongs (e.g., labelling an example (or an entire group of examples) as being part of a given category when in fact it would better fit elsewhere). In the BIOSCAN-1M Insect Dataset, the error rate is unknown; however, there is no doubt that some errors are present.
To address the presence of label errors, in 2021, Northcutt et al. developed a method for automatically detecting and correcting errors in training data, known as Confident Learning [5, 11]. In doing so, benchmark datasets including MNIST [9], CIFAR [12], ImageNet [6], and more were examined, and possibly mislabelled examples were identified. Crowd-sourcing (Mechanical Turk) was then used both to verify which selected examples were indeed incorrectly labelled, and to propose a corrected label through consensus. These results are available at labelerrors.com and provide a valuable resource for those in the field. While this work provides one possible path forwards in contending with label errors in training data, little is known about the behaviour and robustness of specific deep neural network architectures in terms of handling label errors.
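As a rough illustration of the idea behind Confident Learning (a simplified sketch, not the authors' exact algorithm nor the cleanlab implementation), a per-class confidence threshold can be estimated from out-of-sample predicted probabilities, and examples whose given label disagrees with a sufficiently confident alternative class can be flagged for review:

```python
import numpy as np

def flag_possible_label_errors(labels: np.ndarray, pred_probs: np.ndarray) -> np.ndarray:
    """Simplified confident-learning-style filter.

    labels:     (N,) integer given labels
    pred_probs: (N, M) out-of-sample predicted class probabilities
    Returns a boolean mask of examples flagged as possible label errors.
    """
    n_classes = pred_probs.shape[1]

    # Per-class threshold: mean self-confidence over examples given that label.
    thresholds = np.array([
        pred_probs[labels == j, j].mean() if np.any(labels == j) else 1.0
        for j in range(n_classes)
    ])

    # Flag an example if some other class is predicted above that class's
    # threshold and more strongly than the given label.
    flagged = np.zeros(len(labels), dtype=bool)
    for n, (given, probs) in enumerate(zip(labels, pred_probs)):
        confident = np.where(probs >= thresholds)[0]
        confident = confident[confident != given]
        if confident.size and probs[confident].max() > probs[given]:
            flagged[n] = True
    return flagged
```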
3 Preliminary Experiments & Results
Experiments are conducted upon the MNIST dataset, known to have a very low error rate (0.15%) [5] due to its simplicity. To evaluate the impact of having increased error rates on model accuracy, the training partition of the dataset is artificially corrupted. Data are re-labelled according to a specified corruption rate, r_c ∈ [0, 1]. Whether any given example is re-labelled is determined randomly, according to whether a random number drawn from a uniform distribution is less than r_c. In this manner, over large quantities of data, the proportion of re-labelled data approximates r_c. Note that if selected, an example’s label is necessarily changed, i.e., made incorrect.
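The corruption procedure can be sketched as follows (an illustrative reimplementation under the stated assumptions, not the authors' released code): each training label is independently selected for corruption with probability r_c, and a selected label is replaced by a label drawn uniformly from the other M − 1 classes, guaranteeing it becomes incorrect.

```python
import numpy as np

def corrupt_labels(labels: np.ndarray, corruption_rate: float,
                   num_classes: int = 10, seed: int = 0) -> np.ndarray:
    """Randomly re-label approximately corruption_rate of the examples.

    A selected example is always given a *different* label, drawn uniformly
    from the remaining num_classes - 1 classes.
    """
    rng = np.random.default_rng(seed)
    corrupted = labels.copy()

    # Select examples for re-labelling: uniform draw < corruption_rate.
    selected = rng.random(len(labels)) < corruption_rate

    # Add a random offset in [1, num_classes - 1] so the new label always differs.
    offsets = rng.integers(1, num_classes, size=selected.sum())
    corrupted[selected] = (labels[selected] + offsets) % num_classes
    return corrupted
```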
Throughout all experiments, model and training hyperparameters are set according to values specified in Table 1. To keep experiments simple, a minimalistic model based on an introductory example from PyTorch [13] capable of achieving > 99% accuracy on the MNIST dataset was selected. The model used is a CNN consisting of two convolutional layers, followed by max pooling, dropout, a fully connected layer, dropout, and a final fully connected layer. In total, the model has only 1.2 M trainable parameters.

Table 1: Hyperparameters used for experiments.

Parameter       Setting
Loss function   Cross-Entropy
Optimizer       SGD with momentum
Learning rate   0.01
Momentum        0.9
Batch-Size      64
Num. Epochs     12
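For concreteness, a sketch of a model and training setup matching this description is given below. It follows the structure of the PyTorch basic MNIST example [13] with the Table 1 settings; the exact layer widths are illustrative assumptions and may differ slightly from those used in the experiments, though they yield roughly 1.2 M trainable parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class SmallCNN(nn.Module):
    """Two conv layers, max pooling, dropout, FC, dropout, FC (~1.2 M parameters)."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=3)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)   # 64 channels x 12 x 12 after pooling
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.dropout2(x)
        return self.fc2(x)  # raw logits, paired with CrossEntropyLoss

# Training configuration following Table 1.
model = SmallCNN()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Batch size 64, 12 epochs; data loading and the training loop are omitted.
```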
Firstly, the model validation accuracy is examined as a function of training data corruption rate, shown in Figure 2. Observe that accuracy remains approximately steady and high (over 95%!) until a corruption rate of approximately r_c = 0.9, where an abrupt downward change occurs, before settling out once again. This finding is quite remarkable, given that the model continues to be accurate even when most training data are mislabelled! The abrupt change seems to correspond to the transition point at which, for any class, the number of labels indicating the correct class equals the number of labels for any other, incorrect, class. Before this point, the model still tends to learn the correct class-label association, and performs quite well. After this point, the model has overfit to the mislabelled data and performs poorly during testing. In terms of the number of classes within the dataset (M = 10 for MNIST), the relationship determining the location of this catastrophic change in model behaviour appears to be r_c' = 1 − 1/M.

Fig. 2: Model accuracy as a function of training data corruption rate. Accuracy remains remarkably high even when most training data are mislabelled! Until the corruption rate nears 0.9, model performance is hardly affected. Beyond this point, there are fewer labels of the correct class than of any other, incorrect, class, and accuracy plummets towards zero.

To verify the relationship between the location of the abrupt change and the number of classes, a similar experiment, in which the number of classes is artificially reduced, is conducted. Here, accuracy results for the original 10-class problem are shown alongside those of a 6-class and a 2-class problem. In each case, for the general M-class problem, examples from the first M classes of MNIST are retained, omitting the remainder. Figure 3 shows the results of this experiment, which indeed confirm that the point at which the catastrophic change occurs is related to the number of classes through r_c' = 1 − 1/M.

Fig. 3: Similar to the accuracy vs. corruption rate plot of Figure 2, model accuracy is evaluated for a 10-class, 6-class, and 2-class problem. For the 10-class problem, the abrupt change occurs at a corruption rate of approximately 0.9, whereas for the 6-class problem, the abrupt change occurs at only 0.83, and for the 2-class problem, already at 0.5. This location follows the trend specified by r_c' = 1 − 1/M, where M is the number of classes.
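This threshold follows from a simple counting argument (a brief derivation added here for clarity, assuming corrupted labels are spread uniformly over the other M − 1 classes, as in the corruption procedure above):

```latex
% A fraction (1 - r_c) of each true class's samples keep the correct label, while
% the corrupted fraction r_c is spread uniformly over the other M - 1 labels.
% The correct label stops being the most common one when
\[
  1 - r_c \;=\; \frac{r_c}{M - 1}
  \quad\Longrightarrow\quad
  r_c' \;=\; \frac{M - 1}{M} \;=\; 1 - \frac{1}{M}.
\]
% For M = 10, 6, 2 this gives r_c' = 0.90, 0.83, 0.50, matching Figure 3.
```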
To gain further insight, training and testing loss are examined in Figure 4. Notice that while both losses do increase with increasing corruption rate, the testing loss remains below the training loss until a cross-over point at r_c = 0.9, demonstrating for this range of corruption rates that the model performs better during testing than it does during training and is able to generalize quite well (i.e., not overfit) in spite of large quantities of mislabelled data. At r_c = 0.9, the cross-over point, for each true class there are approximately equal numbers of training samples labelled as each of the ten classes, and the model learns to randomly guess, thereby resulting in equal loss during training and testing. Beyond the cross-over point, fewer training samples of each given class are labelled correctly than as any other class, the model learns to not estimate the correct class (i.e., has overfit to mislabelled data), training loss plateaus, and testing loss spikes.

Fig. 4: Training and testing loss as a function of corruption rate. Observe the cross-over point at r_c = 0.9, whereby testing loss begins to exceed training loss. Training loss tends to plateau as false labels tend towards being fully uncorrelated and then anti-correlated with the data itself, i.e., random but not correct. Testing loss is initially below training loss, as the model is still able to partially learn the correct class-label relationships (given that most data are still correctly labelled); however, beyond the cross-over point, most data are not labelled correctly, the model learns to not estimate the correct class, and testing loss spikes.

4 Discussion

In Figure 2, accuracy tapers quite gradually for modest (i.e., realistic) corruption rates, for example 0.05 < r_c < 0.3. This insensitivity to corruption rate indicates that the model is able to generalize well, and may be a feature useful as a point of comparison between model types. Models for which accuracy decreases at a greater rate would have a greater tendency to overfit and would generalize more poorly than those models for which accuracy decreases more gradually.

Comparing loss with accuracy: in Figure 4, loss increases gradually with corruption rate, while in Figure 2 the accuracy is almost invariant to corruption rate until a point at which there is a catastrophic and large change. This behaviour in accuracy seems to contradict what is seen in the loss:

Why is it that loss changes by only a small amount (specifically surrounding the r_c = 0.9 point) while accuracy rapidly plummets from near 100% to near 0%?

This is a result of how inference is performed and how cross-entropy loss is defined. The model outputs (after running through SoftMax) a set of predicted class probabilities, {p̂_i}, i ∈ [1, 10]. The class with the highest predicted probability is selected as the inferred class for a given input, i.e.,

\[
  \text{predicted class} = \arg\max_i \hat{p}_i. \qquad (1)
\]

So long as the predicted probability for the correct class, p̂_c, is slightly higher than that of all others, p̂_i, i ≠ c, the network will infer the correct class. As corruption rates increase towards r_c = 0.9, fewer and fewer samples are correctly labelled, and the predicted class probabilities tend towards a uniform random distribution. Just prior to r_c = 0.9, the amount of correctly labelled data slightly exceeds the amount of incorrectly labelled data for each label, the predicted class probability for the correct class, p̂_c, generally slightly exceeds that of all others (just greater than 0.1), and the model still tends to classify testing data correctly. However, cross-entropy loss computes the negative natural log of the predicted correct class probability, p̂_c, averaged over all samples, indexed by n, in a batch of size N,

\[
  J_{\mathrm{CE}} = -\frac{1}{N} \sum_{n=1}^{N} \ln\left(\hat{p}_c\right). \qquad (2)
\]

Notice that −ln(0.1) ≈ 2.3026, almost exactly the loss seen at the cross-over point, at r_c = 0.9. The negative log of the predicted correct class probability, −ln(p̂_c), is smooth and does not exhibit a large change surrounding the point p̂_c = 0.1, whereas the highly non-linear class selection method of Equation (1), which simply selects the class with the highest predicted probability, abruptly changes as p̂_c decreases below 0.1, leading to a near-instantaneous loss in accuracy.
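This can be checked numerically with a small sketch (illustrative, not from the paper): as the predicted probability of the correct class drops slightly below the 1/M = 0.1 level, the cross-entropy loss barely moves, while the arg-max decision flips and accuracy collapses.

```python
import numpy as np

num_classes = 10

for p_correct in (0.12, 0.101, 0.099, 0.08):
    # Remaining probability mass spread evenly over the nine incorrect classes.
    p_other = (1.0 - p_correct) / (num_classes - 1)
    probs = np.full(num_classes, p_other)
    probs[0] = p_correct  # class 0 is the correct class

    predicted = int(np.argmax(probs))   # Equation (1): arg-max inference
    loss = -np.log(p_correct)           # Equation (2): per-sample cross-entropy

    print(f"p_correct={p_correct:.3f}  cross-entropy={loss:.4f}  "
          f"predicted class={predicted}  correct={predicted == 0}")
```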
While it is totally unrealistic to assume that models are being trained with data having error rates towards r_c = 0.9 in practice, the resulting observed trends in accuracy vs. corruption rate do reveal a great deal about the robustness of a particular model to the presence of label errors. Robustness to mislabelled data indicates that a model is better able to generalize, and not overfit to mislabelled data. While only one model was explored in this study, this type of approach may be used to analyze and compare other prospective models for use in more complex classification problems in the real world, allowing a designer to discover which models or architectures are most susceptible to overfitting the dataset at hand, and to select the most suitable one.

5 Conclusion

This study investigated the impacts of the presence of label errors in training data on model accuracy and on training and testing loss. A simple CNN model was used, with data artificially corrupted in the MNIST dataset. Remarkably, the model continued to perform with high accuracy (over 95%) even when most training data was mislabelled! While cases of data with large error rates are highly unlikely in practice, similar investigations may be useful for machine learning engineers to learn more about which model architectures tend to generalize better and can be used to avoid overfitting to mislabelled data.

Much future work remains in the study of label errors and model overfitting. Investigations of

• more complicated models and classification problems (datasets),
• non-uniform error distributions (since label errors in real data are likely to exhibit some correlation), and
• constraints that may induce overfitting (e.g., limiting the amount of data)

will be performed in order to better understand the architectural features that make certain models more robust.

Acknowledgments

This research was enabled in part by support provided by Calcul Québec (calculquebec.ca) and the Digital Research Alliance of Canada (alliancecan.ca).

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), NSERC-PGS D, and NSERC Discovery Grant, funding reference number RGPIN-2020-04490.

Cette recherche a été financée par le Conseil de recherches en sciences naturelles et en génie du Canada (CRSNG), CRSNG-ES D, et CRSNG Subvention à la Découverte, numéro de référence RGPIN-2020-04490.
References
[1] V. Nasteski, “An overview of the supervised machine learning methods,” Horizons. b, vol. 4, pp. 51–62, 2017.
[2] A. Mathew, P. Amudha, and S. Sivakumari, “Deep learning techniques: an overview,” Advanced Machine Learning Technologies and Applications: Proceedings of AMLTA 2020, pp. 599–608, 2021.
[3] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[4] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[5] C. G. Northcutt, A. Athalye, and J. Mueller, “Pervasive label errors in test sets destabilize machine learning benchmarks,” NeurIPS 2021 Datasets and Benchmarks Track, 2021.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[7] Z. Gharaee, Z. Gong, N. Pellegrino, I. Zarubiieva, J. B. Haurum, S. C. Lowe, J. T. McKeown, C. C. Ho, J. McLeod, Y.-Y. C. Wei et al., “A step towards worldwide biodiversity assessment: The BIOSCAN-1M Insect Dataset,” arXiv preprint arXiv:2307.10455, 2023.
[8] N. Pellegrino, Z. Gharaee, and P. Fieguth, “Machine learning challenges of biological factors in insect image data,” Journal of Computational Vision and Imaging Systems, vol. 8, no. 1, pp. 34–37, 2022.
[9] L. Deng, “The MNIST database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
[10] “BIOSCAN,” Jun 2022. [Online]. Available: https://ibol.org/programs/bioscan/
[11] C. Northcutt, L. Jiang, and I. Chuang, “Confident learning: Estimating uncertainty in dataset labels,” Journal of Artificial Intelligence Research, vol. 70, pp. 1373–1411, 2021.
[12] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
[13] PyTorch, “Basic MNIST example,” Sep 2022. [Online]. Available: https://github.com/pytorch/examples/tree/main/mnist