A Theoretical Analysis of the Learning Dynamics
under Class Imbalance
E. Francazi (a,b), M. Baity-Jesi (b), A. Lucchi (c)
(a) Physics Department, EPFL, Switzerland
(b) SIAM Department, Eawag (ETH), Switzerland
(c) Department of Mathematics and Computer Science, University of Basel, Switzerland
contact: [email protected]
Why Class Imbalance is interesting
Many datasets are affected by Class Imbalance:
- fraud detection
- spam identification
- biodiversity monitoring
- ...

Example domains: Species [Kyathanahally et al. 2021], Faces [Zhang et al. 2017], Places [Wang et al. 2017], Actions [Zhang et al. 2019].

Imbalance can significantly impact performance:
- The performance of minority classes drops: Minority Initial Drop (MID).
- The learning dynamics are delayed for imbalanced problems.
How does class imbalance affect learning dynamics?
- Class Imbalance causes a drop in minority class performance (MID).
- This delays the learning process.
- GD and SGD are affected differently by Class Imbalance.
MID is caused by differences in the per-class gradients

[Figure: geometry of the per-class gradients; legend: majority class gradient, minority class gradient, whole dataset, descent vector, gradient of single example]
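The mechanism can be read off the per-class decomposition of the full-batch gradient (a sketch in our own notation, not taken verbatim from the poster: L_c is the mean loss over the N_c examples of class c, N the dataset size, eta the step size):

```latex
\nabla L(\theta) = \sum_{c} \frac{N_c}{N}\, \nabla L_c(\theta),
\qquad
\Delta L_c \approx -\eta\, \nabla L_c(\theta)^{\top} \nabla L(\theta).
```

When N_maj >> N_min, the sum is dominated by the majority term, so the descent vector -grad L essentially follows the majority class; whenever the minority gradient has a negative scalar product with grad L, a GD step increases the minority loss, which is the MID.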
Gradient Descent:
- The gradient is dominated by the majority class contribution.
- The minority class contribution has a negative scalar product with the descent vector (the per-class loss increases).

Stochastic Gradient Descent:
- Randomness due to the random selection of batches causes directional noise in the gradients.
- Directional noise is higher for the minority class (lower signal along the full-batch per-class direction).

[Figure legend: per-class full-batch gradient; per-class full-batch normalized gradient; per-class mini-batch normalized gradient; orthogonal projection]
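As a toy illustration of the GD-side claims (our sketch, not the paper's code: logistic regression on synthetic overlapping Gaussian classes with a 9:1 imbalance; the data, model, and all names such as grad_logistic are our assumptions), one can compare the per-class gradient norms and the sign of the minority/full-batch scalar product:

```python
# Toy sketch (ours, not from the paper): per-class gradients of logistic
# regression on an imbalanced synthetic dataset with overlapping classes.
import numpy as np

rng = np.random.default_rng(0)
n_maj, n_min, d = 900, 100, 10                      # 9:1 imbalance
X = np.vstack([rng.normal(0.5, 1.0, (n_maj, d)),    # majority class (label 0)
               rng.normal(0.2, 1.0, (n_min, d))])   # minority class (label 1)
y = np.hstack([np.zeros(n_maj), np.ones(n_min)])
w = rng.normal(0.0, 0.1, d)                         # small random init

def grad_logistic(X, y, w):
    """Mean cross-entropy gradient of a logistic model p = sigmoid(X w)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)

g_maj = grad_logistic(X[y == 0], y[y == 0], w)
g_min = grad_logistic(X[y == 1], y[y == 1], w)
g_full = (n_maj * g_maj + n_min * g_min) / (n_maj + n_min)

# The majority class dominates the full-batch gradient norm ...
print("|g_maj| / |g_min| =", np.linalg.norm(g_maj) / np.linalg.norm(g_min))
# ... and if <g_min, g_full> < 0, a GD step *increases* the minority loss (MID).
print("<g_min, g_full>   =", g_min @ g_full)
```

The norm ratio and the (here typically negative) scalar product illustrate, respectively, majority dominance of the descent direction and the first-order increase of the minority loss.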
Gradient Descent: Class Imbalance induces a difference in the per-class gradient norms; GD dynamics, which follow the gradient direction, will be ruled by the majority class.

Stochastic Gradient Descent: Imbalance induces a difference in the per-class gradient directional noise; the signal along the full-batch direction (FBD) is damped more for the minority class.
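To make the "damped signal" statement concrete, here is a small estimator (our sketch; per_class_signal is a hypothetical helper, reusing grad_logistic, X, y, w, rng from the toy example above): it draws mini-batches, normalizes each per-class mini-batch gradient, and averages its projection onto the normalized full-batch per-class gradient; a smaller mean projection means stronger directional noise.

```python
# Sketch (ours): mean projection ("signal") of normalized per-class mini-batch
# gradients onto the full-batch per-class direction (FBD). Reuses
# grad_logistic, X, y, w, rng from the toy example above.
def per_class_signal(X, y, w, batch_size=64, n_batches=300):
    signal = {}
    for c in (0, 1):
        fbd = grad_logistic(X[y == c], y[y == c], w)
        fbd /= np.linalg.norm(fbd)          # full-batch per-class direction
        projs = []
        for _ in range(n_batches):
            idx = rng.choice(len(y), size=batch_size, replace=False)
            mask = y[idx] == c
            if not mask.any():              # minority may miss a batch entirely
                continue
            g = grad_logistic(X[idx][mask], y[idx][mask], w)
            projs.append(g @ fbd / np.linalg.norm(g))
        signal[c] = np.mean(projs)
    return signal

# Expected qualitative outcome: lower signal for the minority class (c = 1),
# since each batch of 64 contains only ~6 minority examples on average.
print(per_class_signal(X, y, w))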
Equalizing the per-class norms, i.e. normalizing the gradient contribution from each class (PCNGD), eliminates the gap between per-class performances under GD. Under SGD, per-class normalization alone is not enough (PCNSGD); we also need to equalize the projections along the FBD (PCNSGD+R).

[Figure: effect of per-class normalization; panels (a) full batch, (b) one mini-batch, (c) many mini-batches]
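One plausible reading of a PCNGD-style update, sketched under our assumptions (unit-normalize each per-class gradient before combining; the paper may weight or schedule this differently, and pcngd_step is our name), reusing the toy model above:

```python
# Sketch (ours) of a PCNGD-style step: unit-normalize each class's gradient
# before combining, so no class dominates the descent direction by norm alone.
# Reuses grad_logistic, X, y, w from the toy example above.
def pcngd_step(X, y, w, lr=0.1, eps=1e-12):
    classes = np.unique(y)
    g = np.zeros_like(w)
    for c in classes:
        g_c = grad_logistic(X[y == c], y[y == c], w)
        g += g_c / (np.linalg.norm(g_c) + eps)   # per-class normalization
    return w - lr * g / len(classes)

w = pcngd_step(X, y, w)   # one update; iterate as usual for training
```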
Conclusion
- Class Imbalance induces differences in the per-class gradients, causing a drop in minority class performance.
- GD: Per-class normalization allows for a monotonic loss. We prove convergence.
- SGD: Additional directional noise must be taken into account.
- Directional noise explains the effectiveness of methods such as oversampling (O).

arXiv:2207.00391