
This article has been accepted for publication in IEEE Access. This is the author's version, which has not been fully edited; content may change prior to final publication. Citation information: DOI 10.1109/ACCESS.2023.3306805.

Date of publication xxxx 00, 0000, date of current version November 9, 2022.
Digital Object Identifier 10.1109/ACCESS.2023.3306805

Solving Data Imbalance in Text Classification with Constructing Contrastive Samples

XI CHEN1, WEI ZHANG1, SHUAI PAN1, AND JIAYIN CHEN1
1 Advanced Institute of Information Technology, Peking University, No. 233 Yonghui Rd, Hangzhou, Zhejiang, 311215 China (e-mail: [email protected])

Corresponding author: Wei Zhang (e-mail: [email protected]).

This work was supported by the National Key Research and Development Program of China under Grant 2022YFF0903302.

ABSTRACT
Contrastive learning (CL) has been successfully applied in Natural Language Processing (NLP) as a powerful representation learning method and has shown promising results in various downstream tasks. Recent research has highlighted the importance of constructing effective contrastive samples through data augmentation. However, current data augmentation methods primarily rely on random word deletion, substitution, and cropping, which may introduce noisy samples and hinder representation learning. In this article, we propose a novel approach to address data imbalance in text classification by constructing contrastive samples. Our method uses a Label-indicative Component to generate high-quality positive samples for the minority class, together with a Hard Negative Mixing strategy to synthesize challenging negative samples at the feature level. By applying supervised contrastive learning to these samples, we obtain superior text representations, which significantly benefit text classification tasks with imbalanced data. Our approach effectively mitigates distributional biases and promotes noise-resistant representation learning. To validate the effectiveness of our method, we conducted experiments on benchmark datasets (THUCNews, AG's News, 20NG) as well as the imbalanced FDCNews dataset. The code for our method is publicly available at https://github.com/hanggun/CLDMTC.

INDEX TERMS Data imbalance, contrastive learning, data augmentation, hard negative samples, text classification

I. INTRODUCTION

Learning a good representation has become an essential problem for text classification tasks. Models of text representation include N-gram statistics [1], word embeddings [2], CNN-based [3], [4], RNN-based [5], and Transformer-based [6] approaches. In particular, powerful pre-trained models for text representation, such as BERT [7], have shown state-of-the-art performance on text classification tasks without any task-specific architectural adaptations. However, the above methods have a problem in dealing with data imbalance in text classification: the trained model tends to be biased towards the distribution of the original data.

There have been numerous approaches aimed at addressing the issue of data imbalance. Some methods incorporate external word knowledge through dynamic semantic representation [8], which improves the text representation. Semi-supervised techniques [9] select high-quality unlabeled data with pseudo labels to enhance model performance, but they often introduce noise that can negatively impact results. Researchers have also explored the use of Generative Adversarial Networks (GANs) to tackle data imbalance. Methods like sGAN [10] and H-GANs [11] generate high-quality synthetic data to enhance data representation and employ discriminators for classification. Bi-GAN [12], on the other hand, solely uses normal data to establish decision boundaries between normal and anomalous data. MFC-GAN [13] adds high-quality synthetic instances of the minority class to imbalanced datasets, preventing the model from being biased towards majority-class instances. However, GANs are prone to mode collapse, which makes training challenging.


Recently, contrastive learning (CL) has been successful as a powerful representation learning method that facilitates various downstream tasks [14], [15]; it aims to improve the ability to learn more discriminative and robust features by pulling semantically close neighbors together and pushing apart non-neighbors [16]. Inspired by contrastive learning, we are interested in exploring whether it could handle the data imbalance problem and alleviate distributional skews. However, a core problem within the contrastive learning framework is how to construct effective contrastive samples. To do so, [17] adopt a simple approach that takes samples with the same label as positive samples and samples with different labels as negative samples. This approach does not fulfill the potential of contrastive learning, because the experiments in [18] show that the more contrastive samples there are, the better the learning effect. Some researchers [19], [20] consider using common data augmentation techniques to generate positive contrastive samples, e.g., random word deletion, replacement, or span deletion. Although this method further improves the effect of text classification, there is a significant problem: it may produce noisy contrastive samples. The reason is that if the core semantic words of a sample are randomly deleted or replaced, the label of the new sample is likely to change, especially when the samples are not balanced. As shown in Figure 1a, keyword substitution generates samples with less noise. What is more, if the keywords related to the label are deleted in a random way, there is an even greater negative effect.

To alleviate the aforementioned problems, we propose a novel approach to construct contrastive samples for data imbalance in text classification. Specifically, we first design a statistical indicator, a Label-indicative Component (LIC), to find and sort out label-indicative keywords; we then replace label-indicative keywords with their most similar words using word vector techniques, such as Word2vec or BERT, to generate positive contrastive samples for the minority class. This method ensures that the labels of the generated new samples remain consistent with the original samples, while simultaneously increasing the number of samples in the minority class. Additionally, our observation reveals that many substitution words may not appear in the dataset, allowing the model to learn new samples beyond the distribution and enhance the learning of more generalized features. To further improve the quality of the constructed contrastive samples, we restrict the similarity between the original word and its substitute to a threshold, i.e., exceeding 0.6.

Besides, some recent works [18], [21] have shown that more contrastive samples do not necessarily mean meaningful samples. When the positive sample and a negative sample are far away in the semantic space, they hardly contribute to the contrastive loss. Inspired by these works, we adopt the Hard Negative Mixing (HNM) strategy [22] to synthesize hard negative samples at the feature level, which makes the model focus more on the learning of hard negative samples and allows it to learn more robust features. As shown in Figure 1b, positive samples are surrounded by many negatives but few hard ones. We propose to mix only the hardest negatives (based on their similarity to the positive) and synthesize new, hopefully also hard but more diverse, negative points.

In general, our method of constructing contrastive samples includes two parts: one is to generate positive samples for the minority class before training, and the other is to synthesize hard negative samples during mini-batch training. We minimize the cross entropy loss and the contrastive loss on these samples to improve text classification. To validate the effectiveness and generality of our approach, we conduct experiments on four text classification datasets, including three simulated benchmark datasets based on THUCNews, AG's News, and 20NG, and the imbalanced FDCNews dataset. The main contributions of this paper can be summarized as follows:
• We propose a novel approach of constructing contrastive samples and apply supervised contrastive learning for data imbalance in text classification, which alleviates distributional skews and improves performance.
• We design a Label-indicative Component to generate positive samples for the minority class and introduce a Hard Negative Mixing strategy to synthesize hard negative samples, which enhances the quality of the generated samples and promotes noise-resistant representation learning.
• Extensive experiments on four benchmark datasets (both in English and Chinese) illustrate the effectiveness of our approach. The results show that the accuracy rate is improved by 1% on average, and by up to 3.21%.

II. RELATED WORK
A. CONTRASTIVE LEARNING
Contrastive learning has become a rising representation learning method because of its significant success in various computer vision tasks [14], [23], [24]. Some researchers proposed to make the representations of different augmentations of an image agree with each other and showed positive results [18], [25]. Inspired by the success of contrastive learning in computer vision, many works have tried to use the contrastive learning framework to improve the representation learning of text and achieved significant results. [26] proposed a pre-training model, SimCSE, based on a contrastive sentence embedding framework and improved downstream tasks. In addition, several studies [27], [28] directly add a contrastive loss to the supervised task for joint learning to improve the model's representation ability. Our approach differs from these previous works in that we utilize a novel data augmentation, i.e., the Label-indicative Component and the Hard Negative Mixing strategy.

B. CONSTRUCTING CONTRASTIVE SAMPLES
The main difference among works on contrastive learning is their various ways of constructing contrastive samples through data augmentation. Currently, data augmentation for text is not as easy as for images. The main reason is that every word in a sentence may play an essential role in expressing the whole meaning or judging the label. CERT [29] applies back-translation to create positive samples of original sentences, while CLEAR [19] proposes random-words-deletion, spans-deletion, synonym-substitution, and reordering as four ways of constructing sentence-level contrastive samples.

Data augmentation      | THUCNews test accuracy (%)
None                   | 96.86
Random cropping        | 94.11
Random deletion        | 95.08
Random substitution    | 94.92
Keywords substitution  | 96.05
Keywords deletion      | 93.21

FIGURE 1. (a) Comparison of different data augmentations on the THUCNews test set. Random cropping: randomly crop and keep a continuous span; Keywords: words related to the label of the sample; substitution: words are substituted by their similar words. All operations are applied to 10% of the sentence length of 300. (b) A t-SNE plot of THUCNews samples represented by 64-dimensional embeddings on the unit hypersphere, showing a toy example of the hard negative mixing strategy (positive samples, negative samples, and synthesized negative samples).

These methods are not suitable for supervised learning tasks such as text classification, because they do not consider the impact of words on the label of a sentence. [15] uses text summarization as the data augmentation strategy for text classification; although this is highly related to our work, it processes document-level text and does not consider the impact of keywords on labels.

C. DATA IMBALANCE IN TEXT CLASSIFICATION

Data imbalance is a common problem in text classification. Traditional methods to deal with data imbalance involve resampling (randomly undersampling the majority class, randomly oversampling the minority class, and generating additional synthetic minority samples) or cost adjustment [30]. Synthetic resampling based on the Synthetic Minority Oversampling Technique (SMOTE) [31] is one of the most used approaches. Its basic idea is to interpolate between observations of the minority classes to oversample the training data. Our hard negative mixing strategy is similar to it, but we focus more on how to use contrastive learning to further solve the data imbalance problem.

III. METHOD
This section proposes a novel method of constructing contrastive samples for contrastive learning in text classification. The complete framework is depicted in Figure 2.

FIGURE 2. The framework of the proposed model incorporates several key contributions, which are highlighted in bold.

A. PRELIMINARIES
Our learning setup is based on a standard multi-class classification problem with input training samples {X_i, Y_i}, i = 1, ..., D. Given a token sequence X_i = [w_1, w_2, ..., w_t], we input it into an Encoder model, such as a CNN or BERT, and get a sequence of contextualized token representations H_i = {h_1, h_2, ..., h_d}, where t and d denote the length of the sequence and the dimension of the sequence representation, respectively.

The standard practice for text classification is to add a softmax classifier on top of the Encoder's sentence-level representations:

    p(Y_{i,c} \mid H_i) = \mathrm{softmax}(W H_i), \quad c \in C    (1)

where W ∈ R^{C×d} and C denotes the number of classes. A model is trained by minimizing the cross entropy loss:

    L_{CE} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} Y_{i,c} \log p(Y_{i,c} \mid H_i)    (2)

where N is the batch size.
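As a concrete illustration of this preliminary setup, the following PyTorch sketch shows an encoder followed by a softmax classifier trained with cross entropy, as in Eq. (1) and (2). It is a minimal sketch rather than the authors' released implementation; the encoder is a placeholder, and any CNN or BERT encoder that produces a fixed-size sentence vector could be plugged in.

```python
import torch.nn as nn

class SoftmaxTextClassifier(nn.Module):
    """Minimal sketch of Eq. (1)-(2): encoder output H_i -> softmax classifier."""
    def __init__(self, encoder: nn.Module, hidden_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder                                 # any CNN/BERT-style encoder (placeholder)
        self.classifier = nn.Linear(hidden_dim, num_classes)   # the weight W of Eq. (1)

    def forward(self, token_ids):
        h = self.encoder(token_ids)        # sentence-level representation H_i, shape (N, d)
        return self.classifier(h)          # logits; the softmax is folded into the loss below

# Cross entropy loss of Eq. (2); CrossEntropyLoss applies log-softmax internally.
criterion = nn.CrossEntropyLoss()
# loss = criterion(model(batch_token_ids), batch_labels)
```

The contrastive components introduced next are added on top of this baseline objective rather than replacing it.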
B. CONSTRUCTING CONTRASTIVE SAMPLES
In this paper, our method of constructing contrastive samples consists of two parts: one is to use the Label-indicative Component to generate positive samples for the minority class, and the other is to utilize Hard Negative Mixing to synthesize hard negative samples.

1) Label-indicative Component
We design a statistical indicator, a Label-indicative Component (LIC), to find and sort out the keywords most relevant to the label of each sample. We then substitute those label-indicative keywords as the data augmentation strategy to construct high-quality new positive contrastive samples for the minority class, which aims to effectively alleviate the distributional skews of the original dataset. The statistical indicator LIC is defined as follows:

    LIC(w, c) = \frac{TF_c(w) - \mu(TF_{-c}(w))}{\sigma(TF_{-c}(w))} \cdot IDF(w)    (3)

where TF_c(w) is the frequency of the word w in the category-c dataset and TF_{-c}(w) denotes its frequency in the categories other than c; μ and σ refer to the mean and variance, and IDF(w) is the inverse document frequency. The left part of Equation 3 represents the correlation between the word w and the label c, and the right part measures the importance of w in the whole dataset. Through this statistical indicator, we can find the label-indicative keywords (top 2%) in the text. Table 1 shows the keyword results calculated on the dataset.

Then we use word vector techniques, such as Word2vec, to replace the label-indicative keywords with their most similar words and generate contrastive samples, which makes the new samples carry the same label as their source text. In addition, to ensure the quality of the substitution words, we limit the similarity of the words, e.g., to be greater than 0.6. For example, a label-indicative keyword found in the original sample (us chip-related stocks including intel decline in europe.) is stocks, so we can generate a new sample (us chip-related shares including intel decline in europe.). The labels of these two samples are consistent.
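To make the LIC computation and the similarity-thresholded substitution concrete, here is a small, self-contained Python sketch of Eq. (3) and the replacement step. It is an illustrative reading of the description above rather than the authors' released code (see the CLDMTC repository for that); the tokenization, the gensim-style KeyedVectors interface, and all variable names are our own assumptions.

```python
import math
from collections import Counter

def lic_scores(docs, labels, target_class):
    """Score words of one class by Eq. (3): (TF_c - mean(TF_-c)) / sigma(TF_-c) * IDF."""
    classes = sorted(set(labels))
    tf = {c: Counter() for c in classes}            # term frequency per class
    df = Counter()                                  # document frequency, for the IDF term
    for tokens, y in zip(docs, labels):
        tf[y].update(tokens)
        df.update(set(tokens))
    n_docs = len(docs)
    others = [c for c in classes if c != target_class]
    scores = {}
    for w in tf[target_class]:
        tf_other = [tf[c][w] for c in others]       # TF_-c(w), one count per other class
        mu = sum(tf_other) / len(tf_other)
        sigma = math.sqrt(sum((x - mu) ** 2 for x in tf_other) / len(tf_other)) + 1e-6
        idf = math.log(n_docs / (1 + df[w]))        # a standard IDF variant (assumption)
        scores[w] = (tf[target_class][w] - mu) / sigma * idf
    return scores

def label_indicative_keywords(docs, labels, target_class, ratio=0.02):
    """Keep the top 2% highest-scoring words as the label-indicative keywords."""
    scores = lic_scores(docs, labels, target_class)
    k = max(1, int(len(scores) * ratio))
    return {w for w, _ in sorted(scores.items(), key=lambda x: -x[1])[:k]}

def augment(tokens, keywords, word_vectors, min_sim=0.6):
    """Replace a label-indicative keyword by its nearest neighbour if similarity > 0.6."""
    new_tokens = []
    for w in tokens:
        if w in keywords and w in word_vectors:     # word_vectors: e.g. gensim KeyedVectors
            cand, sim = word_vectors.most_similar(w, topn=1)[0]
            new_tokens.append(cand if sim > min_sim else w)
        else:
            new_tokens.append(w)
    return new_tokens
```

In this reading, the keywords are computed once over the training split before training, and only minority-class documents are augmented, matching the description above; the word_vectors object would be loaded from the Sogou News or GloVe embeddings mentioned in Section IV.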
2) Hard Negative Mixing
In MoCo [21] the authors show that increasing the batch size is crucial to obtain better negative samples. However, Kalantidis et al. [22] have experimentally shown that as training progresses, fewer and fewer negatives offer significant contributions to the loss, which illustrates that most of the negatives are practically not helping much towards learning the contrastive task.

Inspired by the above works, we adopt Hard Negative Mixing (HNM) at the feature level to force the model to concentrate on more meaningful negative samples and learn more robust features. For a sample X_i, consider its feature vector H_i and the ordered set of negative feature vectors P = {H_1^-, H_2^-, ..., H_M^-} from the same mini batch, such that sim(H_i, H_1^-) > sim(H_i, H_2^-) > ..., i.e., the set of negative features sorted by decreasing similarity to that particular positive feature, where sim(u, v) = u^T v / (||u||_2 ||v||_2) denotes the cosine similarity between two vectors.

For each positive sample, we synthesize s hard negative features by creating convex linear combinations of pairs of its "hardest" existing negatives. We define the hardest negatives by truncating the ordered set P, i.e., only keeping the first K < M items. Formally, let G = {G_1^-, G_2^-, ..., G_s^-} be the set of synthetic samples to be generated. Then, a synthetic sample G_k ∈ G is given by:

    G_k = \frac{\hat{G}_k}{\|\hat{G}_k\|_2}, \quad \hat{G}_k = \alpha_k H_i^- + (1 - \alpha_k) H_j^-, \quad \alpha_k = \frac{e^{sim(H_i, H_i^-)}}{e^{sim(H_i, H_i^-)} + e^{sim(H_i, H_j^-)}}    (4)

where H_i^-, H_j^- ∈ P^K are randomly chosen negative features from the set P^K = {H_1^-, H_2^-, ..., H_K^-} of the closest K negatives, α_k ∈ (0, 1) is a mixing coefficient that balances the contribution of the two negative samples, and ||·||_2 is the ℓ2-norm. For the parameter α_k, [22] adopt a manually set value, such as a uniform 0.5. Here, we dynamically calculate α_k based on the similarity between the two negative samples and the positive sample, so that the synthesized negative sample carries more information close to the positive sample, further improving the rationality of the synthesized hard negatives. After mixing, the synthesized samples and the original samples form a new negative set that participates in the calculation of the contrastive loss.
Hi and the ordered set of negative feature vectors P =
{H1− , H2− , ..., HM

} from the same mini batch, such that: loss(Hi , Hi+ ) =
sim(Hi , H1 ) > sim(Hi , H2− ), the set of negative features

+
sorted by decreasing similarity to that particular positive fea- esim(Hi ,Hi )/τ (5)
− log −
ture, where sim(u, v) = uT v/(||u||2 ||v||2 ) denotes cosine +
esim(Hi ,Hi )/τ + j∈P ∪G esim(Hi ,Hj )/τ
P
similarity between two vectors.
For each positive sample, we synthesize s hard negative where Hi , Hi+ , Hi− are the feature vectors of the corre-
features by creating convex linear combinations of pairs of its sponding samples, and τ is a temperature hyperparameter.
“hardest” existing negatives. We define the hardest negatives The overall contrastive learning loss is defined as the sum of
by truncating the ordered set P , i.e. only keeping the first all positive pairs’ loss in a mini batch:
K < M items. Formally, let G = {G− − −
1 , G2 , ..., Gs } be Ti
N X
the set of synthetic sample to be generated. Then, a synthetic
X
+
LCL = loss(Hi , Hij ) (6)
sample Gk ∈ G, would be given by: i=1 j=1


Label        | Label-indicative Keywords
Science/Tech | {sp, java, telescope, cyber, ipod, uses, flaws, yahoo, malicious, robot}
World        | {darfur, palestinians, muslim, embassy, allawi, gunmen, arafat, sudan, clashes, lebanon}
Business     | {fullquote, kmart, href, boosted, yukos, marsh, celebrex, disruptions, stocks, widened}
Sports       | {nascar, doping, coaches, ron, titans, memphis, quarterback, panthers, robinson, speedway}

TABLE 1. An example of the top 10 label-indicative keywords of each label from the AG's News dataset.

where T_i is the number of positive pairs in the mini batch for the sample H_i. Combining the classification loss with a trade-off parameter, we obtain the final loss function:

    L = L_{CE} + \lambda L_{CL}    (7)

where λ is a hyperparameter that controls the relative importance of the cross entropy loss and the supervised contrastive loss.
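A compact PyTorch sketch of how Eq. (5)-(7) fit together in one training step is given below. It combines the cross entropy loss with the supervised contrastive loss over in-batch positives, in-batch negatives, and the hard negatives produced by the hard_negative_mixing sketch above. This is an illustrative sketch under our own naming assumptions (e.g., the model returning both logits and sentence features, and a mini batch containing more than one class), not the authors' released training loop.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, hard_negs, temperature=0.1):
    """Sketch of Eq. (5)-(6): sum the loss over every positive pair in the mini batch.

    features:  (N, d) l2-normalized representations H_i
    labels:    (N,)   class labels
    hard_negs: list of N tensors of shape (s, d), synthesized hard negatives per anchor
    temperature: tau in Eq. (5); the default value here is an assumption (see Appendix B)
    """
    n = features.size(0)
    sim = features @ features.t() / temperature                  # sim(H_i, H_j) / tau
    total = features.new_zeros(())
    for i in range(n):
        pos_mask = labels == labels[i]
        pos_mask[i] = False                                      # the anchor is not its own positive
        neg_mask = labels != labels[i]
        hard_sim = features[i] @ hard_negs[i].t() / temperature  # similarities to synthesized G_k
        neg_terms = torch.exp(torch.cat([sim[i][neg_mask], hard_sim]))   # P ∪ G in Eq. (5)
        for pos in sim[i][pos_mask]:                             # one loss term per positive pair
            total = total - torch.log(torch.exp(pos) / (torch.exp(pos) + neg_terms.sum()))
    return total

def training_step(model, batch_tokens, labels, lam=0.1):
    """Eq. (7): L = L_CE + lambda * L_CL, with lambda = 0.1 as in the paper's settings."""
    logits, features = model(batch_tokens)           # assumed: model returns logits and H_i
    features = F.normalize(features, dim=1)
    hard_negs = [hard_negative_mixing(features[i], features[labels != labels[i]].detach())
                 for i in range(features.size(0))]
    return F.cross_entropy(logits, labels) + lam * supervised_contrastive_loss(features, labels, hard_negs)
```

Detaching the negatives before mixing is a design choice in this sketch: gradients still flow through the anchor's side of the similarity, but not through the synthesis of the hard negatives.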
IV. EXPERIMENTS
A. EXPERIMENT SETUP
1) Dataset
To validate the effectiveness and generality of our approach, we conduct experiments on four text classification datasets, including three simulated benchmark datasets based on THUCNews, AG's News, and 20NG, and the imbalanced FDCNews dataset. Since the first three datasets are balanced, we construct them so that part of the labels are unbalanced. Please see Appendix A for details of the construction.
• The THUCNews dataset¹ is a Chinese news classification dataset collected by Tsinghua University.
• The AG's News dataset² was created by Xiang Zhang [32] and contains 127600 samples with 4 classes.
• The 20NG dataset³ (bydata version) is an English news dataset that contains 18846 documents evenly categorized into 20 different categories.
• The FDCNews dataset⁴ is provided by Fudan University and contains 9833 Chinese news articles categorized into 20 different classes. This is an unbalanced dataset and does not need to be constructed.

2) Encoder Model
Our method is proposed as an enhancement for current mainstream models and focuses on the influence of contrastive learning on the encoded vectors. Therefore, we only select two widely used model structures, the CNN model [3] and the pre-trained model BERT [7], as our encoder models.

3) Components for Comparison
In order to explore the effects of contrastive learning and our proposed components for data imbalance in text classification, we conduct the following combined comparative tests.
• LIC denotes the use of our proposed data augmentation method, the Label-indicative Component (LIC), to generate new samples for training.
• CL represents adding the Contrastive Loss (CL) to the original training task.
• LIC+CL means adding contrastive learning to the new dataset generated by data augmentation.
• LIC+HNM+CL is based on LIC+CL, adding the Hard Negative Mixing (HNM) method; this is also our overall framework of constructing contrastive samples.

4) Evaluation Metrics
We use two metrics to measure the performance of all the approaches: accuracy and Macro-F1 [33]. Macro-F1 equals the average F1-score over labels, which is suitable for the evaluation of the minority class.

5) Settings
For constructing contrastive samples, we use Sogou News word vectors (word 300d)⁵ to look up similar words for Chinese tasks, and GloVe embeddings⁶ for English tasks. For the CNN, we use filters of sizes [2, 3, 4], the number of filters for each convolution block is 128, and the embedding size is 300. For BERT, we use BERT-Base-Cased [7]⁷ for AG's News and 20NG, and RoBERTa_zh_L12 [34]⁸ for THUCNews and FDCNews. In the BERT models, we obtain text representations from the BERT model and then use a dense layer with 256 units to decrease the dimension of the text representation to 256. We tune some parameters of our models by grid search on the validation dataset; we finally set λ to 0.1, the number of synthetic hard negative samples to s = 10, and the number of closest negatives to K = 5. All models are optimized with Adam [35] using an initial learning rate of 0.001 and a batch size of 64. All of the models are implemented using PyTorch and are trained on a GeForce GTX 1070 Ti GPU. More parameter settings are in Appendix B.

¹ http://thuctc.thunlp.org
² https://di.unipi.it
³ https://www.cs.umb.edu/~smimarog/textmining/datasets
⁴ http://www.nlpir.org
⁵ https://github.com/Embedding/Chinese-Word-Vectors
⁶ https://github.com/stanfordnlp/GloVe
⁷ https://github.com/google-research/bert
⁸ https://github.com/brightmart/roberta_zh


Model            | AG's News       | THUCNews        | FDCNews         | 20NG
                 | Acc    Macro-F1 | Acc    Macro-F1 | Acc    Macro-F1 | Acc    Macro-F1
CNN              | 0.8797 0.8797   | 0.7989 0.7693   | 0.9694 0.8615   | 0.7550 0.7452
CNN+LIC          | 0.8822 0.8820   | 0.7987 0.7856   | 0.9723 0.8751   | 0.7630 0.7534
CNN+CL           | 0.8763 0.8757   | 0.8092 0.7912   | 0.9706 0.8719   | 0.7716 0.7621
CNN+LIC+CL       | 0.8871 0.8872   | 0.8158 0.8076   | 0.9728 0.8836   | 0.7865 0.7787
CNN+LIC+HNM+CL   | 0.8844 0.8848   | 0.8219 0.8150   | 0.9752 0.8825   | 0.7865 0.7787
BERT             | 0.8902 0.8895   | 0.8216 0.7905   | 0.9724 0.8763   | 0.7901 0.7711
BERT+LIC         | 0.8989 0.9022   | 0.8275 0.8136   | 0.9783 0.8807   | 0.8021 0.7782
BERT+CL          | 0.8893 0.8867   | 0.8299 0.8172   | 0.9694 0.8731   | 0.8114 0.7838
BERT+LIC+CL      | 0.9115 0.9123   | 0.8324 0.8216   | 0.9798 0.8836   | 0.8203 0.7924
BERT+LIC+HNM+CL  | 0.9108 0.9127   | 0.8419 0.8285   | 0.9811 0.8883   | 0.8222 0.8104

TABLE 2. Test accuracy and Macro-F1 on the different text classification tasks.

FIGURE 3. (a) Cosine similarity matrix and (b) corresponding t-SNE visualization of sample representations from the 20NG dataset. motocycles-Aug is the constructed sample, forming a pair of positive samples with motocycles, and the others are negative samples; hard negative mixing denotes the synthesized samples.

FIGURE 4. Hyper-parameter analysis of our model: (a) the effect of λ in the loss function and (b) the effect of batch size on contrastive learning. Here CNN_our denotes CNN+LIC+HNM+CL.

B. EXPERIMENTAL RESULTS
1) Test Performance
Table 2 presents the test accuracy and Macro-F1 on all datasets. We can observe that the proposed constructing contrastive samples method (LIC+HNM+CL) outperforms the base models and the other ablation models on most datasets, which shows the effectiveness of constructing contrastive samples for the imbalanced text classification task. The results of the Encoder (CNN or BERT)+LIC models show that generating positive samples for the minority class can improve the accuracy and Macro-F1 of the classification task, by up to 1%. The results of the Encoder+CL and Encoder+LIC+CL models show that adding the contrastive loss alone does not lead to improvement on the AG's News and FDCNews datasets, but adding the contrastive loss together with the constructed samples surpasses the baselines on every dataset, which illustrates that constructing contrastive samples can improve the feature representation ability of contrastive learning. The Encoder+LIC+HNM+CL models achieve most of the best results, which shows the effectiveness of the hard negative mixing strategy.


             | The Majority Class                           | The Minority Class
Model        | Sports | Estate | Education | Fashion | Game | Finance | Home  | Technology | Politics | Entertainment
CNN          | 0.997  | 0.937  | 0.937     | 0.997   | 0.990| 0.618   | 0.364 | 0.037      | 0.861    | 0.966
CNN_our      | 1.000  | 0.934  | 0.964     | 0.996   | 0.986| 0.613   | 0.651 | 0.181      | 0.891    | 0.976
Performance  | 0.003↑ | 0.003↓ | 0.027↑    | 0.001↓  | 0.004↓| 0.005↓ | 0.287↑| 0.144↑     | 0.030↑   | 0.010↑

TABLE 3. The accuracy of each class on the THUCNews test dataset.

The biggest improvement was achieved by the BERT+LIC+HNM+CL model on the 20NG dataset, with a 3.21% increase in test accuracy and 3.93% in test Macro-F1. The natural confusion of the labels in the 20NG dataset sheds light on why our model performs so well there. Specifically, labels in the same group are highly similar, and contrastive learning together with the hard negative mixing method can more effectively improve the model's ability to learn discriminative features.

2) Constructing Contrastive Samples
We further visualize the learned sample representations on the 20NG dataset, as shown in Figure 3. We can observe that the constructed positive sample is highly similar to the original sample, and the synthesized hard negative samples are also very similar, as shown in Figure 3a. From Figure 3b, we find that the synthesized negative samples are the closest to the positive samples in the semantic space. This shows that our Label-indicative Component can guarantee the quality of the constructed positive samples, and that the hard negative mixing strategy can make the model focus more on hard negative samples.

C. ABLATION STUDY
We investigate how different values of λ and different batch sizes affect our models' performance. All results use the CNN_our model, evaluated on the development set of the 20NG dataset. In Appendix C, we explore the number of hard negative samples.

1) The effect of λ
The parameter λ is a controlling hyper-parameter that decides the relative importance of the cross entropy loss and the supervised contrastive loss. A larger λ gives contrastive learning more weight when training the classification task. Figure 4a shows the accuracy curve for different λ on the 20NG dataset. We can see from the graph that if λ is too large, the model becomes too biased towards contrastive learning and the classification effect is reduced. If it is too small, the effect is similar to the result with λ = 0. Therefore, we set λ to 0.1 in our experiments.

2) The effect of batch size
In addition, we explore the impact of batch size. The reason is that [21] found that the larger the batch size, the more negative samples, and the better the contrastive learning performance. However, we find that our experiments do not follow this pattern. As shown in Figure 4b, the experimental results for different batch sizes are not very different. When the batch size is too large, for example batch_size=256 in this case, the effect is slightly reduced. We think the reason is that as the batch size grows, the number of positive samples also increases, so the proportion between positive and negative samples does not increase.

D. ANALYSIS
In this section, we further conduct analyses to understand the inner workings of our method.

1) Data imbalance
Table 3 shows the accuracy of each class on the THUCNews test dataset, including the majority-class accuracy and the minority-class accuracy. We observe that our proposed method brings a more obvious improvement on the minority class, especially for those labels with particularly poor classification performance, which indicates that our method can alleviate distributional skews and is more suitable for imbalanced text classification tasks. In addition, we also find that our model has slight negative effects on some labels. This shows that, for some labels, the benefit of contrastive learning with data augmentation is lower than the impact of the introduced noise. Therefore, how to further improve the quality of data augmentation is still worthy of future research.

2) Data augmentation
To explore the impact of different data augmentations on contrastive learning, we compare 5 methods under two processing ratios; each method is trained 5 times and we report the mean ± standard deviation, with the results shown in Table 4. We can see that our method LIC (semantic replacement of label-indicative words) has the most stable performance under different processing ratios. Furthermore, each data augmentation produces a certain amount of noise: the Randomly Delete method has the largest fluctuation, and the Delete Label-indicative Words method has the largest negative effect. These results also indicate that, under supervised learning, how data augmentation treats label-indicative words is crucial. In Appendix D, we further show the impact of these data augmentations under more processing ratios.

3) Contrastive learning on text classification
To directly show the strengths of our approach for data imbalance, we compare the performance of contrastive learning on balanced and unbalanced text classification in Table 5.

Augmentation  | 2%              | 5%
None          | 0.8834 (no augmentation)
RR            | 0.8843 ± 0.0108 | 0.8737 ± 0.0111
RD            | 0.8801 ± 0.0122 | 0.8842 ± 0.0196
RR_LW         | 0.8730 ± 0.0044 | 0.8729 ± 0.0039
D_LW          | 0.8726 ± 0.0021 | 0.8688 ± 0.0018
SR_LW (LIC)   | 0.8847 ± 0.0008 | 0.8831 ± 0.0011

TABLE 4. Macro-F1 of different data augmentations on the FDCNews development set, based on the CNN_our model. RR k%: Randomly Replace 100-k% of the length; RD k%: Randomly Delete k% of the words; RR_LW k%: Randomly Replace k% of the Label-indicative Words; D_LW k%: Delete k% of the Label-indicative Words; SR_LW k%: Semantically Replace k% of the Label-indicative Words, corresponding to our LIC method.

Model          | CNN    | CNN+CL | Improvement
AG's News (*)  | 0.9244 | 0.9255 | 0.0011↑
AG's News      | 0.8797 | 0.8757 | 0.0040↓
THUCNews (*)   | 0.9631 | 0.9664 | 0.0033↑
THUCNews       | 0.7693 | 0.7912 | 0.0219↑
20NG (*)       | 0.8352 | 0.8401 | 0.0049↑
20NG           | 0.7452 | 0.7621 | 0.0169↑

TABLE 5. Predictive improvement of each comparison algorithm on balanced and unbalanced text classification. (*) denotes the original dataset, whose samples are balanced.

We can see that on THUCNews and 20NG, contrastive learning brings a significant improvement in the imbalanced setting. However, on AG's News there is no obvious improvement, and there is even a decline in the imbalanced case. The reason should be that this dataset has only four labels, and the discrimination between labels is already so high that contrastive learning is not needed for feature enhancement. In other words, supervised contrastive learning is more suitable for datasets with strong similarity between labels, such as multi-label text classification.

V. CONCLUSION AND FUTURE WORK
In this work, we propose an approach to construct contrastive samples for data imbalance in text classification, including a Label-indicative Component (LIC) to generate positive samples for the minority class and a Hard Negative Mixing (HNM) strategy to synthesize hard negative samples. We perform supervised contrastive learning on these samples to alleviate distributional skews and improve classification performance. Experiments on four benchmark datasets have demonstrated the enhancement brought by our method across several popular deep learning models. We believe that our approach of constructing contrastive samples, particularly for addressing data imbalance, has broader applications in NLP. It provides a new perspective on data augmentation with text inputs in contrastive learning. Our future work includes the following directions: (i) further improving the quality of the constructed samples; (ii) generalizing our method to multi-label classification problems.

Conflicts of Interest: The authors declare that they have no competing interests.

REFERENCES
[1] Sida Wang and Christopher Manning, "Baselines and bigrams: Simple, good sentiment and topic classification," in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Jeju Island, Korea, July 2012, pp. 90–94, Association for Computational Linguistics.
[2] Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov, "Bag of tricks for efficient text classification," in Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, Apr. 2017, pp. 427–431, Association for Computational Linguistics.
[3] Yoon Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, Oct. 2014, pp. 1746–1751, Association for Computational Linguistics.
[4] Rie Johnson and Tong Zhang, "Deep pyramid convolutional neural networks for text categorization," in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, July 2017, pp. 562–570, Association for Computational Linguistics.
[5] Duyu Tang, Bing Qin, and Ting Liu, "Document modeling with gated recurrent neural network for sentiment classification," in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, Sept. 2015, pp. 1422–1432, Association for Computational Linguistics.
[6] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds., 2017, vol. 30, Curran Associates, Inc.
[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," 2018.
[8] Tianshi Wang, Li Liu, Naiwen Liu, Huaxiang Zhang, Long Zhang, and Shanshan Feng, "A multi-label text classification method via dynamic semantic representation model and deep neural network," Applied Intelligence, vol. 50, pp. 2339–2351, 2020.
[9] Jianyu Long, Yibin Chen, Zhe Yang, Yunwei Huang, and Chuan Li, "A novel self-training semi-supervised deep learning approach for machinery fault diagnosis," International Journal of Production Research, pp. 1–14, 2022.
[10] Kouhei Nakaji and Naoki Yamamoto, "Quantum semi-supervised generative adversarial network for enhanced data classification," Scientific Reports, vol. 11, no. 1, pp. 19649, 2021.
[11] R. Elakkiya, Pandi Vijayakumar, and Neeraj Kumar, "An optimized generative adversarial network based continuous sign language classification," Expert Systems with Applications, vol. 182, pp. 115276, 2021.
[12] Ziqiang Pu, Diego Cabrera, Yun Bai, and Chuan Li, "A one-class generative adversarial detection framework for multifunctional fault diagnoses," IEEE Transactions on Industrial Electronics, vol. 69, no. 8, pp. 8411–8419, 2021.
[13] Adamu Ali-Gombe and Eyad Elyan, "MFC-GAN: Class-imbalanced dataset classification using multiple fake class generative adversarial network," Neurocomputing, vol. 361, pp. 212–221, 2019.
[14] Olivier J. Henaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aaron van den Oord, "Data-efficient image recognition with contrastive predictive coding," 2020.
[15] Yangkai Du, Tengfei Ma, Lingfei Wu, Fangli Xu, Xuhong Zhang, Bo Long, and Shouling Ji, "Constructing contrastive samples via summarization for text classification with limited annotations," 01 2021, pp. 1365–1376.
[16] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006, vol. 2, pp. 1735–1742.


[17] Sora Ohashi, Junya Takayama, Tomoyuki Kajiwara, Chenhui Chu, and Yuki Arase, "Text classification with negative supervision," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020, pp. 351–357, Association for Computational Linguistics.
[18] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, "A simple framework for contrastive learning of visual representations," in International Conference on Machine Learning. PMLR, 2020, pp. 1597–1607.
[19] Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma, "CLEAR: Contrastive learning for sentence representation," arXiv preprint arXiv:2012.15466, 2020.
[20] Lin Pan, Chung-Wei Hang, Avirup Sil, and Saloni Potdar, "Improved text classification via contrastive adversarial training," in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, vol. 36, pp. 11130–11138.
[21] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick, "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[22] Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus, "Hard negative mixing for contrastive learning," Advances in Neural Information Processing Systems, vol. 33, pp. 21798–21809, 2020.
[23] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[24] Yonglong Tian, Dilip Krishnan, and Phillip Isola, "Contrastive multiview coding," in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16. Springer, 2020, pp. 776–794.
[25] Ishan Misra and Laurens van der Maaten, "Self-supervised learning of pretext-invariant representations," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6707–6717.
[26] Tianyu Gao, Xingcheng Yao, and Danqi Chen, "SimCSE: Simple contrastive learning of sentence embeddings," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, Nov. 2021, pp. 6894–6910, Association for Computational Linguistics.
[27] Tian Shi, Liuqing Li, Ping Wang, and Chandan K. Reddy, "A simple and effective self-supervised contrastive learning framework for aspect detection," in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, pp. 13815–13824.
[28] Haotian Fu, Hongyao Tang, Jianye Hao, Chen Chen, Xidong Feng, Dong Li, and Wulong Liu, "Towards effective context for meta-reinforcement learning: an approach based on contrastive learning," in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, pp. 7457–7465.
[29] Hongchao Fang, Sicheng Wang, Meng Zhou, Jiayuan Ding, and Pengtao Xie, "CERT: Contrastive self-supervised learning for language understanding," arXiv preprint arXiv:2005.12766, 2020.
[30] Paula Branco, Luís Torgo, and Rita P. Ribeiro, "A survey of predictive modeling on imbalanced domains," ACM Computing Surveys (CSUR), vol. 49, no. 2, pp. 1–50, 2016.
[31] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[32] Xiang Zhang, Junbo Zhao, and Yann LeCun, "Character-level convolutional networks for text classification," Advances in Neural Information Processing Systems, vol. 28, 2015.
[33] Siddharth Gopal and Yiming Yang, "Recursive regularization for large-scale classification with hierarchical and graphical dependencies," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 257–265.
[34] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," 2019.
[35] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," International Conference on Learning Representations, 12 2014.

XI CHEN received the Ph.D. degree in mechanical engineering from Zhejiang University, China, in 2005. After graduation, he focused on data analytics and software development in industry for a number of years, and he is currently a researcher at the Advanced Institute of Information Technology, Peking University, Hangzhou, China. His current research interests include natural language processing, computer vision, and the data mining of large data sets.

WEI ZHANG is a research associate. His research interests include software engineering and the modeling of big data algorithms in the industrial field.

SHUAI PAN received the M.S. degree from the University of Edinburgh, Scotland, U.K., in 2020. His current research interests include natural language processing, sentiment analysis, and natural language generation.

JIAYIN CHEN received the M.S. degree from Dalian University of Technology in 2016. He is currently working as an algorithm engineer at Hithink Flush Information Network Co., Ltd., China. His current research interests include natural language processing, information extraction, knowledge graphs, and question answering.
