Solving Data Imbalance in Text Classification With
ABSTRACT
Contrastive learning (CL) has been successfully applied in Natural Language Processing (NLP) as a
powerful representation learning method and has shown promising results in various downstream tasks.
Recent research has highlighted the importance of constructing effective contrastive samples through data
augmentation. However, current data augmentation methods primarily rely on random word deletion,
substitution, and cropping, which may introduce noisy samples and hinder representation learning. In
this article, we propose a novel approach to address data imbalance in text classification by constructing
contrastive samples. Our method involves the use of a Label-indicative Component to generate high-
quality positive samples for the minority class, along with the introduction of a Hard Negative Mixing
strategy to synthesize challenging negative samples at the feature level. By applying supervised contrastive
learning to these samples, we are able to obtain superior text representations, which significantly benefit
text classification tasks with imbalanced data. Our approach effectively mitigates distributional biases
and promotes noise-resistant representation learning. To validate the effectiveness of our method, we
conducted experiments on benchmark datasets (THUCNews, AG’s News, 20NG) as well as the imbalanced
FDCNews dataset. The code for our method is publicly available at the following GitHub repository:
https://github.com/hanggun/CLDMTC.
INDEX TERMS Data imbalance, contrastive learning, data augmentation, hard negative samples, text
classification
various downstream tasks [14], [15], which aims to improve the ability to learn more discriminative and robust features by pulling semantically close neighbors together and pushing apart non-neighbors [16]. Inspired by contrastive learning, we are interested in exploring whether it can handle the data imbalance problem and alleviate distributional skews. However, a core problem within the contrastive learning framework is how to construct effective contrastive samples. To do so, [17] adopt a simple approach of taking samples with the same label as positive samples and samples with different labels as negative samples. This approach does not fulfill the potential of contrastive learning, because the experiments in [18] show that the more contrastive samples there are, the better the learning effect. Some researchers [19], [20] consider using common data augmentation techniques to generate positive contrastive samples, e.g., random word deletion, replacement, or span deletion. Although this method further improves text classification, it has a significant problem: it may produce noisy contrastive samples. The reason is that if the core semantic words of a sample are randomly deleted or replaced, the label of the new sample is likely to change, especially when the samples are not balanced. As shown in Figure 1a, keywords substitution generates samples with less noise. What is more, if the keywords related to the label are deleted at random, the negative effect is even greater.

To alleviate the aforementioned problems, we propose a novel approach to construct contrastive samples for data imbalance in text classification. Specifically, we first design a statistical indicator, the Label-indicative Component (LIC), to find and sort label-indicative keywords; we then replace label-indicative keywords with their most similar words using word vector techniques, such as Word2vec or BERT, to generate positive contrastive samples for the minority class. This method ensures that the labels of the generated samples remain consistent with the original samples, while simultaneously increasing the number of samples in the minority class. Additionally, we observe that many substitution words do not appear in the dataset, allowing the model to learn samples beyond the original distribution and to acquire more generalized features. To further improve the quality of the constructed contrastive samples, we require the similarity between a word and its replacement to exceed a threshold of 0.6.

Besides, some recent works [18], [21] have shown that more contrastive samples are not necessarily meaningful samples. When a positive sample and a negative sample are far apart in the semantic space, they hardly contribute to the contrastive loss. Inspired by these works, we adopt the Hard Negative Mixing (HNM) strategy [22] to synthesize hard negative samples at the feature level, which makes the model focus more on hard negatives and allows it to learn more robust features. As shown in Figure 1b, most negatives of a positive sample are easy and only a few are hard. We propose to mix only the hardest negatives (based on their similarity to the positive) and synthesize new, hopefully also hard but more diverse, negative points.

In general, our method of constructing contrastive samples includes two parts: one is to generate positive samples for the minority class before training, and the other is to synthesize hard negative samples during mini-batch training. We minimize the cross entropy loss and the contrastive loss on these samples to improve text classification. To validate the effectiveness and generality of our approach, we conduct experiments on four text classification datasets, including three simulated benchmark datasets based on THUCNews, AG's News, and 20NG, and the imbalanced FDCNews dataset. The main contributions of this paper can be summarized as follows:
• We propose a novel approach to constructing contrastive samples and apply supervised contrastive learning to data imbalance in text classification, which alleviates distributional skews and improves performance.
• We design the Label-indicative Component to generate positive samples for the minority class and introduce the Hard Negative Mixing strategy to synthesize hard negative samples, which enhance the quality of the generated samples and promote noise-resistant representation learning.
• Extensive experiments on four benchmark datasets (both in English and Chinese) illustrate the effectiveness of our approach. The results show that accuracy is improved by 1% on average, and by up to 3.21%.

II. RELATED WORK
A. CONTRASTIVE LEARNING
Contrastive learning has become a prominent representation learning method because of its significant success in various computer vision tasks [14], [23], [24]. Some researchers proposed to make the representations of different augmentations of an image agree with each other and showed positive results [18], [25]. Inspired by the success of contrastive learning in computer vision, many works have tried to use the contrastive learning framework to improve the representation learning of text and have achieved significant results. [26] proposed a pre-training model, SimCSE, based on a contrastive sentence embedding framework and improved downstream tasks. In addition, several studies [27], [28] directly add a contrastive loss to the supervised task for joint learning to improve the model's representation ability. Our approach differs from these previous works in that we utilize a novel data augmentation, i.e., the Label-indicative Component and the Hard Negative Mixing strategy.

B. CONSTRUCTING CONTRASTIVE SAMPLES
The main difference among works on contrastive learning is their various ways of constructing contrastive samples through data augmentation. Currently, data augmentation for text is not as easy as for images. The main reason is that every word in a sentence may play an essential role in expressing the whole meaning or determining the label. CERT [29] applies back-translation to create positive samples of original sentences, while CLEAR [19] proposes random-words-deletion, spans-deletion, synonym-substitution, and
(a) Comparison of different data augmentations on the THUCNews test set:

    Data augmentation       THUCNews
    None                    96.86
    Random cropping         94.11
    Random deletion         95.08
    Random substitution     94.92
    Keywords substitution   96.05
    Keywords deletion       93.21

(b) An example of the hard negative mixing strategy (scatter legend: Positive Samples, Negative Samples, Synthesize Negative Samples).
FIGURE 1. (a) Comparison of different data augmentations on the THUCNews test set. Random cropping: randomly crop and keep a continuous span; Keywords: words related to the label of the sample; substitution: words are substituted by their similar words. All operations are applied to 10% of the sentence length of 300. (b) A t-SNE plot of samples from THUCNews represented as 64-dimensional embeddings on the unit hypersphere, showing a toy example of the hard negative mixing strategy.
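Figure 1b illustrates hard negative mixing at the feature level. The exact mixing rule we use is defined in the method section and is not reproduced in this excerpt; the sketch below is only an illustrative PyTorch rendering in the spirit of MoCHi [22], assuming that synthetic negatives are convex combinations of the K closest in-batch negatives to the anchor, re-normalized onto the unit hypersphere. The function name hard_negative_mixing and the pairwise convex-mixing rule are assumptions rather than our exact definition; the defaults k = 5 and s = 10 match the values reported later in the experimental settings.

import torch
import torch.nn.functional as F

def hard_negative_mixing(anchor, negatives, k=5, s=10):
    # anchor: (d,) L2-normalized representation of the anchor sample.
    # negatives: (n, d) L2-normalized representations of in-batch negatives.
    # Returns (s, d) synthetic hard negatives on the unit hypersphere.
    sims = negatives @ anchor                      # hardness = cosine similarity to the anchor
    k = min(k, negatives.size(0))
    hard = negatives[sims.topk(k).indices]         # (k, d) hardest negatives

    # Mix random pairs of the hardest negatives with random convex weights.
    i = torch.randint(0, k, (s,))
    j = torch.randint(0, k, (s,))
    alpha = torch.rand(s, 1)
    mixed = alpha * hard[i] + (1.0 - alpha) * hard[j]

    # Project the synthetic negatives back onto the unit hypersphere, as in Figure 1b.
    return F.normalize(mixed, dim=-1)

In words, the synthetic points inherit the hardness of their parents while adding diversity, which is the effect the toy example in Figure 1b is meant to convey.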
TABLE 1. An example of top 10 label-indicative keywords of each label from AG News dataset.
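Table 1 lists, for each AG's News label, the ten keywords that the Label-indicative Component ranks highest. The exact LIC statistic is defined in the method section and is not reproduced in this excerpt; the sketch below is therefore a hedged stand-in that scores a word by how exclusively and how often it occurs under one label, and then performs the word-vector substitution with the 0.6 similarity threshold described in the Introduction. The function names, the scoring formula, and the example loading path are illustrative assumptions, not our implementation.

from collections import Counter, defaultdict
from gensim.models import KeyedVectors

def label_indicative_keywords(corpus, top_n=10):
    # corpus: iterable of (tokens, label) pairs.
    # Stand-in score: label purity of a word weighted by its support under that label.
    word_label = defaultdict(Counter)
    for tokens, label in corpus:
        for w in set(tokens):
            word_label[w][label] += 1
    ranked = defaultdict(list)
    for w, counts in word_label.items():
        label, c = counts.most_common(1)[0]
        ranked[label].append((w, c / sum(counts.values()) * c))
    return {label: [w for w, _ in sorted(ws, key=lambda x: -x[1])[:top_n]]
            for label, ws in ranked.items()}

def augment_minority_sample(tokens, label_keywords, wv, threshold=0.6):
    # Replace each label-indicative word with its nearest word-vector neighbor,
    # keeping the substitution only when the similarity exceeds the 0.6 threshold.
    # Example (path is hypothetical): wv = KeyedVectors.load_word2vec_format("sogou_news_300d.txt")
    new_tokens = list(tokens)
    for pos, w in enumerate(tokens):
        if w in label_keywords and w in wv:
            neighbor, sim = wv.most_similar(w, topn=1)[0]
            if sim > threshold:
                new_tokens[pos] = neighbor
    return new_tokens

Here label_keywords would be the keyword set for the sample's own label, so the substitution never touches words that indicate a different class, which is what keeps the label of the augmented sample unchanged.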
where T_i is the number of positive pairs in the mini-batch for the sample H_i. Combining the classification loss with a trade-off parameter, we obtain the final loss function:

L = L_CE + λ L_CL    (7)

where λ is a hyperparameter that controls the relative importance of the cross entropy loss and the supervised contrastive loss.

IV. EXPERIMENTS
A. EXPERIMENT SETUP
1) Dataset
To validate the effectiveness and generality of our approach, we conduct experiments on four text classification datasets, including three simulated benchmark datasets based on THUCNews, AG's News, and 20NG, and the imbalanced FDCNews dataset. Since the first three datasets are balanced, we construct them so that part of the labels are imbalanced. Please see Appendix A for details of the construction.
• The THUCNews dataset¹ is a Chinese news classification dataset collected by Tsinghua University.
• The AG's News dataset² is created by Xiang Zhang [32] and contains 127,600 samples with 4 classes.
• The 20NG dataset³ (bydate version) is an English news dataset that contains 18,846 documents evenly categorized into 20 different categories.
• The FDCNews dataset⁴ is provided by Fudan University and contains 9,833 Chinese news articles categorized into 20 different classes. It is an imbalanced dataset and does not need to be constructed.

2) Encoder Model
Our method is proposed as an enhancement for current mainstream models and focuses on the influence of contrastive learning on the encoded vectors. Therefore, we select only two widely used model structures, the CNN model [3] and the pre-trained model BERT [7], as our encoder models.

3) Components for Comparison
In order to explore the effects of contrastive learning and our proposed components for data imbalance in text classification, we conduct the following combined comparative tests.
• LIC denotes the use of our proposed data augmentation method, the Label-indicative Component (LIC), to generate new samples for training.
• CL represents adding the Contrastive Loss (CL) to the original training task.
• LIC+CL means adding contrastive learning to the new dataset generated by data augmentation.
• LIC+HNM+CL is based on LIC+CL, adding the Hard Negative Mixing method (HNM), which is also our overall framework for constructing contrastive samples.

4) Evaluation Metrics
We use two metrics to measure the performance of all the approaches, namely accuracy and Macro-F1 [33]. Macro-F1 is the average F1-score over labels, which is suitable for evaluating the minority classes.

5) Settings
For constructing contrastive samples, we use Sogou News (word 300d)⁵ to look up similar words for Chinese tasks, and GloVe embeddings⁶ for English tasks. For CNN, we use filters with sizes [2, 3, 4], the number of filters for each convolution block is 128, and the embedding size is 300. For BERT, we use BERT-Base-Cased [7]⁷ for AG's News and 20NG, and RoBERTa_zh_L12 [34]⁸ for THUCNews and FDCNews. In the BERT models, we obtain text representations from the BERT model and then use a dense layer with 256 units to reduce the dimension of the text representation to 256. We tune some parameters of our models by grid search on the validation dataset; we finally set λ to 0.1, the number of synthetic hard negative samples s = 10, and the number of closest negatives K = 5. All models are optimized by Adam [35] with an initial learning rate of 0.001 and a batch size of 64. All of the models are implemented using PyTorch and trained on a GeForce GTX 1070 Ti GPU. More parameter settings are in Appendix B.

B. EXPERIMENTAL RESULTS
1) Test Performance
Table 2 presents the test accuracy and Macro-F1 on all datasets.

¹http://thuctc.thunlp.org
²https://di.unipi.it
³https://www.cs.umb.edu/~smimarog/textmining/datasets
⁴http://www.nlpir.org
⁵https://github.com/Embedding/Chinese-Word-Vectors
⁶https://github.com/stanfordnlp/GloVe
⁷https://github.com/google-research/bert
⁸https://github.com/brightmart/roberta_zh
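To make Eq. (7) concrete, the sketch below shows one way to combine the cross entropy loss with an in-batch supervised contrastive loss in PyTorch. The particular form of L_CL used here (positives are the other same-label samples in the mini-batch, averaged over the T_i positive pairs, with a temperature tau) is an assumption for illustration; the paper defines its own contrastive loss before Eq. (7), and the temperature value is not given in this excerpt. Only λ = 0.1 follows the reported settings.

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(feats, labels, tau=0.1):
    # feats: (B, d) L2-normalized representations, labels: (B,) class ids.
    sim = feats @ feats.t() / tau                          # pairwise scaled similarities
    self_mask = torch.eye(feats.size(0), dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(self_mask, -1e9)                 # drop self-comparisons
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    t_i = pos.sum(dim=1).clamp(min=1)                      # T_i positive pairs per sample
    return -((log_prob * pos).sum(dim=1) / t_i).mean()

def total_loss(logits, feats, labels, lam=0.1):
    # Eq. (7): L = L_CE + lambda * L_CL, with lambda = 0.1 as in the Settings.
    return F.cross_entropy(logits, labels) + lam * supervised_contrastive_loss(feats, labels)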
FIGURE 3. Cosine similarity matrix (a) and corresponding t-SNE visualization (b) of sample representations from the 20NG dataset. motocycles-Aug is the constructed sample, forming a positive pair with motocycles, while the others are negative samples. The hard negative mixing points are the synthesized samples.
FIGURE 4. Hyper-parameter analysis of our model: (a) the effect of λ in the loss function and (b) the effect of batch size on contrastive learning. Here CNN_our denotes CNN+LIC+HNM+CL.
We can observe that the proposed constructing contrastive samples method (LIC+HNM+CL) outperforms the base models and the other ablation models on most datasets, which shows the effectiveness of constructing contrastive samples for the imbalanced text classification task. The results of the Encoder (CNN or BERT)+LIC models show that generating positive samples for the minority class can improve the accuracy and Macro-F1 of the classification task by up to 1%. The results of the Encoder+CL and Encoder+LIC+CL models show that adding the contrastive loss alone does not lead to improvement on the AG's News and FDCNews datasets, but adding the contrastive loss together with constructed samples surpasses the baselines on each dataset, which illustrates that constructing contrastive samples can improve the feature representation ability of contrastive learning. The Encoder+LIC+HNM+CL models achieve most of the best results, which shows the effectiveness of the hard negative mixing strategy.
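For reference, the accuracy and Macro-F1 reported throughout the experiments correspond to the standard definitions; a minimal scikit-learn sketch, assuming y_true and y_pred are arrays of label ids for a test set:

from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    # Macro-F1 averages per-label F1 scores without class weighting,
    # so minority classes contribute as much as majority classes.
    return {"accuracy": accuracy_score(y_true, y_pred),
            "macro_f1": f1_score(y_true, y_pred, average="macro")}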
The biggest improvement was achieved by the BERT+LIC+HNM+CL model on the 20NG dataset, with a 3.21% increase in test accuracy and a 3.93% increase in test Macro-F1. The natural confusion of the labels in the 20NG dataset sheds light on why our model performs so well: labels in the same group are highly similar, and contrastive learning together with the hard negative mixing method can more effectively improve the model's ability to learn discriminative features.

2) Constructing Contrastive Samples
We further visualize the learned sample representations on the 20NG dataset, as shown in Figure 3. We can observe that the constructed positive sample is highly similar to the original sample, and the synthesized hard negative samples are also very similar, as shown in Figure 3a. From Figure 3b, we find that the synthesized negative samples are the closest to the positive samples in the semantic space. This shows that our Label-indicative Component can guarantee the quality of the constructed positive samples, and the hard negative mixing strategy can make the model focus more on hard negative samples.

C. ABLATION STUDY
We investigate how different values of λ and different batch sizes affect our models' performance. All results use the CNN_our model, evaluated on the development set of the 20NG dataset. In Appendix C, we explore the number of hard negative samples.

1) The effect of λ
λ is a controlling hyper-parameter that decides the relative importance of the cross entropy loss and the supervised contrastive loss. A larger λ gives contrastive learning more weight when training the classification task. Figure 4a shows the accuracy curve for different λ on the 20NG dataset. We can see from the graph that if λ is too large, the model becomes too biased towards contrastive learning and the classification performance drops. If it is too small, the effect is similar to the result of λ = 0. Therefore, we set λ to 0.1 in our experiments.

2) The effect of batch size
In addition, we explore the impact of batch size. The reason is that [21] found that the larger the batch size, the more negative samples, and the better the contrastive learning performance. However, we find that this does not hold in our experiments. As shown in Figure 4b, the experimental results for different batch sizes are not very different. When the batch size is too large, for example batch_size = 256 in this case, the effect is slightly reduced. We think the reason is that a larger batch size also increases the number of positive samples, so the ratio of negative to positive samples does not increase.

D. ANALYSIS
In this section, we further conduct analyses to understand the inner workings of our method.

1) Data imbalance
Table 3 shows the accuracy of each class on the THUCNews test dataset, including the majority class accuracy and the minority class accuracy. We observe that our proposed method brings a more obvious improvement on the minority classes, especially for those labels with particularly poor classification performance, which indicates that our method can alleviate distributional skews and is well suited for imbalanced text classification tasks. In addition, we also find that our model has slight negative effects on some labels. This shows that, for some labels, the benefit of contrastive learning with data augmentation is outweighed by the impact of the introduced noise. Therefore, how to further improve the quality of data augmentation is still worthy of future research.

2) Data augmentation
To explore the impact of different data augmentations on contrastive learning, we compare 5 methods under two processing ratios, train each method 5 times, and report the mean ± standard deviation; the results are shown in Table 4. We can see that our method LIC (semantic replacement of label-indicative words) has the most stable performance under different processing ratios. Furthermore, each data augmentation produces a certain amount of noise; the Randomly Delete method has the largest fluctuation, and the Delete Label-indicative Words method has the largest negative effect. These results also indicate that, under supervised learning, how data augmentation is performed around label-indicative words is crucial. In Appendix D, we further show the impact of these data augmentations under more processing ratios.

3) Contrastive learning on text classification
To directly show the strengths of our approach for data imbalance, we compare the performance of contrastive learning on balanced text classification in Table 5. We can see
[17] Sora Ohashi, Junya Takayama, Tomoyuki Kajiwara, Chenhui Chu, and Yuki Arase, "Text classification with negative supervision," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, July 2020, pp. 351–357, Association for Computational Linguistics.
[18] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, "A simple framework for contrastive learning of visual representations," in International Conference on Machine Learning. PMLR, 2020, pp. 1597–1607.
[19] Zhuofeng Wu, Sinong Wang, Jiatao Gu, Madian Khabsa, Fei Sun, and Hao Ma, "CLEAR: Contrastive learning for sentence representation," arXiv preprint arXiv:2012.15466, 2020.
[20] Lin Pan, Chung-Wei Hang, Avirup Sil, and Saloni Potdar, "Improved text classification via contrastive adversarial training," in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, vol. 36, pp. 11130–11138.
[21] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick, "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[22] Yannis Kalantidis, Mert Bulent Sariyildiz, Noe Pion, Philippe Weinzaepfel, and Diane Larlus, "Hard negative mixing for contrastive learning," Advances in Neural Information Processing Systems, vol. 33, pp. 21798–21809, 2020.
[23] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, "Representation learning with contrastive predictive coding," arXiv preprint arXiv:1807.03748, 2018.
[24] Yonglong Tian, Dilip Krishnan, and Phillip Isola, "Contrastive multiview coding," in Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI. Springer, 2020, pp. 776–794.
[25] Ishan Misra and Laurens van der Maaten, "Self-supervised learning of pretext-invariant representations," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6707–6717.
[26] Tianyu Gao, Xingcheng Yao, and Danqi Chen, "SimCSE: Simple contrastive learning of sentence embeddings," in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, Nov. 2021, pp. 6894–6910, Association for Computational Linguistics.
[27] Tian Shi, Liuqing Li, Ping Wang, and Chandan K. Reddy, "A simple and effective self-supervised contrastive learning framework for aspect detection," in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, pp. 13815–13824.
[28] Haotian Fu, Hongyao Tang, Jianye Hao, Chen Chen, Xidong Feng, Dong Li, and Wulong Liu, "Towards effective context for meta-reinforcement learning: An approach based on contrastive learning," in Proceedings of the AAAI Conference on Artificial Intelligence, 2021, vol. 35, pp. 7457–7465.
[29] Hongchao Fang, Sicheng Wang, Meng Zhou, Jiayuan Ding, and Pengtao Xie, "CERT: Contrastive self-supervised learning for language understanding," arXiv preprint arXiv:2005.12766, 2020.
[30] Paula Branco, Luís Torgo, and Rita P. Ribeiro, "A survey of predictive modeling on imbalanced domains," ACM Computing Surveys (CSUR), vol. 49, no. 2, pp. 1–50, 2016.
[31] Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
[32] Xiang Zhang, Junbo Zhao, and Yann LeCun, "Character-level convolutional networks for text classification," Advances in Neural Information Processing Systems, vol. 28, 2015.
[33] Siddharth Gopal and Yiming Yang, "Recursive regularization for large-scale classification with hierarchical and graphical dependencies," in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013, pp. 257–265.
[34] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, "RoBERTa: A robustly optimized BERT pretraining approach," 2019.
[35] Diederik Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, Dec. 2014.

XI CHEN received the Ph.D. degree in mechanical engineering from Zhejiang University, China, in 2005. After graduation, he focused on data analytics and software development in industry for a number of years, and he is currently a researcher at the Advanced Institute of Information Technology, Peking University, Hangzhou, China. His current research interests include natural language processing, computer vision, and the data mining of large data sets.

WEI ZHANG is a research associate. His research interests include software engineering and the modeling of big data algorithms in the industrial field.

SHUAI PAN received the M.S. degree from the University of Edinburgh, Scotland, U.K., in 2020. His current research interests include natural language processing, sentiment analysis, and natural language generation.

JIAYIN CHEN received the M.S. degree from Dalian University of Technology in 2016. He is currently working as an algorithm engineer at Hithink Flush Information Network Co., Ltd., China. His current research interests include natural language processing, information extraction, knowledge graphs, and question answering.