DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings

Yung-Sung Chuang† Rumen Dangovski† Hongyin Luo† Yang Zhang‡ Shiyu Chang∗
Marin SoljačiㆠShang-Wen Li Wen-tau Yih Yoon Kim† James Glass†
Massachusetts Institute of Technology†  Meta AI  MIT-IBM Watson AI Lab‡  UC Santa Barbara∗
yungsung@[Link]

Abstract

We propose DiffCSE, an unsupervised contrastive learning framework for learning sentence embeddings. DiffCSE learns sentence embeddings that are sensitive to the difference between the original sentence and an edited sentence, where the edited sentence is obtained by stochastically masking out the original sentence and then sampling from a masked language model. We show that DiffCSE is an instance of equivariant contrastive learning (Dangovski et al., 2021), which generalizes contrastive learning and learns representations that are insensitive to certain types of augmentations and sensitive to other "harmful" types of augmentations. Our experiments show that DiffCSE achieves state-of-the-art results among unsupervised sentence representation learning methods, outperforming unsupervised SimCSE1 by 2.3 absolute points on semantic textual similarity tasks.2

1 Introduction

Learning "universal" sentence representations that capture rich semantic information and are at the same time performant across a wide range of downstream NLP tasks without task-specific finetuning is an important open issue in the field (Conneau et al., 2017; Cer et al., 2018; Kiros et al., 2015; Logeswaran and Lee, 2018; Giorgi et al., 2020; Yan et al., 2021; Gao et al., 2021). Recent work has shown that finetuning pretrained language models with contrastive learning makes it possible to learn good sentence embeddings without any labeled data (Giorgi et al., 2020; Yan et al., 2021; Gao et al., 2021). Contrastive learning uses multiple augmentations of a single datum to construct positive pairs whose representations are trained to be more similar to one another than negative pairs.

While different data augmentations (random cropping, color jitter, rotations, etc.) have been found to be crucial for pretraining vision models (Chen et al., 2020), such augmentations have generally been unsuccessful when applied to contrastive learning of sentence embeddings. Indeed, Gao et al. (2021) find that constructing positive pairs via a simple dropout-based augmentation works much better than more complex augmentations such as word deletions or replacements based on synonyms or masked language models. This is perhaps unsurprising in hindsight; while the training objective in contrastive learning encourages representations to be invariant to augmentation transformations, direct augmentations of the input (e.g., deletion, replacement) often change the meaning of the sentence. That is, ideal sentence embeddings should not be invariant to such transformations.

We propose to learn sentence representations that are aware of, but not necessarily invariant to, such direct surface-level augmentations. This is an instance of equivariant contrastive learning (Dangovski et al., 2021), which improves vision representation learning by using a contrastive loss on insensitive image transformations (e.g., grayscale) and a prediction loss on sensitive image transformations (e.g., rotations). We operationalize equivariant contrastive learning on sentences by using dropout-based augmentation as the insensitive transformation (as in SimCSE (Gao et al., 2021)) and MLM-based word replacement as the sensitive transformation. This results in an additional cross-entropy loss based on the difference between the original and the transformed sentence.

We conduct experiments on 7 semantic textual similarity (STS) tasks and 7 transfer tasks from SentEval (Conneau and Kiela, 2018) and find that this difference-based learning greatly improves over standard contrastive learning.

1 SimCSE has two settings: unsupervised and supervised. In this paper, we focus on the unsupervised setting. Unless otherwise stated, we use SimCSE to refer to unsupervised SimCSE.
2 Pretrained models and code are available at https://[Link]/voidism/DiffCSE.
[Figure 1 depicts the two branches: a sentence encoder trained with a contrastive loss on dropout transformations (left), and a fixed generator that rewrites a randomly masked input (e.g., "You never know what you're gonna get ." → "You [MASK] know what you're gonna [MASK] ." → "You gotta know what you're gonna do .") followed by a conditional discriminator trained with a replaced token detection loss (right; 0: original, 1: replaced).]
Figure 1: Illustration of DiffCSE. On the left-hand side is a standard SimCSE model trained with the regular contrastive loss on dropout transformations. On the right-hand side is a conditional difference prediction model which takes the sentence vector h as input and predicts the difference between x and x''. During testing we discard the discriminator and only use h as the sentence embedding.

Our DiffCSE approach achieves around a 2.3% absolute improvement on STS datasets over SimCSE, the previous state-of-the-art model. We also conduct a set of ablation studies to justify our designed architecture. A qualitative study and analysis are included to look into the embedding space of DiffCSE.

2 Background and Related Work

2.1 Learning Sentence Embeddings

Learning universal sentence embeddings has been studied extensively in prior work, including unsupervised approaches such as Skip-Thought (Kiros et al., 2015), Quick-Thought (Logeswaran and Lee, 2018) and FastSent (Hill et al., 2016), or supervised methods such as InferSent (Conneau et al., 2017), Universal Sentence Encoder (Cer et al., 2018) and Sentence-BERT (Reimers and Gurevych, 2019). Recently, researchers have focused on (unsupervised) contrastive learning approaches such as SimCLR (Chen et al., 2020) to learn sentence embeddings. SimCLR learns image representations by creating semantically close augmentations of the same images and then pulling these representations closer than the representations of random negative examples. The same framework can be adapted to learning sentence embeddings by designing good augmentation methods for natural language. ConSERT (Yan et al., 2021) uses a combination of four data augmentation strategies: adversarial attack, token shuffling, cut-off, and dropout. DeCLUTR (Giorgi et al., 2020) uses overlapped spans as positive examples and distant spans as negative examples for learning contrastive span representations. Finally, SimCSE (Gao et al., 2021) proposes an extremely simple augmentation strategy that just switches dropout masks. While simple, sentence embeddings learned in this manner have been shown to be better than those from other, more complicated augmentation methods.

2.2 Equivariant Contrastive Learning

DiffCSE is inspired by a recent generalization of contrastive learning in computer vision (CV) called equivariant contrastive learning (Dangovski et al., 2021). We now explain how this CV technique can be adapted to natural language.

Understanding the role of input transformations is crucial for successful contrastive learning. Past empirical studies have revealed useful transformations for contrastive learning, such as random resized cropping and color jitter for computer vision (Chen et al., 2020) and dropout for NLP (Gao et al., 2021). Contrastive learning encourages representations to be insensitive to these transformations, i.e., the encoder is trained to be invariant to a set of manually chosen transformations. The above studies in CV and NLP have also revealed transformations that are harmful for contrastive learning. For example, Chen et al. (2020) showed that making the representations insensitive to rotations decreases the ImageNet linear probe accuracy, and Gao et al. (2021) showed that using an MLM to replace 15% of the words drastically reduces performance on STS-B.
While previous works simply omit these transformations from contrastive pre-training, here we argue that we should still make use of these transformations by learning representations that are sensitive (but not necessarily invariant) to such transformations.

The notion of (in)sensitivity can be captured by the more general property of equivariance in mathematics. Let T be a transformation from a group G and let T(x) denote the transformation of a sentence x. Equivariance is the property that there is an induced group transformation T' on the output features (Dangovski et al., 2021):

f(T(x)) = T'(f(x)).

In the special case of contrastive learning, the target of T' is the identity transformation, and we say that f is trained to be "invariant to T." However, invariance is just a trivial case of equivariance, and we can design training objectives where T' is not the identity for some transformations (such as MLM), while it is the identity for others (such as dropout). Dangovski et al. (2021) show that generalizing contrastive learning to equivariance in this way improves the semantic quality of features in CV, and here we show that the complementary nature of invariance and equivariance extends to the NLP domain. The key observation is that the encoder should be equivariant to MLM-based augmentation instead of being invariant to it. We can operationalize this by using a conditional discriminator that combines the sentence representation with an edited sentence and then predicts the difference between the original and edited sentences. This is essentially a conditional version of the ELECTRA model (Clark et al., 2020), which makes the encoder equivariant to MLM by using a binary discriminator that detects whether a token comes from the original sentence or from a generator. We hypothesize that conditioning the ELECTRA model on the representation from our sentence encoder is a useful objective for encouraging f to be "equivariant to MLM."

To the best of our knowledge, we are the first to observe and highlight the above parallel between CV and NLP. In particular, we show that equivariant contrastive learning extends beyond CV, and that it works for transformations even without algebraic structures, such as diff operations on sentences. Further, insofar as the canonical set of useful transformations is less established in NLP than it is in CV, DiffCSE can serve as a diagnostic tool for NLP researchers to discover useful transformations.
3 Difference-based Contrastive Learning

Our approach is straightforward and can be seen as combining the standard contrastive learning objective from SimCSE (Figure 1, left) with a difference prediction objective which conditions on the sentence embedding (Figure 1, right).

Given an unlabeled input sentence x, SimCSE creates a positive example x+ for it by applying different dropout masks. Using the BERTbase encoder f, we obtain the sentence embedding h = f(x) for x (see Section 4 for how h is obtained). The training objective for SimCSE is:

\mathcal{L}_{\mathrm{contrast}} = -\log \frac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j)/\tau}},

where N is the batch size of the input batch {x_i}_{i=1}^{N} (we use in-batch negative examples), sim(·, ·) is the cosine similarity function, and τ is a temperature hyperparameter.
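To make the objective concrete, the following is a minimal PyTorch sketch of this in-batch contrastive loss. The function name and the assumption that the two views come from two forward passes with different dropout masks are illustrative; this is not the released DiffCSE code.

import torch
import torch.nn.functional as F

def simcse_contrastive_loss(h1, h2, temperature=0.05):
    # h1, h2: (N, d) embeddings of the same N sentences under two dropout masks.
    h1 = F.normalize(h1, dim=-1)
    h2 = F.normalize(h2, dim=-1)
    # Cosine-similarity matrix; entry (i, j) compares sentence i with sentence j.
    sim = h1 @ h2.t() / temperature
    # The positive for sentence i sits on the diagonal; all other in-batch
    # sentences act as negatives, so the loss is a row-wise cross-entropy.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)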
On the right-hand side of Figure 1 is a conditional version of the difference prediction objective used in ELECTRA (Clark et al., 2020), which contains a generator and a discriminator. Given a sentence of length T, x = [x^(1), x^(2), ..., x^(T)], we first apply a random mask m = [m^(1), m^(2), ..., m^(T)], m^(t) ∈ {0, 1}, on x to obtain x' = m · x. We use another pretrained MLM as the generator G to perform masked language modeling, recovering the randomly masked tokens in x' to obtain the edited sentence x'' = G(x').
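As an illustration of this generator step, here is a hedged sketch using Hugging Face Transformers. The checkpoint name is an example, and the greedy argmax decoding is a stand-in for sampling from the MLM; these details are assumptions of the sketch rather than specifics fixed by the paper (the masking ratio is tuned in Section 5).

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
generator = AutoModelForMaskedLM.from_pretrained("distilbert-base-uncased").eval()

@torch.no_grad()
def make_edited_sentences(input_ids, mask_ratio=0.30):
    # input_ids: (N, T) token ids of the original sentences x.
    masked = input_ids.clone()
    probs = torch.full(masked.shape, mask_ratio)
    # Never mask special tokens ([CLS], [SEP], padding).
    special = (
        (masked == tokenizer.cls_token_id)
        | (masked == tokenizer.sep_token_id)
        | (masked == tokenizer.pad_token_id)
    )
    probs.masked_fill_(special, 0.0)
    mask = torch.bernoulli(probs).bool()
    masked[mask] = tokenizer.mask_token_id          # x' = m * x
    # The fixed generator G fills the masked slots; argmax decoding is used
    # here in place of sampling from the MLM distribution.
    logits = generator(input_ids=masked).logits
    edited = input_ids.clone()
    edited[mask] = logits.argmax(dim=-1)[mask]      # x'' = G(x')
    return edited, mask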
Then, we use a discriminator D to perform the Replaced Token Detection (RTD) task: for each token in the sentence, the model needs to predict whether it has been replaced or not. The cross-entropy loss for a single sentence x is:

\mathcal{L}^{x}_{\mathrm{RTD}} = \sum_{t=1}^{T} \Big( -\mathbb{1}\big(x''^{(t)} = x^{(t)}\big) \log D\big(x'', h, t\big) - \mathbb{1}\big(x''^{(t)} \neq x^{(t)}\big) \log \big(1 - D\big(x'', h, t\big)\big) \Big),

and the training objective for a batch is \mathcal{L}_{\mathrm{RTD}} = \sum_{i=1}^{N} \mathcal{L}^{x_i}_{\mathrm{RTD}}. Finally, we optimize the two losses together with a weighting coefficient λ:

\mathcal{L} = \mathcal{L}_{\mathrm{contrast}} + \lambda \cdot \mathcal{L}_{\mathrm{RTD}}.

The difference between our model and ELECTRA is that our discriminator D is conditional, so it can use the information of x compressed into the fixed-dimension vector h = f(x). The gradient of D can be back-propagated into f through h. By doing so, f is encouraged to make h informative enough to cover the full meaning of x, so that D can distinguish the tiny difference between x and x''. This approach essentially makes the conditional discriminator perform a "diff operation", hence the name DiffCSE.

When we train our DiffCSE model, we fix the generator G; only the sentence encoder f and the discriminator D are optimized. After training, we discard D and only use f (which remains fixed) to extract sentence embeddings for evaluation on the downstream tasks.
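A minimal sketch of the conditional discriminator and the combined objective is given below. How h conditions the discriminator is an implementation choice; here it is simply added to every token state, which is an assumption of this sketch rather than the exact mechanism of the released code, and the class and function names are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalDiscriminator(nn.Module):
    def __init__(self, lm_encoder, hidden_size):
        super().__init__()
        self.lm = lm_encoder                   # a BERT-style encoder returning last_hidden_state
        self.cond = nn.Linear(hidden_size, hidden_size)
        self.head = nn.Linear(hidden_size, 1)  # per-token logit of "original vs. replaced"

    def forward(self, edited_ids, attention_mask, h):
        tok = self.lm(input_ids=edited_ids, attention_mask=attention_mask).last_hidden_state
        tok = tok + self.cond(h).unsqueeze(1)  # condition every position on h = f(x)
        return self.head(tok).squeeze(-1)      # (N, T)

def diffcse_loss(l_contrast, logits, original_ids, edited_ids, attention_mask, lam=0.005):
    # Matching the RTD loss above, the target is 1 where the token is unchanged.
    target = (edited_ids == original_ids).float()
    per_token = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    l_rtd = (per_token * attention_mask.float()).sum()
    return l_contrast + lam * l_rtd

At test time the discriminator is discarded and only h is used, as described above.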
4 Experiments

4.1 Setup

In our experiments, we follow the setting of unsupervised SimCSE (Gao et al., 2021) and build our model on their PyTorch implementation.3 We also use the checkpoints of BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) as the initialization of our sentence encoder f. We add an MLP layer with Batch Normalization (Ioffe and Szegedy, 2015) (BatchNorm) on top of the [CLS] representation as the sentence embedding, and we compare the model with and without BatchNorm in Section 5. For the discriminator D, we use the same model as the sentence encoder f (BERT/RoBERTa). For the generator G, we use the smaller DistilBERT and DistilRoBERTa (Sanh et al., 2019) for efficiency. Note that the generator is fixed during training, unlike in the ELECTRA paper (Clark et al., 2020). We compare the results of using different generator sizes in Section 5. More training details are given in Appendix A.
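For reference, a sketch of how the sentence embedding h might be extracted at evaluation time, assuming a Hugging Face BERT encoder and the [CLS] pooling described above; the checkpoint name is an example, and the training-time projector (shown in Appendix A) is discarded at test time.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def embed(sentences):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    out = encoder(**batch)
    return out.last_hidden_state[:, 0]  # [CLS] vector as h = f(x)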
4.2 Data

For unsupervised pretraining, we use the same 10^6 randomly sampled sentences from English Wikipedia that are provided by the source code of SimCSE.3 We evaluate our model on 7 semantic textual similarity (STS) tasks and 7 transfer tasks in SentEval.4 The STS tasks include STS 2012–2016 (Agirre et al., 2016), STS Benchmark (Cer et al., 2017) and SICK-Relatedness (Marelli et al., 2014). All the STS experiments are fully unsupervised, which means no STS training datasets are used and all embeddings are fixed once they are trained. The transfer tasks are various sentence classification tasks, including MR (Pang and Lee, 2005), CR (Hu and Liu, 2004), SUBJ (Pang and Lee, 2004), MPQA (Wiebe et al., 2005), SST-2 (Socher et al., 2013), TREC (Voorhees and Tice, 2000) and MRPC (Dolan and Brockett, 2005). In these transfer tasks, we use a logistic regression classifier trained on top of the frozen sentence embeddings, following the standard setup (Conneau and Kiela, 2018).

4.3 Results

Baselines We compare our model with many strong unsupervised baselines, including SimCSE (Gao et al., 2021), IS-BERT (Zhang et al., 2020), CMLM (Yang et al., 2020), DeCLUTR (Giorgi et al., 2020), CT-BERT (Carlsson et al., 2021), SG-OPT (Kim et al., 2021), post-processing methods like BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021), and naive baselines like averaged GloVe embeddings (Pennington et al., 2014) and averaged first- and last-layer BERT embeddings.

Semantic Textual Similarity (STS) We show the results on the STS tasks in Table 1 for BERTbase (upper part) and RoBERTabase (lower part). We also reproduce the previous state-of-the-art SimCSE (Gao et al., 2021). DiffCSE-BERTbase significantly outperforms SimCSE-BERTbase, raising the averaged Spearman's correlation from 76.25% to 78.49%. For the RoBERTa model, DiffCSE-RoBERTabase also improves upon SimCSE-RoBERTabase, from 76.57% to 77.80%.

Transfer Tasks We show the results on the transfer tasks in Table 2. Compared with SimCSE-BERTbase, DiffCSE-BERTbase improves the averaged score from 85.56% to 86.86%. For the RoBERTa model, DiffCSE-RoBERTabase also improves upon SimCSE-RoBERTabase, from 84.84% to 87.04%. Note that CMLM-BERTbase (Yang et al., 2020) achieves even better performance than DiffCSE; however, it uses 1TB of training data from Common Crawl dumps, while our model only uses 115MB of Wikipedia data for pretraining. We include their scores in Table 2 for reference.

3 [Link]
4 [Link]
Model STS12 STS13 STS14 STS15 STS16 STS-B SICK-R Avg.

GloVe embeddings (avg.) 55.14 70.66 59.73 68.25 63.66 58.02 53.76 61.32
BERTbase (first-last avg.)♦ 39.70 59.38 49.67 66.03 66.19 53.87 62.06 56.70
BERTbase -flow♦ 58.40 67.10 60.85 75.16 71.22 68.66 64.47 66.55
BERTbase -whitening♦ 57.83 66.90 60.90 75.08 71.31 68.24 63.73 66.28
IS-BERTbase ♥ 56.77 69.24 61.21 75.23 70.16 69.21 64.25 66.58
CMLM-BERTbase ♠ (1TB data) 58.20 61.07 61.67 73.32 74.88 76.60 64.80 67.22
CT-BERTbase ♦ 61.63 76.80 68.47 77.50 76.48 74.31 69.19 72.05
SG-OPT-BERTbase † 66.84 80.13 71.23 81.56 77.17 77.23 68.16 74.62
SimCSE-BERTbase ♦ 68.40 82.41 74.38 80.91 78.56 76.85 72.23 76.25
∗ SimCSE-BERTbase (reproduce) 70.82 82.24 73.25 81.38 77.06 77.24 71.16 76.16
∗ DiffCSE-BERTbase 72.28 84.43 76.47 83.90 80.54 80.59 71.23 78.49
RoBERTabase (first-last avg.)♦ 40.88 58.74 49.07 65.63 61.48 58.55 61.63 56.57
RoBERTabase -whitening♦ 46.99 63.24 57.23 71.36 68.99 61.36 62.91 61.73
DeCLUTR-RoBERTabase ♦ 52.41 75.19 65.52 77.12 78.63 72.41 68.62 69.99
SimCSE-RoBERTabase ♦ 70.16 81.77 73.24 81.36 80.65 80.22 68.56 76.57
∗ SimCSE-RoBERTabase (reproduce) 68.60 81.36 73.16 81.61 80.76 80.58 68.83 76.41
∗ DiffCSE-RoBERTabase 70.05 83.43 75.49 82.81 82.12 82.38 71.19 78.21

Table 1: The performance on STS tasks (Spearman’s correlation) for different sentence embedding models. ♣:
results from Reimers and Gurevych (2019); ♥: results from Zhang et al. (2020); ♦: results from Gao et al. (2021);
♠: results from Yang et al. (2020); †: results from Kim et al. (2021); ∗: results from our experiments.

In SimCSE, the authors propose to use MLM as an auxiliary task for the sentence encoder to further boost performance on transfer tasks. Compared with the results of SimCSE with MLM, DiffCSE still gives a small improvement of around 0.2%.

5 Ablation Studies

In the following sections, we perform an extensive series of ablation studies that support our model design. We use the BERTbase model and evaluate on the development sets of STS-B and the transfer tasks.

Removing Contrastive Loss In our model, both the contrastive loss and the RTD loss are crucial, because they maintain what should be insensitive and what should be sensitive, respectively. If we remove the RTD loss, the model becomes a SimCSE model; if we remove the contrastive loss, the performance on STS-B drops significantly by 30%, while the average score on the transfer tasks also drops by 2% (see Table 3). This result shows that it is important to have insensitive and sensitive attributes coexisting in the representation space.

Next Sentence vs. Same Sentence Some methods for unsupervised sentence embeddings, like Quick-Thoughts (Logeswaran and Lee, 2018) and CMLM (Yang et al., 2020), predict the next sentence as the training objective. We also experiment with a variant of DiffCSE that conditions the ELECTRA loss on the next sentence. Note that this kind of model does not perform a "diff operation" between two similar sentences, and is not an instance of equivariant contrastive learning. As shown in Table 3 (use next sent. for x'), the STS-B score decreases significantly compared to DiffCSE while transfer performance remains similar. We also tried using the same sentence and the next sentence at the same time for conditioning the ELECTRA objective (use same+next sent. for x'), and did not observe improvements.

Other Conditional Pretraining Tasks Instead of a conditional binary difference prediction loss, we can also consider other conditional pretraining tasks, such as the conditional MLM objective proposed by Yang et al. (2020) or corrective language modeling,5 proposed by COCO-LM (Meng et al., 2021). We experiment with these objectives in place of the difference prediction objective in Table 3. We observe that conditional MLM on the same sentence does not improve performance on either STS-B or the transfer tasks compared with DiffCSE. Conditional MLM on the next sentence performs even worse on STS-B, but slightly better than using the same sentence on the transfer tasks. Using both the same and the next sentence also does not improve performance compared with DiffCSE. For the corrective LM objective, the performance on STS-B decreases significantly compared with DiffCSE.

5 This task is similar to ELECTRA. However, instead of a binary classifier for replaced token detection, corrective LM uses a vocabulary-size classifier with a copy mechanism to recover the replaced tokens.
6 Edit distance operators include insert, delete and replace.
Model MR CR SUBJ MPQA SST TREC MRPC Avg.

GloVe embeddings (avg.) 77.25 78.30 91.17 87.85 80.18 83.00 72.87 81.52
Skip-thought♥ 76.50 80.10 93.60 87.10 82.00 92.20 73.00 83.50
Avg. BERT embeddings♣ 78.66 86.25 94.37 88.66 84.40 92.80 69.54 84.94
BERT-[CLS]embedding♣ 78.68 84.85 94.21 88.23 84.13 91.40 71.13 84.66
IS-BERTbase ♥ 81.09 87.18 94.96 88.75 85.96 88.64 74.24 85.83
SimCSE-BERTbase ♦ 81.18 86.46 94.45 88.88 85.50 89.80 74.43 85.81
w/ MLM 82.92 87.23 95.71 88.73 86.81 87.01 78.07 86.64
∗ DiffCSE-BERTbase 82.69 87.23 95.23 89.28 86.60 90.40 76.58 86.86
CMLM-BERTbase (1TB data) 83.60 89.90 96.20 89.30 88.50 91.00 69.70 86.89

SimCSE-RoBERTabase 81.04 87.74 93.28 86.94 86.60 84.60 73.68 84.84
w/ MLM 83.37 87.76 95.05 87.16 89.02 90.80 75.13 86.90
∗ DiffCSE-RoBERTabase 82.82 88.61 94.32 87.71 88.63 90.40 76.81 87.04

Table 2: Transfer task results of different sentence embedding models (measured as accuracy). ♣: results from
Reimers and Gurevych (2019); ♥: results from Zhang et al. (2020); ♦: results from Gao et al. (2021).

Augmentation Methods: Insert/Delete/Replace In DiffCSE, we use MLM token replacement as the equivariant augmentation. It is possible to use other methods like random insertion or deletion instead of replacement.6 For insertion, we randomly insert [MASK] tokens into the sentence and then use a generator to convert the mask tokens into real tokens; the number of inserted masked tokens is 15% of the sentence length, and the task is to predict whether each token is an inserted token or an original token. For deletion, we randomly delete 15% of the tokens in the sentence, and the task is to predict, for each token, whether the token preceding it has been deleted or not. The results are shown in Table 4. We can see that using either insertion or deletion achieves slightly worse STS-B performance than using MLM replacement. For the transfer tasks, their results are similar. Finally, we find that combining all three augmentations in the training process does not improve over the MLM replacement strategy.

Model STS-B Avg. transfer
SimCSE 81.47 83.91
DiffCSE 84.56 85.95
w/o contrastive loss 54.48 83.46
use next sent. for x' 82.91 85.83
use same+next sent. for x' 83.41 85.82
Conditional MLM
for same sent. 83.08 84.43
for next sent. 75.82 85.68
for same+next sent. 82.88 84.82
Conditional Corrective LM 79.79 85.30

Table 3: Development set results of STS-B and transfer tasks for DiffCSE model variants, where we vary the objective and the use of the same or next sentence.

Augmentation STS-B Avg. transfer
MLM 15% 84.48 85.95
randomly insert 15% 82.20 85.96
randomly delete 15% 82.59 85.97
combining all 82.80 85.92

Table 4: Development set results of STS-B and transfer tasks with different augmentation methods for learning equivariance.

Pooler Choice In SimCSE, the authors use the pooler from BERT's original implementation (one linear layer with a tanh activation function) as the final layer to extract features for computing the contrastive loss. In our implementation (see details in Appendix A), we find that it is better to use a two-layer pooler with Batch Normalization (BatchNorm) (Ioffe and Szegedy, 2015), which is commonly used in contrastive learning frameworks in computer vision (Chen et al., 2020; Grill et al., 2020; Chen and He, 2021; Hua et al., 2021). We show the ablation results in Table 5: adding BatchNorm is beneficial for both DiffCSE and SimCSE on STS-B and the transfer tasks.

Model STS-B Avg. transfer
DiffCSE
w/ BatchNorm 84.56 85.95
w/o BatchNorm 83.23 85.24
SimCSE
w/ BatchNorm 82.22 85.66
w/o BatchNorm 81.47 83.91

Table 5: Development set results of STS-B and transfer tasks for DiffCSE and SimCSE with and without BatchNorm.

Size of the Generator In our DiffCSE model, the generator can vary in size, from BERTlarge and BERTbase (Devlin et al., 2019) to DistilBERTbase (Sanh et al., 2019), BERTmedium, BERTsmall, BERTmini, and BERTtiny (Turc et al., 2019). Their exact sizes are shown in Table 6 (L: number of layers, H: hidden dimension). Notice that although DistilBERTbase has only half the number of layers of BERT, it can retain 97% of BERT's performance due to knowledge distillation.
Model STS-B Avg. transfer
SimCSE 81.47 83.91
DiffCSE w/ generator:
BERTlarge (L=24, H=1024) 82.93 85.88
BERTbase (L=12, H=768) 83.63 85.85
DistilBERTbase (L=6, H=768) 84.56 85.95
BERTmedium (L=8, H=512) 82.25 85.80
BERTsmall (L=4, H=512) 82.64 85.66
BERTmini (L=4, H=256) 82.12 85.90
BERTtiny (L=2, H=128) 81.40 85.23

Table 6: Development set results of STS-B and transfer tasks with different generators.

We show our results in Table 6. The performance on the transfer tasks does not change much with different generators. However, the STS-B score decreases as we switch from BERT-medium to BERT-tiny. This finding differs from ELECTRA, which works best with generators 1/4 to 1/2 the size of the discriminator. Because our discriminator is conditioned on the sentence vector, the RTD task is easier for it; as a result, using stronger generators (BERTbase, DistilBERTbase) to increase the difficulty of RTD helps the discriminator learn better. However, a large model like BERTlarge may make the task too challenging for the discriminator. In our experiments, DistilBERTbase, whose ability is close to but slightly worse than BERTbase, gives us the best performance.

Ratio 15% 20% 25% 30% 40% 50%
STS-B 84.48 84.04 84.49 84.56 84.48 83.91

Table 7: Development set results of STS-B under different masking ratios for augmentations.

Masking Ratio In our conditional ELECTRA task, we can mask the original sentence at different ratios for the generator to produce MLM-based augmentations. A higher masking ratio perturbs the sentence more. Our empirical results in Table 7 show that the difference between masking ratios is small (within 15%–40%), and a masking ratio of around 30% gives the best performance.

λ 0 0.0001 0.0005 0.001
STS-B 82.22 83.90 84.40 84.24
λ 0.005 0.01 0.05 0.1
STS-B 84.56 83.44 84.11 83.66

Table 8: Development set results of STS-B under different λ.

Coefficient λ In Section 3, we use the coefficient λ to weight the ELECTRA loss before adding it to the contrastive loss. Because the contrastive learning objective is a relatively easier task, the scale of the contrastive loss is 100 to 1000 times smaller than the ELECTRA loss; as a result, we need a small λ to balance the two loss terms. Table 8 shows the STS-B results under different λ values. Note that when λ goes to zero, the model becomes a SimCSE model. We find that λ = 0.005 gives the best performance.

6 Analysis

6.1 Qualitative Study

A very common application of sentence embeddings is retrieval. Here we show some retrieval examples to qualitatively explain why DiffCSE can perform better than SimCSE. In this study, we use the 2758 sentences from the STS-B test set as the corpus and use query sentences to retrieve nearest neighbors in the sentence embedding space by computing cosine similarities. We show the retrieved top-3 examples in Table 9. The first query sentence is "you can do it, too.". The SimCSE model retrieves a very similar sentence with a slightly different meaning ("you can use it, too.") as the rank-1 answer. In contrast, DiffCSE can distinguish this tiny difference, so it retrieves the ground-truth answer at rank 1. The second query sentence is "this is not a problem". SimCSE retrieves a sentence with the opposite meaning but very similar wording, while DiffCSE retrieves the correct answer with less similar wording. We also provide a third example where both SimCSE and DiffCSE fail to retrieve the correct answer for a query sentence using double negation.

6.2 Retrieval Task

Besides the qualitative study, we also report quantitative retrieval results. Here we again use all 2758 sentences in the STS-B test set as the corpus. There are 97 positive pairs in this corpus (with a semantic similarity score of 5 out of 5 from human annotation). For each positive pair, we use one sentence to retrieve the other and check whether the other sentence appears in the top-1/5/10 ranking. The recall@1/5/10 of the retrieval task is shown in Table 10. We observe that DiffCSE outperforms SimCSE on recall@1/5/10, showing the effectiveness of using DiffCSE for the retrieval task.
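The retrieval metric is straightforward to compute from the embeddings. The following sketch (function and variable names are illustrative, not taken from the released code) measures recall@k with cosine similarity, assuming the query and corpus embeddings have already been produced by the encoder.

import torch
import torch.nn.functional as F

def recall_at_k(query_emb, corpus_emb, gold_idx, ks=(1, 5, 10)):
    # query_emb: (Q, d); corpus_emb: (C, d); gold_idx[i] is the corpus index of
    # the sentence paired with query i.
    q = F.normalize(query_emb, dim=-1)
    c = F.normalize(corpus_emb, dim=-1)
    ranks = (q @ c.t()).argsort(dim=-1, descending=True)  # corpus ids by similarity
    hits = {k: 0 for k in ks}
    for i, gold in enumerate(gold_idx):
        for k in ks:
            if gold in ranks[i, :k]:
                hits[k] += 1
    return {k: hits[k] / len(gold_idx) for k in ks}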
SimCSE-BERTbase | DiffCSE-BERTbase
Query: you can do it, too.
1) you can use it, too. | 1) yes, you can do it.
2) can you do it? | 2) you can use it, too.
3) yes, you can do it. | 3) can you do it?
Query: this is not a problem.
1) this is a big problem. | 1) i don't see why this could be a problem.
2) you have a problem. | 2) i don't see why that should be a problem.
3) i don't see why that should be a problem. | 3) this is a big problem.
Query: i think that is not a bad idea.
1) i do not think it's a good idea. | 1) i do not think it's a good idea.
2) it's not a good idea. | 2) it is not a good idea.
3) it is not a good idea. | 3) but it is not a good idea.

Table 9: Retrieved top-3 examples by SimCSE and DiffCSE from the STS-B test set.

Model/Recall @1 @5 @10
SimCSE-BERTbase 77.84 92.78 95.88
DiffCSE-BERTbase 78.87 95.36 97.42

Table 10: The retrieval results for SimCSE and DiffCSE.

6.3 Distribution of Sentence Embeddings

To look into the representation space of DiffCSE, we plot the cosine similarity distribution of sentence pairs from the STS-B test set for both SimCSE and DiffCSE in Figure 2. We observe that both SimCSE and DiffCSE assign cosine similarities consistent with human ratings. However, we also observe that, for the same human rating, DiffCSE assigns slightly higher cosine similarities than SimCSE. This phenomenon may be caused by the fact that ELECTRA and other Transformer-based pretrained LMs have the problem of squeezing the representation space, as mentioned by Meng et al. (2021). As we use the sentence embeddings as the input of ELECTRA to perform conditional ELECTRA training, the sentence embedding will inevitably be squeezed to fit the input distribution of ELECTRA. We follow prior studies (Wang and Isola, 2020; Gao et al., 2021) and use uniformity and alignment (details in Appendix C) to measure the quality of the representation space for DiffCSE and SimCSE in Table 11. Compared to averaged BERT embeddings, SimCSE has similar alignment (0.177 vs. 0.172) but better uniformity (-2.313). In contrast, DiffCSE has similar uniformity to Avg. BERT (-1.438 vs. -1.468) but much better alignment (0.097). This indicates that SimCSE and DiffCSE optimize the representation space in two different directions, and the improvement of DiffCSE may come from its better alignment.

[Figure 2: The distribution of cosine similarities from (a) SimCSE and (b) DiffCSE on the STS-B test set. Along the y-axis are 5 groups of data splits based on human ratings. The x-axis is the cosine similarity.]

Model Alignment Uniformity STS
Avg. BERTbase 0.172 -1.468 56.70
SimCSE-BERTbase 0.177 -2.313 76.16
DiffCSE-BERTbase 0.097 -1.438 78.49

Table 11: Alignment and uniformity (Wang and Isola, 2020) measured on the STS-B test set for SimCSE and DiffCSE. Smaller numbers are better. We also show the averaged STS score in the right-most column.

7 Conclusion

In this paper, we present DiffCSE, a new unsupervised sentence embedding framework that is aware of, but not invariant to, MLM-based word replacement. Empirical results on semantic textual similarity tasks and transfer tasks both show the effectiveness of DiffCSE compared to current state-of-the-art sentence embedding methods. We also conduct extensive ablation studies to examine the different modeling choices in DiffCSE. The qualitative study and the retrieval results also show that DiffCSE can produce a better embedding space for sentence retrieval. One limitation of our work is that we do not explore the supervised setting that uses human-labeled NLI datasets to further boost performance; we leave this topic for future work. We believe that our work provides researchers in the NLP community a new way to utilize augmentations for natural language and thus produce better sentence embeddings.
Acknowledgements

This research was partially supported by the Centre for Perceptual and Interactive Intelligence (CPII) Ltd under the Innovation and Technology Fund (InnoHK).

References

Eneko Agirre, Carmen Banea, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Rada Mihalcea, German Rigau, and Janyce Wiebe. 2016. SemEval-2016 task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), pages 497–511, San Diego, California. Association for Computational Linguistics.

Fredrik Carlsson, Amaru Cuba Gyllensten, Evangelia Gogoulou, Erik Ylipää Hellqvist, and Magnus Sahlgren. 2021. Semantic re-tuning with contrastive tension.

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR.

Xinlei Chen and Kaiming He. 2021. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758.

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680.

Rumen Dangovski, Li Jing, Charlotte Loh, Seungwook Han, Akash Srivastava, Brian Cheung, Pulkit Agrawal, and Marin Soljačić. 2021. Equivariant contrastive learning. arXiv preprint arXiv:2111.00899.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.

John M Giorgi, Osvald Nitski, Gary D Bader, and Bo Wang. 2020. DeCLUTR: Deep contrastive learning for unsupervised textual representations. arXiv preprint arXiv:2006.03659.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. 2020. Bootstrap your own latent - a new approach to self-supervised learning. In Advances in Neural Information Processing Systems, volume 33, pages 21271–21284. Curran Associates, Inc.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.

Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. 2021. On feature decorrelation in self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9598–9608.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.

Taeuk Kim, Kang Min Yoo, and Sang-goo Lee. 2021. Self-guided contrastive learning for BERT sentence representations. arXiv preprint arXiv:2106.07345.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors. Pages 3294–3302.

Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. 2020. On the sentence embeddings from pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9119–9130, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, pages 216–223. Reykjavik.

Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, and Xia Song. 2021. COCO-LM: Correcting and contrasting text sequences for language model pretraining. arXiv preprint arXiv:2102.08473.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. arXiv preprint cs/0409058.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou. 2021. Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.

Ellen M Voorhees and Dawn M Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207.

Tongzhou Wang and Phillip Isola. 2020. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210.

Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. 2021. ConSERT: A contrastive framework for self-supervised sentence representation transfer. arXiv preprint arXiv:2105.11741.

Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, and Eric Darve. 2020. Universal sentence representation learning with conditional masked language model. arXiv preprint arXiv:2012.14388.

Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An unsupervised sentence embedding method by mutual information maximization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1601–1610.
A Training Details

We use a single NVIDIA 2080Ti GPU for each experiment. The averaged running time for DiffCSE is 3-6 hours. We grid-search over batch size ∈ {64, 128}, learning rate ∈ {2e-6, 3e-6, 5e-6, 7e-6, 1e-5}, masking ratio ∈ {0.15, 0.20, 0.30, 0.40}, and λ ∈ {0.1, 0.05, 0.01, 0.005, 0.001}. The temperature τ in SimCSE is set to 0.05 for all experiments. During training, we save the checkpoint with the highest score on the STS-B development set. We then use the STS-B development set to find the best hyperparameters for the STS tasks (listed in Table 12), and the averaged score over the development sets of the 7 transfer tasks to find the best hyperparameters for the transfer tasks (listed in Table 13). All numbers in Table 1 and Table 2 are from a single run.

hyperparam BERTbase RoBERTabase
learning rate 7e-6 1e-5
masking ratio 0.30 0.20
λ 0.005 0.005
training epochs 2 2
batch size 64 64

Table 12: The main hyperparameters for the STS tasks.

hyperparam BERTbase RoBERTabase
learning rate 2e-6 3e-6
masking ratio 0.15 0.15
λ 0.05 0.05
training epochs 2 2
batch size 64 128

Table 13: The main hyperparameters for the transfer tasks.

During testing, we follow SimCSE and discard the MLP projector, using only the [CLS] output to extract sentence embeddings.

The numbers of model parameters for BERTbase and RoBERTabase are listed in Table 14. Note that at training time DiffCSE needs two BERT models working together (sentence encoder + discriminator), but at test time we only need the sentence encoder, so the model size is the same as the SimCSE model.

Method BERTbase RoBERTabase
SimCSE 110M 125M
DiffCSE (train) 220M 250M
DiffCSE (test) 110M 125M

Table 14: The number of parameters used in our models.

Projector with BatchNorm In Section 5, we mention that we use a projector with BatchNorm as the final layer of our model. Here we provide the PyTorch code for its structure:

import torch.nn as nn

class ProjectionMLP(nn.Module):
    def __init__(self, hidden_size):
        super().__init__()
        in_dim = hidden_size
        middle_dim = hidden_size * 2
        out_dim = hidden_size
        # Two linear layers with BatchNorm; the final BatchNorm has no affine parameters.
        self.net = nn.Sequential(
            nn.Linear(in_dim, middle_dim, bias=False),
            nn.BatchNorm1d(middle_dim),
            nn.ReLU(inplace=True),
            nn.Linear(middle_dim, out_dim, bias=False),
            nn.BatchNorm1d(out_dim, affine=False))

B Using Augmentations as Positive/Negative Examples

Method STS-B Avg. transfer
SimCSE 81.47 83.91
+ Additional positives
MLM 15% 73.59 83.33
random insert 15% 80.39 83.92
random delete 15% 78.58 81.80
+ Additional negatives
MLM 15% 83.02 84.49
random insert 15% 55.65 79.86
random delete 15% 55.13 82.56
+ Equivariance (Ours)
MLM 15% 84.48 85.95
randomly insert 15% 82.20 85.96
randomly delete 15% 82.59 85.97

Table 15: Development set results of STS-B and transfer tasks for using three types of augmentations (replace, insert, delete) in different ways.

In Section 5, we try different augmentations (e.g., insertion, deletion, replacement) for learning equivariance. In Table 15 we provide the results of using these augmentations as additional positive or negative examples within the SimCSE training paradigm. We observe that using these augmentations as additional positives only decreases the performance. The only method that improves performance slightly is to use MLM 15% replaced examples as additional negative examples. Overall, none of these results perform better than our proposed method, i.e., using these augmentations to learn equivariance.

C Uniformity and Alignment


Wang and Isola (2020) propose two properties, alignment and uniformity, to measure the quality of representations. Given a distribution of positive pairs p_pos and the distribution of the whole dataset p_data, alignment computes the expected distance between normalized embeddings of the paired sentences:

\ell_{\mathrm{align}} \triangleq \mathbb{E}_{(x, x^{+}) \sim p_{\mathrm{pos}}} \left\| f(x) - f(x^{+}) \right\|^{2}.

Uniformity measures how well the embeddings are uniformly distributed in the representation space:

\ell_{\mathrm{uniform}} \triangleq \log \mathbb{E}_{x, y \overset{\mathrm{i.i.d.}}{\sim} p_{\mathrm{data}}} e^{-2 \| f(x) - f(y) \|^{2}}.

The smaller the values of alignment and uniformity, the better the quality of the representation space.
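For completeness, a small sketch of how these two quantities are typically computed in PyTorch, assuming L2-normalized embeddings; the function names are illustrative and follow the common implementation accompanying Wang and Isola (2020).

import torch

def alignment(x, y):
    # x, y: (N, d) L2-normalized embeddings of positive pairs.
    return (x - y).norm(p=2, dim=1).pow(2).mean()

def uniformity(x, t=2):
    # x: (N, d) L2-normalized embeddings sampled from the whole dataset.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()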

D Source Code
We build our model on the PyTorch implementation of SimCSE7 (Gao et al., 2021), which is based on HuggingFace's Transformers package.8 We also upload our code9 and pretrained models (links in [Link]). Please follow the instructions in [Link] to reproduce the results.

E Potential Risks
On the risk side, insofar as our method utilizes pretrained language models, it may inherit and propagate some of the harmful biases present in such models. Besides that, we do not see any other potential risks in our paper.

7 [Link] SimCSE
8 [Link] transformers
9 [Link]
