DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings
Yung-Sung Chuang† Rumen Dangovski† Hongyin Luo† Yang Zhang‡ Shiyu Chang∗
Marin Soljačić† Shang-Wen Li Wen-tau Yih Yoon Kim† James Glass†
Massachusetts Institute of Technology† Meta AI
MIT-IBM Watson AI Lab‡ UC Santa Barbara∗
yungsung@[Link]
Abstract

We propose DiffCSE, an unsupervised contrastive learning framework for learning sentence embeddings. DiffCSE learns sentence embeddings that are sensitive to the difference between the original sentence and an edited sentence, where the edited sentence is obtained by stochastically masking out the original sentence and then sampling from a masked language model. We show that DiffCSE is an instance of equivariant contrastive learning (Dangovski et al., 2021), which generalizes contrastive learning and learns representations that are insensitive to certain types of augmentations and sensitive to other "harmful" types of augmentations. Our experiments show that DiffCSE achieves state-of-the-art results among unsupervised sentence representation learning methods, outperforming unsupervised SimCSE¹ by 2.3 absolute points on semantic textual similarity tasks.²

¹ SimCSE has two settings: unsupervised and supervised. In this paper, we focus on the unsupervised setting. Unless otherwise stated, we use SimCSE to refer to unsupervised SimCSE.
² Pretrained models and code are available at https://[Link]/voidism/DiffCSE.

1 Introduction

Learning "universal" sentence representations that capture rich semantic information and are at the same time performant across a wide range of downstream NLP tasks without task-specific finetuning is an important open issue in the field (Conneau et al., 2017; Cer et al., 2018; Kiros et al., 2015; Logeswaran and Lee, 2018; Giorgi et al., 2020; Yan et al., 2021; Gao et al., 2021). Recent work has shown that finetuning pretrained language models with contrastive learning makes it possible to learn good sentence embeddings without any labeled data (Giorgi et al., 2020; Yan et al., 2021; Gao et al., 2021). Contrastive learning uses multiple augmentations on a single datum to construct positive pairs whose representations are trained to be more similar to one another than those of negative pairs. While data augmentations have been central to contrastive learning in other domains (e.g., Chen et al., 2020), such augmentations have generally been unsuccessful when applied to contrastive learning of sentence embeddings. Indeed, Gao et al. (2021) find that constructing positive pairs via a simple dropout-based augmentation works much better than more complex augmentations such as word deletions or replacements based on synonyms or masked language models. This is perhaps unsurprising in hindsight; while the training objective in contrastive learning encourages representations to be invariant to augmentation transformations, direct augmentations on the input (e.g., deletion, replacement) often change the meaning of the sentence. That is, ideal sentence embeddings should not be invariant to such transformations.

We propose to learn sentence representations that are aware of, but not necessarily invariant to, such direct surface-level augmentations. This is an instance of equivariant contrastive learning (Dangovski et al., 2021), which improves vision representation learning by using a contrastive loss on insensitive image transformations (e.g., grayscale) and a prediction loss on sensitive image transformations (e.g., rotations). We operationalize equivariant contrastive learning on sentences by using dropout-based augmentation as the insensitive transformation (as in SimCSE (Gao et al., 2021)) and MLM-based word replacement as the sensitive transformation. This results in an additional cross-entropy loss based on the difference between the original and the transformed sentence.

We conduct experiments on 7 semantic textual similarity (STS) tasks and 7 transfer tasks from SentEval (Conneau and Kiela, 2018) and find that this difference-based learning greatly improves over standard contrastive learning. Our DiffCSE approach achieves around a 2.3% absolute improvement on STS datasets over SimCSE, the previous state-of-the-art model. We also conduct a set of ablation studies to justify our designed architecture. A qualitative study and analysis are also included to look into the embedding space of DiffCSE.
Figure 1: Illustration of DiffCSE. On the left-hand side is a standard SimCSE model trained with the regular contrastive loss on dropout transformations. On the right-hand side is a conditional difference prediction model which takes the sentence vector h as input and predicts the difference between x and x′′: random masking (e.g., "You never know what you're gonna get ." becomes "You [MASK] know what you're gonna [MASK] .") is filled in by a fixed generator, and a discriminator labels each token as 0 (original) or 1 (replaced) under a replaced token detection loss. During testing we discard the discriminator and only use h as the sentence embedding.
2 Background and Related Work

2.1 Learning Sentence Embeddings

Learning universal sentence embeddings has been studied extensively in prior work, including unsupervised approaches such as Skip-Thought (Kiros et al., 2015), Quick-Thought (Logeswaran and Lee, 2018) and FastSent (Hill et al., 2016), or supervised methods such as InferSent (Conneau et al., 2017), Universal Sentence Encoder (Cer et al., 2018) and Sentence-BERT (Reimers and Gurevych, 2019). Recently, researchers have focused on (unsupervised) contrastive learning approaches such as SimCLR (Chen et al., 2020) to learn sentence embeddings. SimCLR (Chen et al., 2020) learns image representations by creating semantically close augmentations for the same images and then pulling these representations to be closer than representations of random negative examples. The same framework can be adapted to learning sentence embeddings by designing good augmentation methods for natural language. ConSERT (Yan et al., 2021) uses a combination of four data augmentation strategies: adversarial attack, token shuffling, cut-off, and dropout. DeCLUTR (Giorgi et al., 2020) uses overlapped spans as positive examples and distant spans as negative examples for learning contrastive span representations. Finally, SimCSE (Gao et al., 2021) proposes an extremely simple augmentation strategy: just switching dropout masks. While simple, sentence embeddings learned in this manner have been shown to be better than other, more complicated augmentation methods.

2.2 Equivariant Contrastive Learning

DiffCSE is inspired by a recent generalization of contrastive learning in computer vision (CV) called equivariant contrastive learning (Dangovski et al., 2021). We now explain how this CV technique can be adapted to natural language.

Understanding the role of input transformations is crucial for successful contrastive learning. Past empirical studies have revealed useful transformations for contrastive learning, such as random resized cropping and color jitter for computer vision (Chen et al., 2020) and dropout for NLP (Gao et al., 2021). Contrastive learning encourages representations to be insensitive to these transformations, i.e., the encoder is trained to be invariant to a set of manually chosen transformations. The above studies in CV and NLP have also revealed transformations that are harmful for contrastive learning. For example, Chen et al. (2020) showed that making the representations insensitive to rotations decreases the ImageNet linear probe accuracy, and Gao et al. (2021) showed that using an MLM to replace 15% of the words drastically reduces performance on STS-B.
While previous works simply omit these transformations from contrastive pre-training, here we argue that we should still make use of these transformations by learning representations that are sensitive (but not necessarily invariant) to such transformations.

The notion of (in)sensitivity can be captured by the more general property of equivariance in mathematics. Let T be a transformation from a group G and let T(x) denote the transformation of a sentence x. Equivariance is the property that there is an induced group transformation T′ on the output features (Dangovski et al., 2021):

f(T(x)) = T′(f(x)).

In the special case of contrastive learning, T′'s target is the identity transformation, and we say that f is trained to be "invariant to T." However, invariance is just a trivial case of equivariance, and we can design training objectives where T′ is not the identity for some transformations (such as MLM), while it is the identity for others (such as dropout). Dangovski et al. (2021) show that generalizing contrastive learning to equivariance in this way improves the semantic quality of features in CV, and here we show that the complementary nature of invariance and equivariance extends to the NLP domain. The key observation is that the encoder should be equivariant to MLM-based augmentation instead of being invariant. We can operationalize this by using a conditional discriminator that combines the sentence representation with an edited sentence, and then predicts the difference between the original and edited sentences. This is essentially a conditional version of the ELECTRA model (Clark et al., 2020), which makes the encoder equivariant to MLM by using a binary discriminator which detects whether a token is from the original sentence or from a generator. We hypothesize that conditioning the ELECTRA model with the representation from our sentence encoder is a useful objective for encouraging f to be "equivariant to MLM."

To the best of our knowledge, we are the first to observe and highlight the above parallel between CV and NLP. In particular, we show that equivariant contrastive learning extends beyond CV, and that it works for transformations even without algebraic structures, such as diff operations on sentences. Further, insofar as the canonical set of useful transformations is less established in NLP than in CV, DiffCSE can serve as a diagnostic tool for NLP researchers to discover useful transformations.

3 Difference-based Contrastive Learning

Our approach is straightforward and can be seen as combining the standard contrastive learning objective from SimCSE (Figure 1, left) with a difference prediction objective which conditions on the sentence embedding (Figure 1, right).

Given an unlabeled input sentence x, SimCSE creates a positive example x⁺ for it by applying different dropout masks. Using the BERTbase encoder f, we can obtain the sentence embedding h = f(x) for x (see Section 4 for how h is obtained). The training objective for SimCSE is:

\mathcal{L}_{\mathrm{contrast}} = -\log \frac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j)/\tau}},

where N is the batch size for the input batch {x_i}_{i=1}^{N}, as we are using in-batch negative examples, sim(·,·) is the cosine similarity function, and τ is a temperature hyperparameter.
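For concreteness, the in-batch objective above can be sketched in PyTorch as follows. This is a minimal illustration rather than the released implementation, and the temperature value and random embeddings are placeholders.

```python
import torch
import torch.nn.functional as F

def simcse_contrastive_loss(h: torch.Tensor, h_pos: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    """In-batch contrastive loss: h[i] and h_pos[i] are two dropout views of sentence i.

    h, h_pos: (N, d) sentence embeddings. The positive for row i is h_pos[i];
    every other row in the batch acts as a negative.
    """
    h = F.normalize(h, dim=-1)
    h_pos = F.normalize(h_pos, dim=-1)
    # Cosine similarity matrix scaled by the temperature tau: (N, N)
    sim = h @ h_pos.t() / temperature
    # Diagonal entries correspond to the positive pairs.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)

if __name__ == "__main__":
    # Random tensors stand in for two forward passes of the encoder with dropout.
    h = torch.randn(8, 768)
    h_pos = h + 0.01 * torch.randn(8, 768)
    print(simcse_contrastive_loss(h, h_pos))
```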
On the right-hand side of Figure 1 is a conditional version of the difference prediction objective used in ELECTRA (Clark et al., 2020), which contains a generator and a discriminator. Given a sentence of length T, x = [x^{(1)}, x^{(2)}, ..., x^{(T)}], we first apply a random mask m = [m^{(1)}, m^{(2)}, ..., m^{(T)}], m^{(t)} ∈ {0, 1}, on x to obtain x′ = m · x. We use another pretrained MLM as the generator G to perform masked language modeling, recovering the randomly masked tokens in x′ to obtain the edited sentence x′′ = G(x′). Then, we use a discriminator D to perform the Replaced Token Detection (RTD) task: for each token in the sentence, the model needs to predict whether it has been replaced or not. The cross-entropy loss for a single sentence x is:

\mathcal{L}^{x}_{\mathrm{RTD}} = \sum_{t=1}^{T} \Big( -\mathbf{1}\big(x''^{(t)} = x^{(t)}\big) \log D(x'', h, t) - \mathbf{1}\big(x''^{(t)} \neq x^{(t)}\big) \log\big(1 - D(x'', h, t)\big) \Big),

and the training objective for a batch is \mathcal{L}_{\mathrm{RTD}} = \sum_{i=1}^{N} \mathcal{L}^{x_i}_{\mathrm{RTD}}. Finally, we optimize these two losses together with a weighting coefficient λ:

\mathcal{L} = \mathcal{L}_{\mathrm{contrast}} + \lambda \cdot \mathcal{L}_{\mathrm{RTD}}.

The difference between our model and ELECTRA is that our discriminator D is conditional, so it can use the information of x compressed in a fixed-dimension vector h = f(x). The gradient of D can be back-propagated into f through h. By doing so, f will be encouraged to make h informative enough to cover the full meaning of x, so that D can distinguish the tiny difference between x and x′′. This approach essentially makes the conditional discriminator perform a "diff operation", hence the name DiffCSE.

When we train our DiffCSE model, we fix the generator G, and only the sentence encoder f and the discriminator D are optimized. After training, we discard D and only use f (which remains fixed) to extract sentence embeddings for evaluation on the downstream tasks.
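Putting the two branches together, a heavily simplified PyTorch sketch of the combined objective is shown below. The TinyLM modules, the mean pooling, the way h is added to the discriminator's token embeddings, and the hyperparameter values are all illustrative assumptions (the actual DiffCSE implementation builds on BERT/RoBERTa with a DistilBERT generator), but the plumbing follows Section 3: a fixed generator fills masked tokens, RTD labels mark replaced positions, the discriminator is conditioned on h, and the total loss is L = L_contrast + λ·L_RTD.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, MASK_ID = 1000, 64, 0  # toy sizes; real models use BERT-scale values

class TinyLM(nn.Module):
    """Toy Transformer stand-in for the encoder f, generator G, and discriminator D."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
        self.lm_head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, x, cond=None):
        e = self.emb(x)
        if cond is not None:          # crude conditioning: add h to every token embedding
            e = e + cond.unsqueeze(1)
        return self.layer(e)          # (N, T, HIDDEN)

f = TinyLM()                          # sentence encoder (trained)
G = TinyLM().eval()                   # generator (kept fixed, no dropout)
D_backbone = TinyLM()                 # discriminator backbone (trained)
rtd_head = nn.Linear(HIDDEN, 1)       # per-token "replaced?" classifier

def diffcse_step(x, lam=0.005, mask_ratio=0.3, tau=0.05):
    N, T = x.shape
    # Contrastive branch: two forward passes = two dropout views of the same batch.
    h = f(x).mean(dim=1)
    h_pos = f(x).mean(dim=1)
    sim = F.normalize(h, dim=-1) @ F.normalize(h_pos, dim=-1).t() / tau
    l_contrast = F.cross_entropy(sim, torch.arange(N))
    # Difference branch: mask, regenerate with the fixed generator, detect replacements.
    mask = torch.rand(N, T) < mask_ratio
    x_masked = x.masked_fill(mask, MASK_ID)
    with torch.no_grad():
        x_edit = torch.where(mask, G.lm_head(G(x_masked)).argmax(-1), x)
    rtd_labels = (x_edit != x).float()            # 1 = replaced, 0 = original
    logits = rtd_head(D_backbone(x_edit, cond=h)).squeeze(-1)
    l_rtd = F.binary_cross_entropy_with_logits(logits, rtd_labels)
    # Gradients of the RTD loss flow back into f only through h (the conditioning vector).
    return l_contrast + lam * l_rtd

print(diffcse_step(torch.randint(1, VOCAB, (8, 16))))
```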
4 Experiments

4.1 Setup

In our experiments, we follow the setting of unsupervised SimCSE (Gao et al., 2021) and build our model based on their PyTorch implementation.³ We also use the checkpoints of BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) as the initialization of our sentence encoder f. We add an MLP layer with Batch Normalization (BatchNorm) (Ioffe and Szegedy, 2015) on top of the [CLS] representation as the sentence embedding. We compare the model with and without BatchNorm in Section 5. For the discriminator D, we use the same model as the sentence encoder f (BERT/RoBERTa). For the generator G, we use the smaller DistilBERT and DistilRoBERTa (Sanh et al., 2019) for efficiency. Note that the generator is fixed during training, unlike in the ELECTRA paper (Clark et al., 2020). We compare the results of using different model sizes for the generator in Section 5. More training details are shown in Appendix A.
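The pooler described above can be sketched as follows. The depth, widths, and activation are illustrative assumptions rather than the released configuration; the key point is BatchNorm applied on top of the [CLS] hidden state.

```python
import torch
import torch.nn as nn

class ProjectionMLP(nn.Module):
    """Maps the encoder's [CLS] hidden state to the final sentence embedding."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_size, hidden_size),
            nn.BatchNorm1d(hidden_size),
        )

    def forward(self, cls_hidden: torch.Tensor) -> torch.Tensor:
        # cls_hidden: (batch, hidden_size) taken from the [CLS] token
        return self.net(cls_hidden)

pooler = ProjectionMLP()
h = pooler(torch.randn(16, 768))  # (16, 768) sentence embeddings
```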
4.2 Data

For unsupervised pretraining, we use the same 10⁶ randomly sampled sentences from English Wikipedia that are provided by the source code of SimCSE.³ We evaluate our model on 7 semantic textual similarity (STS) tasks and 7 transfer tasks in SentEval.⁴ The STS tasks include STS 2012–2016 (Agirre et al., 2016), STS Benchmark (Cer et al., 2017) and SICK-Relatedness (Marelli et al., 2014). All the STS experiments are fully unsupervised, which means no STS training datasets are used and all embeddings are fixed once they are trained. The transfer tasks are various sentence classification tasks, including MR (Pang and Lee, 2005), CR (Hu and Liu, 2004), SUBJ (Pang and Lee, 2004), MPQA (Wiebe et al., 2005), SST-2 (Socher et al., 2013), TREC (Voorhees and Tice, 2000) and MRPC (Dolan and Brockett, 2005). In these transfer tasks, we use a logistic regression classifier trained on top of the frozen sentence embeddings, following the standard setup (Conneau and Kiela, 2018).

³ [Link]
⁴ [Link]

4.3 Results

Baselines We compare our model with many strong unsupervised baselines including SimCSE (Gao et al., 2021), IS-BERT (Zhang et al., 2020), CMLM (Yang et al., 2020), DeCLUTR (Giorgi et al., 2020), CT-BERT (Carlsson et al., 2021), SG-OPT (Kim et al., 2021) and some post-processing methods like BERT-flow (Li et al., 2020) and BERT-whitening (Su et al., 2021), along with some naive baselines like averaged GloVe embeddings (Pennington et al., 2014) and averaged first- and last-layer BERT embeddings.

Semantic Textual Similarity (STS) We show the results on the STS tasks in Table 1, including BERTbase (upper part) and RoBERTabase (lower part). We also reproduce the previous state-of-the-art SimCSE (Gao et al., 2021). DiffCSE-BERTbase significantly outperforms SimCSE-BERTbase, raising the averaged Spearman's correlation from 76.25% to 78.49%. For the RoBERTa model, DiffCSE-RoBERTabase also improves upon SimCSE-RoBERTabase, from 76.57% to 77.80%.

Transfer Tasks We show the results on the transfer tasks in Table 2. Compared with SimCSE-BERTbase, DiffCSE-BERTbase improves the averaged score from 85.56% to 86.86%. When applied to the RoBERTa model, DiffCSE-RoBERTabase also improves upon SimCSE-RoBERTabase, from 84.84% to 87.04%. Note that CMLM-BERTbase (Yang et al., 2020) can achieve even better performance than DiffCSE. However, it uses 1TB of training data from Common Crawl dumps, while our model only uses 115MB of Wikipedia data for pretraining. We put their scores in Table 2 for reference. In SimCSE, the authors propose to use MLM as an auxiliary task for the sentence encoder to further boost the performance on transfer tasks. Compared with the results of SimCSE with MLM, DiffCSE still shows a small improvement of around 0.2%.
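The transfer-task evaluation described in Section 4.2 (a logistic regression classifier fit on frozen sentence embeddings) can be sketched as below. This is a simplified stand-in rather than SentEval itself: SentEval tunes the regularization strength and uses each task's own splits, and the embeddings and labels here are random placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Frozen sentence embeddings (e.g., encoder outputs) and task labels; random stand-ins here.
X = np.random.randn(2000, 768).astype(np.float32)   # embeddings, never fine-tuned
y = np.random.randint(0, 2, size=2000)               # e.g., binary sentiment labels (MR)

clf = LogisticRegression(max_iter=1000)               # only the probe is trained
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
print(f"transfer-task accuracy: {acc:.4f}")
```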
Model STS12 STS13 STS14 STS15 STS16 STS-B SICK-R Avg.
GloVe embeddings (avg.)♣ 55.14 70.66 59.73 68.25 63.66 58.02 53.76 61.32
BERTbase (first-last avg.)♦ 39.70 59.38 49.67 66.03 66.19 53.87 62.06 56.70
BERTbase -flow♦ 58.40 67.10 60.85 75.16 71.22 68.66 64.47 66.55
BERTbase -whitening♦ 57.83 66.90 60.90 75.08 71.31 68.24 63.73 66.28
IS-BERTbase ♥ 56.77 69.24 61.21 75.23 70.16 69.21 64.25 66.58
CMLM-BERTbase ♠ (1TB data) 58.20 61.07 61.67 73.32 74.88 76.60 64.80 67.22
CT-BERTbase ♦ 61.63 76.80 68.47 77.50 76.48 74.31 69.19 72.05
SG-OPT-BERTbase † 66.84 80.13 71.23 81.56 77.17 77.23 68.16 74.62
SimCSE-BERTbase ♦ 68.40 82.41 74.38 80.91 78.56 76.85 72.23 76.25
∗ SimCSE-BERTbase (reproduce) 70.82 82.24 73.25 81.38 77.06 77.24 71.16 76.16
∗ DiffCSE-BERTbase 72.28 84.43 76.47 83.90 80.54 80.59 71.23 78.49
RoBERTabase (first-last avg.)♦ 40.88 58.74 49.07 65.63 61.48 58.55 61.63 56.57
RoBERTabase -whitening♦ 46.99 63.24 57.23 71.36 68.99 61.36 62.91 61.73
DeCLUTR-RoBERTabase ♦ 52.41 75.19 65.52 77.12 78.63 72.41 68.62 69.99
SimCSE-RoBERTabase ♦ 70.16 81.77 73.24 81.36 80.65 80.22 68.56 76.57
∗ SimCSE-RoBERTabase (reproduce) 68.60 81.36 73.16 81.61 80.76 80.58 68.83 76.41
∗ DiffCSE-RoBERTabase 70.05 83.43 75.49 82.81 82.12 82.38 71.19 78.21
Table 1: The performance on STS tasks (Spearman’s correlation) for different sentence embedding models. ♣:
results from Reimers and Gurevych (2019); ♥: results from Zhang et al. (2020); ♦: results from Gao et al. (2021);
♠: results from Yang et al. (2020); †: results from Kim et al. (2021); ∗: results from our experiments.
5 Ablation Studies

In the following sections, we perform an extensive series of ablation studies that support our model design. We use the BERTbase model and evaluate on the development set of STS-B and the transfer tasks.

Removing Contrastive Loss In our model, both the contrastive loss and the RTD loss are crucial because they maintain what should be sensitive and what should be insensitive, respectively. If we remove the RTD loss, the model becomes a SimCSE model; if we remove the contrastive loss, the performance on STS-B drops significantly by 30%, while the average score on the transfer tasks also drops by 2% (see Table 3). This result shows that it is important to have insensitive and sensitive attributes that exist together in the representation space.

Next Sentence vs. Same Sentence Some methods for unsupervised sentence embeddings like Quick-Thoughts (Logeswaran and Lee, 2018) and CMLM (Yang et al., 2020) predict the next sentence as the training objective. We also experiment with a variant of DiffCSE by conditioning the ELECTRA loss on the next sentence. Note that this kind of model is not doing a "diff operation" between two similar sentences, and is not an instance of equivariant contrastive learning. As shown in Table 3 (use next sent. for x′), the score on STS-B decreases significantly compared to DiffCSE while transfer performance remains similar. We also tried using the same sentence and the next sentence at the same time for conditioning the ELECTRA objective (use same+next sent. for x′), and did not observe improvements.

Other Conditional Pretraining Tasks Instead of a conditional binary difference prediction loss, we can also consider other conditional pretraining tasks, such as the conditional MLM objective proposed by Yang et al. (2020), or corrective language modeling,⁵ proposed by COCO-LM (Meng et al., 2021). We experiment with these objectives instead of the difference prediction objective in Table 3. We observe that conditional MLM on the same sentence does not improve the performance on either STS-B or the transfer tasks compared with DiffCSE. Conditional MLM on the next sentence performs even worse for STS-B, but slightly better than using the same sentence on the transfer tasks. Using both the same and the next sentence also does not improve the performance compared with DiffCSE. For the corrective LM objective, the performance on STS-B decreases significantly compared with DiffCSE.

Augmentation Methods: Insert/Delete/Replace In DiffCSE, we use MLM token replacement as the equivariant augmentation. It is possible to use other methods like random insertion or deletion instead of replacement.⁶ For insertion, we choose to

⁵ This task is similar to ELECTRA. However, instead of a binary classifier for replaced token detection, corrective LM uses a vocabulary-size classifier with a copy mechanism to recover the replaced tokens.
⁶ Edit distance operators include insert, delete and replace.
Model MR CR SUBJ MPQA SST TREC MRPC Avg.
GloVe embeddings (avg.)♣ 77.25 78.30 91.17 87.85 80.18 83.00 72.87 81.52
Skip-thought♥ 76.50 80.10 93.60 87.10 82.00 92.20 73.00 83.50
Avg. BERT embeddings♣ 78.66 86.25 94.37 88.66 84.40 92.80 69.54 84.94
BERT-[CLS]embedding♣ 78.68 84.85 94.21 88.23 84.13 91.40 71.13 84.66
IS-BERTbase ♥ 81.09 87.18 94.96 88.75 85.96 88.64 74.24 85.83
SimCSE-BERTbase ♦ 81.18 86.46 94.45 88.88 85.50 89.80 74.43 85.81
w/ MLM 82.92 87.23 95.71 88.73 86.81 87.01 78.07 86.64
∗ DiffCSE-BERTbase 82.69 87.23 95.23 89.28 86.60 90.40 76.58 86.86
CMLM-BERTbase (1TB data) 83.60 89.90 96.20 89.30 88.50 91.00 69.70 86.89
SimCSE-RoBERTabase ♦ 81.04 87.74 93.28 86.94 86.60 84.60 73.68 84.84
w/ MLM 83.37 87.76 95.05 87.16 89.02 90.80 75.13 86.90
∗ DiffCSE-RoBERTabase 82.82 88.61 94.32 87.71 88.63 90.40 76.81 87.04
Table 2: Transfer task results of different sentence embedding models (measured as accuracy). ♣: results from
Reimers and Gurevych (2019); ♥: results from Zhang et al. (2020); ♦: results from Gao et al. (2021).
BERT's performance due to knowledge distillation. We show our results in Table 6; we can see that the performance on the transfer tasks does not change much with different generators. However, the score on STS-B decreases as we switch from BERT-medium to BERT-tiny. This finding is not the same as for ELECTRA, which works best with generators 1/4–1/2 the size of the discriminator. Because our discriminator is conditioned on sentence vectors, it will be easier for the discriminator to perform the RTD task. As a result, using stronger generators (BERTbase, DistilBERTbase) to increase the difficulty of RTD helps the discriminator learn better. However, when using a large model like BERTlarge, the task may be too challenging for the discriminator. In our experiments, using DistilBERTbase, whose ability is close to but slightly worse than BERTbase, gives us the best performance.

Masking Ratio In our conditional ELECTRA task, we can mask the original sentence at different ratios for the generator to produce MLM-based augmentations. A higher masking ratio will make more perturbations to the sentence. Our empirical result in Table 7 shows that the difference between different masking ratios is small (within 15%–40%), and a masking ratio of around 30% gives us the best performance.

Coefficient λ In Section 3, we use the coefficient λ to weight the ELECTRA loss before adding it to the contrastive loss. Because the contrastive learning objective is a relatively easier task, the scale of the contrastive loss will be 100 to 1000 times smaller than the ELECTRA loss.

6.1 Qualitative Study

A very common application for sentence embeddings is the retrieval task. Here we show some retrieval examples to qualitatively explain why DiffCSE can perform better than SimCSE. In this study, we use the 2758 sentences from the STS-B testing set as the corpus, and then use a sentence query to retrieve the nearest neighbors in the sentence embedding space by computing cosine similarities. We show the retrieved top-3 examples in Table 9. The first query sentence is "you can do it, too.". The SimCSE model retrieves a very similar sentence with a slightly different meaning ("you can use it, too.") as the rank-1 answer. In contrast, DiffCSE can distinguish the tiny difference, so it retrieves the ground-truth answer as the rank-1 answer. The second query sentence is "this is not a problem". SimCSE retrieves a sentence with the opposite meaning but very similar wording, while DiffCSE retrieves the correct answer with less similar wording. We also provide a third example where both SimCSE and DiffCSE fail to retrieve the correct answer for a query sentence using double negation.

6.2 Retrieval Task

Besides the qualitative study, we also show the quantitative result of the retrieval task. Here we also use all the 2758 sentences in the testing set of STS-B as the corpus. There are 97 positive pairs in this corpus (with 5-out-of-5 semantic similarity scores from human annotation). For each positive pair, we use one sentence to retrieve the other one, and see whether the other sentence is in the top-1/5/10 ranking. The recall@1/5/10 of the retrieval task is shown in Table 10. We can observe that DiffCSE outperforms SimCSE at recall@1, recall@5 and recall@10.
Query: you can do it, too.
SimCSE-BERTbase: 1) you can use it, too. 2) can you do it? 3) yes, you can do it.
DiffCSE-BERTbase: 1) yes, you can do it. 2) you can use it, too. 3) can you do it?
Query: this is not a problem.
SimCSE-BERTbase: 1) this is a big problem. 2) you have a problem. 3) i don't see why that should be a problem.
DiffCSE-BERTbase: 1) i don't see why this could be a problem. 2) i don't see why that should be a problem. 3) this is a big problem.
Query: i think that is not a bad idea.
SimCSE-BERTbase: 1) i do not think it's a good idea. 2) it's not a good idea. 3) it is not a good idea.
DiffCSE-BERTbase: 1) i do not think it's a good idea. 2) it is not a good idea. 3) but it is not a good idea.
Table 9: Retrieved top-3 examples by SimCSE and DiffCSE from the STS-B test set.

Model/Recall @1 @5 @10
SimCSE-BERTbase 77.84 92.78 95.88
DiffCSE-BERTbase 78.87 95.36 97.42
Table 10: The retrieval results for SimCSE and DiffCSE.

Model Alignment Uniformity STS
Avg. BERTbase 0.172 -1.468 56.70
SimCSE-BERTbase 0.177 -2.313 76.16
DiffCSE-BERTbase 0.097 -1.438 78.49
Table 11: Alignment and uniformity (Wang and Isola, 2020) measured on the STS-B test set for SimCSE and DiffCSE. Smaller numbers are better. We also show the averaged STS score in the right-most column.

This may be caused by the fact that ELECTRA and other Transformer-based pretrained LMs have the problem of squeezing the representation space, as mentioned by Meng et al. (2021). As we use the sentence embeddings as the input of ELECTRA to perform conditional ELECTRA training, the sentence embedding will inevitably be squeezed to fit the input distribution of ELECTRA. We follow prior studies (Wang and Isola, 2020; Gao et al., 2021) and use uniformity and alignment (details in Appendix C) to measure the quality of the representation space for DiffCSE and SimCSE in Table 11. Compared to averaged BERT embeddings, SimCSE has similar alignment (0.177 vs. 0.172) but better uniformity (-2.313). In contrast, DiffCSE has similar uniformity to Avg. BERT (-1.438 vs. -1.468) but much better alignment (0.097). This indicates that SimCSE and DiffCSE are optimizing the representation space in two different directions, and the improvement of DiffCSE may come from its better alignment.
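Alignment and uniformity as reported in Table 11 follow the definitions of Wang and Isola (2020); the sketch below implements those standard definitions on L2-normalized embeddings. The paper's exact pair sampling (described in Appendix C) may differ, and the inputs here are random stand-ins.

```python
import torch
import torch.nn.functional as F

def alignment(x: torch.Tensor, y: torch.Tensor, alpha: float = 2) -> torch.Tensor:
    """Wang & Isola (2020) alignment: mean distance between positive pairs (x[i], y[i])."""
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniformity(x: torch.Tensor, t: float = 2) -> torch.Tensor:
    """Wang & Isola (2020) uniformity: log of the mean Gaussian potential over all pairs."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

# Embeddings are L2-normalized before computing both metrics (e.g., on STS-B positive pairs).
emb_a = F.normalize(torch.randn(128, 768), dim=-1)
emb_b = F.normalize(emb_a + 0.05 * torch.randn(128, 768), dim=-1)
print(alignment(emb_a, emb_b).item(), uniformity(torch.cat([emb_a, emb_b])).item())
```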
7 Conclusion
References

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association for Computational Linguistics.

Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 169–174.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR.

Xinlei Chen and Kaiming He. 2021. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758.

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Alexis Conneau and Douwe Kiela. 2018. SentEval: An evaluation toolkit for universal sentence representations.

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing.

John M Giorgi, Osvald Nitski, Gary D Bader, and Bo Wang. 2020. DeCLUTR: Deep contrastive learning for unsupervised textual representations. arXiv preprint arXiv:2006.03659.

Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Remi Munos, and Michal Valko. 2020. Bootstrap your own latent - a new approach to self-supervised learning. In Advances in Neural Information Processing Systems, volume 33, pages 21271–21284. Curran Associates, Inc.

Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483.

Minqing Hu and Bing Liu. 2004. Mining and summarizing customer reviews. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 168–177.

Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. 2021. On feature decorrelation in self-supervised learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9598–9608.

Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR.

Taeuk Kim, Kang Min Yoo, and Sang-goo Lee. 2021. Self-guided contrastive learning for BERT sentence representations. arXiv preprint arXiv:2106.07345.

Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. 2015. Skip-thought vectors, pages 3294–3302.

Bohan Li, Hao Zhou, Junxian He, Mingxuan Wang, Yiming Yang, and Lei Li. 2020. On the sentence embeddings from pre-trained language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9119–9130, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.

Lajanugen Logeswaran and Honglak Lee. 2018. An efficient framework for learning sentence representations.

Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, Roberto Zamparelli, et al. 2014. A SICK cure for the evaluation of compositional distributional semantic models. In LREC, pages 216–223. Reykjavik.

Yu Meng, Chenyan Xiong, Payal Bajaj, Saurabh Tiwary, Paul Bennett, Jiawei Han, and Xia Song. 2021. COCO-LM: Correcting and contrasting text sequences for language model pretraining. arXiv preprint arXiv:2102.08473.

Bo Pang and Lillian Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. arXiv preprint cs/0409058.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 115–124.

Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou. 2021. Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316.

Iulia Turc, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Well-read students learn better: On the importance of pre-training compact models. arXiv preprint arXiv:1908.08962.

Ellen M Voorhees and Dawn M Tice. 2000. Building a question answering test collection. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 200–207.

Tongzhou Wang and Phillip Isola. 2020. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929–9939. PMLR.

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210.

Yuanmeng Yan, Rumei Li, Sirui Wang, Fuzheng Zhang, Wei Wu, and Weiran Xu. 2021. ConSERT: A contrastive framework for self-supervised sentence representation transfer. arXiv preprint arXiv:2105.11741.

Ziyi Yang, Yinfei Yang, Daniel Cer, Jax Law, and Eric Darve. 2020. Universal sentence representation learning with conditional masked language model. arXiv preprint arXiv:2012.14388.

Yan Zhang, Ruidan He, Zuozhu Liu, Kwan Hui Lim, and Lidong Bing. 2020. An unsupervised sentence embedding method by mutual information maximization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1601–1610.
D Source Code
We build our model using the PyTorch implementation of SimCSE⁷ (Gao et al., 2021), which is based on the HuggingFace Transformers package.⁸ We also upload our code⁹ and pretrained models (links in [Link]). Please follow the instructions in [Link] to reproduce the results.
E Potential Risks
On the risk side, insofar as our method utilizes pretrained language models, it may inherit and propagate some of the harmful biases present in such models. Besides that, we do not see any other potential risks in our paper.
⁷ [Link]SimCSE
⁸ [Link]transformers
⁹ [Link]