This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.
The dataset can be downloaded here:
For further details, see the accompanying paper: PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification
Note: for multilingual experiments, please use dev_2k.tsv provided in the PAWS-X repo as the development sets for all languages, including English.
Note: As discussed here, a small number of samples in the translated dev and test sets contain the placeholder "NS". Please make sure you clean them up before use.
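One possible way to do this cleanup is sketched below. It assumes each row has already been parsed into a dict keyed by the column names `sentence1` and `sentence2`, and that the placeholder appears as the entire value of one of those fields. The helper name `drop_placeholder_rows` is hypothetical and not part of the PAWS-X repo.

```python
PLACEHOLDER = "NS"  # the placeholder string named in the note above


def drop_placeholder_rows(rows):
    """Return only the rows in which neither sentence is the placeholder.

    Assumption: a bad sample has "NS" as the whole (whitespace-stripped)
    value of sentence1 or sentence2. Sentences that merely contain the
    letters "NS" inside longer text are kept.
    """
    return [
        row
        for row in rows
        if row["sentence1"].strip() != PLACEHOLDER
        and row["sentence2"].strip() != PLACEHOLDER
    ]
```

Dropping the rows is only one option. Depending on the experiment, it may be preferable to log the affected IDs so that the same samples are excluded across all languages.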
All files are in TSV format with four columns:
| Column Name | Data |
|---|---|
| id | An ID that matches the ID of the source pair in PAWS-Wiki |
| sentence1 | The first sentence |
| sentence2 | The second sentence |
| label | Label for each pair |
The source text of each translation can be retrieved by looking up the ID in the corresponding file in PAWS-Wiki.
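The four-column layout and the ID lookup described above can be sketched as follows. This is a minimal example rather than an official loader. It assumes each file starts with a header row naming the four columns, that quotation marks inside sentences should be read literally (hence `csv.QUOTE_NONE`), and that IDs are unique within a file. The function names `read_pairs` and `index_by_id` are hypothetical.

```python
import csv
import io

EXPECTED_COLUMNS = ["id", "sentence1", "sentence2", "label"]


def read_pairs(handle):
    """Parse an open TSV text stream into a list of dicts, one per pair."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    return list(reader)


def index_by_id(rows):
    """Map each id to its row so the matching source pair can be looked up."""
    return {row["id"]: row for row in rows}


# Tiny in-memory stand-in for a real file, for illustration only.
sample = io.StringIO(
    "id\tsentence1\tsentence2\tlabel\n"
    '7\tHe said "yes".\tHe agreed.\t1\n'
)
rows = read_pairs(sample)
print(index_by_id(rows)["7"]["sentence2"])  # prints: He agreed.
```

For a file on disk, the same function can be called on `open(path, encoding="utf-8", newline="")`, where `path` is wherever the file was saved. All values, including `id` and `label`, come back as strings and may need converting.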
The numbers of examples for each of the six languages are shown below:
| Language | Train | Dev | Test |
|---|---|---|---|
| fr | 49,401 | 1,992 | 1,985 |
| es | 49,401 | 1,962 | 1,999 |
| de | 49,401 | 1,932 | 1,967 |
| zh | 49,401 | 1,984 | 1,975 |
| ja | 49,401 | 1,980 | 1,946 |
| ko | 49,401 | 1,965 | 1,972 |
| Total | 296,406 | 11,815 | 11,844 |
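As a quick sanity check, the totals can be re-derived from the per-language rows. The snippet below simply copies the numbers from the table; the name `column_totals` is hypothetical.

```python
# Per-language (train, dev, test) counts copied from the table above.
COUNTS = {
    "fr": (49401, 1992, 1985),
    "es": (49401, 1962, 1999),
    "de": (49401, 1932, 1967),
    "zh": (49401, 1984, 1975),
    "ja": (49401, 1980, 1946),
    "ko": (49401, 1965, 1972),
}


def column_totals(counts):
    """Sum the train, dev, and test columns across all languages."""
    return tuple(sum(column) for column in zip(*counts.values()))


train_total, dev_total, test_total = column_totals(COUNTS)
assert (train_total, dev_total, test_total) == (296406, 11815, 11844)
# The 23,659 human-translated evaluation pairs are dev and test combined.
assert dev_total + test_total == 23659
```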
Caveat: the dev and test sets of PAWS-X are both sourced from the dev set of PAWS-Wiki. As a consequence, the same sentence 1 may appear in both the dev and test sets. Nevertheless, our data split guarantees that no sentence pair (sentence 1 + sentence 2) appears in both dev and test.
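The distinction between sentence-level and pair-level overlap can be checked with a short sketch like the one below. It assumes the dev and test rows are already loaded as dicts with `sentence1` and `sentence2` keys and compares them by exact string match. The function name `overlap_report` is hypothetical.

```python
def overlap_report(dev_rows, test_rows):
    """Count first sentences and full sentence pairs shared by dev and test."""
    dev_first = {row["sentence1"] for row in dev_rows}
    test_first = {row["sentence1"] for row in test_rows}
    dev_pairs = {(row["sentence1"], row["sentence2"]) for row in dev_rows}
    test_pairs = {(row["sentence1"], row["sentence2"]) for row in test_rows}
    return {
        # May be non-zero: the same sentence 1 can occur in both sets.
        "shared_sentence1": len(dev_first & test_first),
        # Expected to be zero according to the caveat above.
        "shared_pairs": len(dev_pairs & test_pairs),
    }
```

Because the comparison is by exact string match, differences in whitespace or normalization would hide an overlap, so the result is only a rough check.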
If you use or discuss this dataset in your work, please cite our paper:
@InProceedings{pawsx2019emnlp,
title = {{PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification}},
author = {Yang, Yinfei and Zhang, Yuan and Tar, Chris and Baldridge, Jason},
booktitle = {Proc. of EMNLP},
year = {2019}
}
The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.
| property | value |
|---|---|
| name | PAWS-X |
| alternateName | Paraphrase Adversaries from Word Scrambling |
| description | PAWS-X dataset contains 23,659 human translated PAWS evaluation pairs and 296,406 machine translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki. |
| url | https://github.com/google-research-datasets/paws/tree/master/pawsx |
| provider | |
| license | |
| citation | Yinfei Yang, Yuan Zhang, Chris Tar, Jason Baldridge "PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification", Proceedings of EMNLP, 2019 |