We propose the cross-lingual knowledge grounded conversation (CKGC) task, which grounds open-domain dialogue in cross-lingual knowledge. We collect a test collection (see dataset) and propose a curriculum self-knowledge distillation scheme for CKGC.
Paper link: Conversations Powered by Cross-Lingual Knowledge
The annotated CKGC data is in /dataset. An example from zh.json:
{
'topic': 'Red meat', // The start topic of the conversation
'dialogue': [ // Conversation content
{
'role': 'Apprentice', // The role of the speaker. Can be 'Apprentice' or 'Wizard'
'text': '这个具体指哪些肉呢,日常生活中常见吗', // The content of the message
'knowledge_pool': [], // Apprentice does not use knowledge
'selected_knowledge': ""
},
{
'role': 'Wizard', // Now the speaker is Wizard.
'text': '从营养学的角度来说,红肉一般含有更多肌红蛋白,比如牛肉啦',
'knowledge_pool': { // The candidate knowledge that wizard sees.
'red meat':[ // Title of the article.
'red meat is a source of lipoic acid.', // A sentence in the article.
'in nutritional science, "red meat" is defined as any meat that has more of the protein myoglobin than white meat.'
...
],
...
},
'selected_knowledge': 'in nutritional science, "red meat" is defined as any meat that has more of the protein myoglobin than white meat.' // The sentence selected by the Wizard: either a sentence from the knowledge pool or no_passage_used.
},
...
]
}
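The structure above can be walked with a few lines of Python. This is a minimal sketch, not code from the repository: the helper names `load_conversations`, `find_title`, and `iter_wizard_turns` are hypothetical, and it assumes that the files on disk are strict JSON (without the explanatory comments shown above) and hold either one conversation object or a list of them.

```python
import json


def load_conversations(path):
    """Load a data file such as zh.json.

    Assumption: the file is strict JSON and contains either a single
    conversation object or a list of conversation objects.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data if isinstance(data, list) else [data]


def find_title(knowledge_pool, sentence):
    """Return the article title whose sentence list contains `sentence`, else None."""
    if not isinstance(knowledge_pool, dict):
        return None  # Apprentice turns carry an empty pool
    for title, sentences in knowledge_pool.items():
        if sentence in sentences:
            return title
    return None


def iter_wizard_turns(conversation):
    """Yield (response text, selected knowledge, source article title) per Wizard turn."""
    for turn in conversation["dialogue"]:
        if turn["role"] != "Wizard":
            continue
        selected = turn["selected_knowledge"]
        yield turn["text"], selected, find_title(turn["knowledge_pool"], selected)


# In-memory copy of the example above, with the elided parts left out.
example = {
    "topic": "Red meat",
    "dialogue": [
        {
            "role": "Apprentice",
            "text": "这个具体指哪些肉呢,日常生活中常见吗",
            "knowledge_pool": [],
            "selected_knowledge": "",
        },
        {
            "role": "Wizard",
            "text": "从营养学的角度来说,红肉一般含有更多肌红蛋白,比如牛肉啦",
            "knowledge_pool": {
                "red meat": [
                    "red meat is a source of lipoic acid.",
                    'in nutritional science, "red meat" is defined as any meat '
                    "that has more of the protein myoglobin than white meat.",
                ]
            },
            "selected_knowledge": 'in nutritional science, "red meat" is defined as '
            "any meat that has more of the protein myoglobin than white meat.",
        },
    ],
}

for text, selected, title in iter_wizard_turns(example):
    print(f"[{title}] {selected}\n  -> {text}")
```

`find_title` returns None whenever the selected knowledge is not a sentence in the pool, which covers turns where the Wizard used no passage.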
The dataset was collected with the annotation systems (test system).
We are also collecting and sharing a Multilingual Conversation Corpus (in a uniform format), which includes:
- ~40M Reddit dialogue data in 36 languages, cleaned from 6 years of Reddit data,
- ~0.8M personalized dialogue data in 7 languages, translated from persona dialogue,
- ~0.3M knowledge-grounded dialogue data in 4 languages, translated from Wizard-of-Wikipedia, and
- 12K CKGC test data in 4 languages, annotated in this project.
The code for reproducing the cross-lingual knowledge selection and response generation models is available at modeling/train_retriever.py and modeling/train_generator.py, respectively.
@inproceedings{Sun:2021:CPC,
author = {Sun, Weiwei and Meng, Chuan and Meng, Qi and Ren, Zhaochun and Ren, Pengjie and Chen, Zhumin and de Rijke, Maarten},
title = {Conversations Powered by Cross-Lingual Knowledge},
booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
series = {SIGIR '21},
year = {2021},
publisher = {ACM}
}