This repository was created to reproduce DialSummEval: Revisiting summarization evaluation for dialogues. It contains the code and data of both the original authors and ourselves.
The link to the reproduction paper will be included here.
- Annotations and Correlations
  - data
    - ablation_results_annX.xlsx (where X is 1-3, for three annotators)
    - original_paper_results.json
    - saved_df_annX.csv (where X is 1-3, for three annotators)
  - utils
    - iaa_util.py
    - ablation_corr.py
    - originalVpresent_corr.py
    - iaa_calculation.py
  - IAA_CORR_calc.ipynb
  - Annotation_tool.ipynb
- data
  - figures (output from analysis.py)
  - original_annotations
    - human_judgment.jsonl
- reproduce
  - analysis
    - models_eval_new
      - ... (scores and summaries for and from each model)
    - analysis.py
  - models_eval_new
  - metrics
    - ... (metrics)
  - analysis
- environment.yml
- requirements.txt
Anaconda environments have been used for this repository. You can set up the environment using the following steps:
- Install Anaconda
- Clone this repository
- Navigate to the repository directory and run the following command in the terminal/command prompt:
  `conda env create -f environment.yml`
- Activate the environment using the following command:
  `conda activate combotenv`
NOTE: It is advised to use separate environments for the metrics.
The required packages are listed in the requirements.txt file. You can install them by running the following command in the terminal/command prompt: pip install -r requirements.txt. Some metrics require packages that are not available on PyPI; this is specified where applicable.
NOTE: see the CUDA installation guides for Windows and Linux.
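Several of the neural metrics rely on PyTorch, so a quick way to verify that the CUDA installation is picked up (assuming PyTorch is installed in the active environment) is:

```python
# Sanity check that PyTorch detects the GPU; run inside the activated environment.
# Assumes PyTorch is installed (several of the metric packages require it).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```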
The annotation tool is located in the directory .\Annotations and Correlations, which contains a separate README.md. The Jupyter Notebook (Annotation_tool.ipynb) can be used to annotate the dialogue summaries. The reproduced annotations are also stored in this directory.
We conducted an ablation study to examine the impact of the annotation tool on the annotation procedure. 140 summaries (14 summaries for each of 10 randomly selected dialogues) were annotated using the same method as in the original paper, i.e. an Excel sheet in which each model's summaries were displayed on a separate sheet. The results show a strong correlation between the scores obtained through the tool and those obtained through the original annotation process, supporting the use of the tool (a sketch of this correlation computation follows the table below).
| Dimension | Reproduction vs. Original | Full Reproduction vs. Ablation |
|---|---|---|
| Coherence | 0.42 | 0.70 |
| Consistency | 0.77 | 0.66 |
| Fluency | 0.55 | 0.77 |
| Relevance | 0.69 | 0.51 |
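As a rough sketch of how such dimension-level correlations can be computed (the repository's own computation lives in ablation_corr.py; the file names and column names below are illustrative assumptions):

```python
# Illustrative sketch: correlate tool-based annotations with Excel-based (ablation) annotations
# per dimension. File and column names are assumptions, not the repository's exact layout.
# Assumes both tables are restricted to the same 140 summaries, in the same order.
import pandas as pd
from scipy.stats import pearsonr

tool_df = pd.read_csv("saved_df_ann1.csv")                 # annotations collected with the tool
ablation_df = pd.read_excel("ablation_results_ann1.xlsx")  # annotations collected in Excel

for dim in ["coherence", "consistency", "fluency", "relevance"]:
    r, p = pearsonr(tool_df[dim], ablation_df[dim])
    print(f"{dim}: r = {r:.2f} (p = {p:.3f})")
```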
Most paths are set up for Unix-like systems (macOS and Linux), i.e. path/to/file, but since the experiments were run on Windows, some paths use Windows-style separators (path\\to\\file). Because these are also local paths, they should be adjusted to your own setup if you are not running on Windows.
The analysis comparing the human annotations to the metric scores is located at .\reproduce\analysis\analysis.py; a separate README.md for analysis.py is provided in that directory. The script can be run using the following command in the terminal/command prompt: python .\reproduce\analysis\analysis.py
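To illustrate the kind of comparison analysis.py performs, here is a simplified sketch (not the actual script; the merged table and its column names are hypothetical):

```python
# Simplified sketch of correlating a metric's scores with human judgments.
# NOT the repository's analysis.py; the merged CSV and its columns are hypothetical.
import pandas as pd
from scipy.stats import pearsonr, spearmanr, kendalltau

df = pd.read_csv("scores_and_judgments.csv")  # hypothetical table: one row per summary

metric, dimension = "bertscore_f1", "consistency"  # example pairing
print("Pearson :", pearsonr(df[metric], df[dimension])[0])
print("Spearman:", spearmanr(df[metric], df[dimension])[0])
print("Kendall :", kendalltau(df[metric], df[dimension])[0])
```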
The metrics are stored in .\reproduce\metrics, which contains a separate README.md. The following evaluation metrics were computed: ROUGE, BLEU, METEOR, BERTScore, MoverScore, BARTScore, SMS, Embedding Average, Vector Extrema, FEQA, SummaQA, QuestEval, FactCC, and DAE.
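As an example of obtaining one of these scores, here is a minimal sketch using the rouge-score package (the repository's own metric scripts may be set up differently):

```python
# Minimal example: score a candidate summary against a reference with ROUGE.
# Uses the rouge-score package; the repository's metric scripts may differ.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The customer asks for a refund and the agent explains the return policy."
candidate = "A customer requests a refund; the agent describes how returns work."
scores = scorer.score(reference, candidate)  # score(target, prediction)
for name, result in scores.items():
    print(f"{name}: F1 = {result.fmeasure:.3f}")
```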
Each of the 100 dialogues was summarized by 13 models, in addition to one human reference summary. The present reproduction uses the generated outputs from the original paper. These summary outputs are stored inside .\reproduce\analysis\models_eval_new: each model has its own directory, identified by an ID (see the table below), containing a file called summs.txt with the summaries. Model-ID A contains the human-written reference summaries. The scores acquired through the metric calculations are also stored in these directories (a sketch of reading these files follows the table below).
| Model-ID | Model |
|---|---|
| A | Reference |
| B | LONGEST-3 |
| C | LEAD-3 |
| D | PGN |
| E | Transformer |
| F | BART |
| G | PEGASUS |
| H | UniLM |
| I | CODS |
| J | ConvoSumm |
| K | MV-BART |
| L | PLM-BART |
| M | Ctrl-DiaSumm |
| N | S-BART |
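A sketch of collecting these outputs by iterating over the model directories (the per-line layout of summs.txt is an assumption; check the actual files):

```python
# Sketch: read each model's generated summaries from its summs.txt.
# Assumes one summary per line; verify this against the actual files.
from pathlib import Path

base = Path("reproduce/analysis/models_eval_new")
summaries = {}
for model_dir in sorted(p for p in base.iterdir() if p.is_dir()):
    summs_file = model_dir / "summs.txt"
    if summs_file.exists():
        summaries[model_dir.name] = summs_file.read_text(encoding="utf-8").splitlines()

for model_id, lines in summaries.items():
    print(model_id, "->", len(lines), "summaries")
```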
The original paper’s main claims were reproduced. While not all of the original authors’ findings were replicated (e.g., ROUGE scoring higher for relevance), the correlations between metrics and human judgments showed tendencies similar to those in the original paper. Our annotations correlated with the original annotations at a Pearson correlation of roughly 0.6, which we consider sufficient for reproducing the main claims.