This repository was created to reproduce DialSummEval: Revisiting summarization evaluation for dialogues. It contains the code and data of both the original authors and ourselves.
The link to the reproduction paper will be included here.
- Annotations and Correlations
  - data
    - ablation_results_annX.xlsx (where X is 1-3, for three annotators)
    - original_paper_results.json
    - saved_df_annX.csv (where X is 1-3, for three annotators)
  - utils
    - iaa_util.py
    - ablation_corr.py
    - originalVpresent_corr.py
    - iaa_calculation.py
  - IAA_CORR_calc.ipynb
  - Annotation_tool.ipynb
- data
  - figures (output from analysis.py)
  - original_annotations
    - human_judgment.jsonl
- reproduce
  - analysis
    - models_eval_new
      - ... (scores and summaries for and from each model)
    - analysis.py
  - models_eval_new
  - metrics
    - ... (metrics)
  - analysis
- environment.yml
- requirements.txt
Anaconda environments have been used for this repository. You can set up the environment using the following steps:
- Install Anaconda
- Clone this repository
- Navigate to the repository directory and run the following command in the terminal/command prompt:
  `conda env create -f environment.yml`
- Activate the environment using the following command:
  `conda activate combotenv`
NOTE: It is advised to use separate environments for the metrics.
The required packages are listed in the requirements.txt file. You can install them by running the following command in the terminal/command prompt: pip install -r requirements.txt. Some metrics require packages that are not available on PyPI; this is specified where applicable.
NOTE: see the CUDA installation guides for Windows and Linux.
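Several of the neural metrics rely on PyTorch, so a quick way to verify that the CUDA installation is picked up (assuming PyTorch is installed in the active environment) is:

```python
# Sanity check that PyTorch detects the GPU; run inside the activated environment.
# Assumes PyTorch is installed (several of the metric packages require it).
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```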
The annotation tool is located in the directory .\Annotations and Correlations, which contains a separate README.md. The Jupyter Notebook (Annotation_tool.ipynb) can be used to annotate the dialogue summaries. The reproduced annotations are also stored in this directory.
We conducted an ablation study to examine the impact of the annotation tool on the annotation procedure. 140 summaries (14 summaries for each of 10 randomly selected dialogues) were annotated using the same method as in the original paper, i.e. an Excel sheet in which each model's summaries were displayed on a separate sheet. The results show a strong correlation between the scores obtained through the tool and those obtained through the original annotation process, supporting the use of the tool (a sketch of this correlation computation follows the table below).
| Dimension | Reproduction vs. Original | Full Reproduction vs. Ablation |
|---|---|---|
| Coherence | 0.42 | 0.70 |
| Consistency | 0.77 | 0.66 |
| Fluency | 0.55 | 0.77 |
| Relevance | 0.69 | 0.51 |
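As a rough sketch of how such dimension-level correlations can be computed (the repository's own computation lives in ablation_corr.py; the file names and column names below are illustrative assumptions):

```python
# Illustrative sketch: correlate tool-based annotations with Excel-based (ablation) annotations
# per dimension. File and column names are assumptions, not the repository's exact layout.
# Assumes both tables are restricted to the same 140 summaries, in the same order.
import pandas as pd
from scipy.stats import pearsonr

tool_df = pd.read_csv("saved_df_ann1.csv")                 # annotations collected with the tool
ablation_df = pd.read_excel("ablation_results_ann1.xlsx")  # annotations collected in Excel

for dim in ["coherence", "consistency", "fluency", "relevance"]:
    r, p = pearsonr(tool_df[dim], ablation_df[dim])
    print(f"{dim}: r = {r:.2f} (p = {p:.3f})")
```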
Most paths are set up for Unix-like systems (macOS and Linux), i.e. path/to/file, but since the experiments were run on Windows, some paths use Windows-style separators (path\\to\\file). Because these are also local paths, they should be adjusted to your own setup if you are not running on Windows.
The analysis comparing the human annotations to the metric scores is located at .\reproduce\analysis\analysis.py; a separate README.md for analysis.py is provided in that directory. The script can be run using the following command in the terminal/command prompt: python .\reproduce\analysis\analysis.py
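To illustrate the kind of comparison analysis.py performs, here is a simplified sketch (not the actual script; the merged table and its column names are hypothetical):

```python
# Simplified sketch of correlating a metric's scores with human judgments.
# NOT the repository's analysis.py; the merged CSV and its columns are hypothetical.
import pandas as pd
from scipy.stats import pearsonr, spearmanr, kendalltau

df = pd.read_csv("scores_and_judgments.csv")  # hypothetical table: one row per summary

metric, dimension = "bertscore_f1", "consistency"  # example pairing
print("Pearson :", pearsonr(df[metric], df[dimension])[0])
print("Spearman:", spearmanr(df[metric], df[dimension])[0])
print("Kendall :", kendalltau(df[metric], df[dimension])[0])
```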
The metrics are stored in .\reproduce\metrics, which contains a separate README.md. The following evaluation metrics were computed: ROUGE, BLEU, METEOR, BERTScore, MoverScore, BARTScore, SMS, Embedding Average, Vector Extrema, FEQA, SummaQA, QuestEval, FactCC, and DAE.
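As an example of obtaining one of these scores, here is a minimal sketch using the rouge-score package (the repository's own metric scripts may be set up differently):

```python
# Minimal example: score a candidate summary against a reference with ROUGE.
# Uses the rouge-score package; the repository's metric scripts may differ.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
reference = "The customer asks for a refund and the agent explains the return policy."
candidate = "A customer requests a refund; the agent describes how returns work."
scores = scorer.score(reference, candidate)  # score(target, prediction)
for name, result in scores.items():
    print(f"{name}: F1 = {result.fmeasure:.3f}")
```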
Each of the 100 dialogues was summarized by 13 models, in addition to one human reference summary. The present reproduction uses the generated outputs from the original paper. These summary outputs are stored inside .\reproduce\analysis\models_eval_new: each model has its own directory, identified by an ID (see the table below), containing a file called summs.txt with the summaries. Model-ID A contains the human-written reference summaries. The scores acquired through the metric calculations are also stored in these directories (a sketch of reading these files follows the table below).
| Model-ID | Model |
|---|---|
| A | Reference |
| B | LONGEST-3 |
| C | LEAD-3 |
| D | PGN |
| E | Transformer |
| F | BART |
| G | PEGASUS |
| H | UniLM |
| I | CODS |
| J | ConvoSumm |
| K | MV-BART |
| L | PLM-BART |
| M | Ctrl-DiaSumm |
| N | S-BART |
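A sketch of collecting these outputs by iterating over the model directories (the per-line layout of summs.txt is an assumption; check the actual files):

```python
# Sketch: read each model's generated summaries from its summs.txt.
# Assumes one summary per line; verify this against the actual files.
from pathlib import Path

base = Path("reproduce/analysis/models_eval_new")
summaries = {}
for model_dir in sorted(p for p in base.iterdir() if p.is_dir()):
    summs_file = model_dir / "summs.txt"
    if summs_file.exists():
        summaries[model_dir.name] = summs_file.read_text(encoding="utf-8").splitlines()

for model_id, lines in summaries.items():
    print(model_id, "->", len(lines), "summaries")
```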
The original paper’s main claims were reproduced. While not all of the original authors’ findings were replicated (e.g., ROUGE scoring higher for relevance), the correlations between metrics and human judgments showed tendencies similar to those in the original paper. Our annotations correlated with the original annotations at a Pearson correlation of roughly 0.6, which we consider sufficient for reproducing the main claims.