This repository contains the code for our EMNLP 2025 paper: A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations.
Please cite our paper if you find our code useful:
@inproceedings{zhao2025necessary,
title={A Necessary Step toward Faithfulness: Measuring and Improving Consistency in Free-Text Explanations},
author={Zhao, Lingjun and Daum{\'e} III, Hal},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
year={2025},
organization={Association for Computational Linguistics}
}
- Create the conda environment:
conda env create -n new_env_name -f environment.yml
- Download the SFT model (fine-tuned from Llama2-13B on the TripAdvisor dataset) from here, unzip it, and put it in the trained_models folder.
- Compute the PEX score by specifying the review to be classified, the model's prediction, and its explanation (a sketch for calling this script from Python appears after this list). For example:
python3 compute_pex_score.py \
--model_name llama2_13b_finetune_review \
--review "Striking architecture is only the beginning of what can only be described as one of the best hotel experiences of my professional life. The attention to detail in the room and the bathroom is remarkable and the bed is the best night's sleep I've ever experienced in a hotel. Gorgeous LCD tv and Bose wave machine added to my enjoyment. Sofitel is worth the extra \$'s." \
--prediction "Truthful" \
--explanation "[reason1] Specific details: The reviewer provides specific details about the room and the bathroom, such as the remarkable bathroom, which suggests that they experienced it firsthand. [reason2] Enhanced experience: The reviewer suggests that their experience was enhanced by the architecture, in-room technology, and bathroom amenities, which suggests that the hotel's features impressed them."
The output will be displayed as:
PEX score: 4.25
- (Optional) Run the following script to compute PEX consistency on the validation set:
bash scripts/compute_pex.sh
The output will be saved to test_models/compute_pex_example/val_explanations_adjusted_woe.json
The PEX consistency score is stored under the adjusted_woe_score key for each explanation.
Your output should look like the example output file in data/val_explanations_pex.jsonl.
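The scores in these files can be inspected with a few lines of Python. The sketch below reads the example file data/val_explanations_pex.jsonl and averages the adjusted_woe_score values; only that key and the file path are taken from this README, so any other fields in the records are not assumed here.

import json

# Minimal sketch: load PEX consistency scores (the adjusted_woe_score key)
# from the example JSONL output and report their average.
def load_pex_scores(path: str = "data/val_explanations_pex.jsonl") -> list[float]:
    scores = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            # adjusted_woe_score holds the PEX consistency of each explanation.
            scores.append(float(record["adjusted_woe_score"]))
    return scores

if __name__ == "__main__":
    scores = load_pex_scores()
    print(f"{len(scores)} explanations, mean PEX consistency: {sum(scores) / len(scores):.3f}")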
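If you prefer to call the scorer from Python rather than the shell, the sketch below wraps the same compute_pex_score.py invocation with subprocess and parses the "PEX score:" line from stdout. The script name, flags, and output format are taken from the example above; the helper name and error handling are illustrative.

import re
import subprocess

# Minimal sketch: invoke compute_pex_score.py and parse its "PEX score: <float>" output.
def pex_score(review: str, prediction: str, explanation: str,
              model_name: str = "llama2_13b_finetune_review") -> float:
    cmd = [
        "python3", "compute_pex_score.py",
        "--model_name", model_name,
        "--review", review,
        "--prediction", prediction,
        "--explanation", explanation,
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    match = re.search(r"PEX score:\s*([-+]?\d*\.?\d+)", out)
    if match is None:
        raise ValueError(f"Could not find a PEX score in the output:\n{out}")
    return float(match.group(1))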
This example uses the TripAdvisor hotel review dataset (Negative deceptive opinion spam, Ott et al., 2013) and the Llama-2 13B model. The steps are similar for other datasets and models.
- Supervised fine-tuning to improve prediction accuracy
bash scripts/finetune_review.sh
- Generate free-text explanations and compute PEX consistency on the validation and test sets
bash scripts/generate_and_compute_pex.sh
bash scripts/generate_and_compute_pex_testset.sh
- Sample explanations for training the DPO model and compute their PEX consistency (a hedged sketch of turning these scores into preference pairs appears after this list)
bash scripts/sample_review_explanation.sh
bash scripts/sample_review_explanation_trainset.sh
- Train the DPO model, run inference to generate explanations, and compute PEX consistency
bash scripts/train_infer_dpo.sh
- Evaluate faithfulness by training student models
(a) Use explanations generated by the DPO model
bash scripts/finetune_student_model.sh
(b) Use explanations generated by the SFT model
bash scripts/finetune_student_model_sft.sh
(c) No explanation: use only the prediction label as the teaching signal
bash scripts/finetune_student_model_no_explanation.sh
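For reference, the sketch below shows one plausible way the sampled explanations and their PEX consistency scores could be turned into (chosen, rejected) preference pairs of the kind DPO training expects. The pairing logic actually used in the paper lives in scripts/train_infer_dpo.sh and the code it calls; apart from adjusted_woe_score, the field names (prompt, explanation) and file paths here are hypothetical.

import json
from collections import defaultdict

# Hedged sketch: for each input, take the sampled explanation with the highest
# PEX consistency (adjusted_woe_score) as "chosen" and the lowest as "rejected".
# Field names other than adjusted_woe_score are hypothetical; see
# scripts/train_infer_dpo.sh for the pipeline used in the paper.
def build_preference_pairs(sampled_path: str, pairs_path: str) -> None:
    by_prompt = defaultdict(list)
    with open(sampled_path) as f:
        for line in f:
            record = json.loads(line)
            # "prompt" and "explanation" are assumed field names.
            by_prompt[record["prompt"]].append(record)

    with open(pairs_path, "w") as out:
        for prompt, candidates in by_prompt.items():
            if len(candidates) < 2:
                continue  # need at least two samples to form a pair
            ranked = sorted(candidates, key=lambda r: r["adjusted_woe_score"])
            pair = {
                "prompt": prompt,
                "chosen": ranked[-1]["explanation"],   # highest PEX consistency
                "rejected": ranked[0]["explanation"],  # lowest PEX consistency
            }
            out.write(json.dumps(pair) + "\n")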