Code repository for the paper "Accelerating Unbiased LLM Evaluation via Synthetic Feedback"
The experiments are divided into four parts, corresponding to the four directories below. Please replicate our results in the following order:
- (Optional) Synthetic evaluator finetuning. You can skip this step if you run Control Variates Evaluation with an off-the-shelf evaluator. See instructions under finetune/.
- Collect Synthetic Evaluations. See instructions under evaluation/.
- Compute the averaged human annotation saving ratio. See instructions under stats/.
- Run control variates evaluation to visualize variance and bias (a minimal sketch of the estimator follows this list). See instructions under bootstrap/.
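
For a concrete picture of the estimator before diving into the directories, the sketch below shows a minimal control variates estimate in Python. It is an illustration under standard assumptions (paired human and synthetic scores on a small annotated subset, coefficient Cov/Var, annotation saving roughly equal to the squared correlation), not the exact implementation in bootstrap/ or stats/; all variable names and data are made up.

```python
# Minimal control variates sketch. The data and names below are illustrative,
# not taken from this repository.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic evaluator scores for the full evaluation set (cheap, abundant).
synthetic_all = rng.normal(0.6, 0.2, size=5000)
# Human scores for a small annotated subset, correlated with the synthetic scores.
subset_idx = rng.choice(5000, size=200, replace=False)
synthetic_subset = synthetic_all[subset_idx]
human_subset = synthetic_subset + rng.normal(0.0, 0.1, size=200)

# Plain Monte Carlo estimate from human annotations alone.
mc_estimate = human_subset.mean()

# Control-variate coefficient: Cov(human, synthetic) / Var(synthetic).
c = np.cov(human_subset, synthetic_subset)[0, 1] / synthetic_subset.var(ddof=1)

# Control variates estimate: correct the human mean with the synthetic-score gap.
cv_estimate = mc_estimate - c * (synthetic_subset.mean() - synthetic_all.mean())

# Squared correlation gives the asymptotic variance reduction, i.e. roughly the
# fraction of human annotations that can be saved at equal precision.
rho2 = np.corrcoef(human_subset, synthetic_subset)[0, 1] ** 2
print(f"MC estimate: {mc_estimate:.4f}, CV estimate: {cv_estimate:.4f}")
print(f"Approx. fraction of human annotations saved: {rho2:.2%}")
```

The key quantity is the correlation between human and synthetic scores: the stronger the synthetic evaluator tracks human judgment, the larger the variance reduction and the fewer human annotations are needed for the same confidence.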
Code associated with GPT-4 evaluation is partially based on lm-sys/FastChat.