GitHub - serinachang5/interactive-eval

Code for "ChatBench: From Static Benchmarks to Human-AI Evaluation" (ACL 2025) by Serina Chang, Ashton Anderson, and Jake Hofman. Contents include:

get_mturk_azure_results_by_hit.py: function to pull raw data from the Azure database where our user study data are logged.
analyze_results.py: code to process raw user study data and analyze results.
make_clean_data.py: code to make a clean version of answers for statistical analyses, following the filtering criteria defined in our pre-registration.
qa_reasoning.py: implementation of AI-alone methods; code to run experiments over all questions.
generate_conversations.py: implementation of user simulators.
constants_and_utils.py: constants, plus functions to query models (removed for anonymity) and to load the MMLU and MMLU-Redux datasets.
