serinachang5/interactive-eval

Code for "ChatBench: From Static Benchmarks to Human-AI Evaluation" (ACL 2025) by Serina Chang, Ashton Anderson, and Jake Hofman. Contents include:

  • get_mturk_azure_results_by_hit.py: function to pull raw data from our Azure database, where data from our user studies are logged.
  • analyze_results.py: code to process raw user study data and analyze results.
  • make_clean_data.py: code to produce a cleaned version of the answer data for statistical analyses, following the filtering criteria defined in our pre-registration.
  • qa_reasoning.py: implementation of AI-alone methods, with code to run experiments over all questions (see the query sketch after this list).
  • generate_conversations.py: implementation of user simulators.
  • constants_and_utils.py: constants and functions to query models (removed for anonymity) and to load the MMLU / MMLU-Redux datasets (see the loading sketch after this list).
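
Since the repo's model-query functions were removed for anonymity, here is a minimal sketch of what an AI-alone baseline over a multiple-choice MMLU question might look like. It assumes the OpenAI chat API as a stand-in; the model name, prompt format, and helper function are illustrative assumptions, not the paper's actual setup.

```python
# A hedged sketch of an AI-alone baseline; the repo's own query functions
# were removed for anonymity, so this stand-in uses the OpenAI chat API.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def answer_mmlu_question(question: str, choices: list[str]) -> str:
    """Ask the model to pick one of four lettered options (zero-shot)."""
    letters = ["A", "B", "C", "D"]
    options = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    prompt = (
        f"Question: {question}\n{options}\n"
        "Answer with a single letter (A, B, C, or D)."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # hypothetical choice; the paper's models may differ
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```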
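
For dataset loading, a minimal sketch using the Hugging Face `datasets` library is below. The hub IDs ("cais/mmlu", "edinburgh-dawg/mmlu-redux") and config names are assumptions about where these datasets are commonly hosted; the repo's own loader in constants_and_utils.py may differ.

```python
# A minimal loading sketch, assuming the Hugging Face `datasets` library and
# the public hub IDs below; not necessarily what constants_and_utils.py uses.
from datasets import load_dataset

# MMLU: each example has a question, four choices, and an integer answer index.
mmlu = load_dataset("cais/mmlu", "all", split="test")

# MMLU-Redux: a re-annotated MMLU subset that flags erroneous questions;
# the config name here is one assumed per-subject split.
mmlu_redux = load_dataset("edinburgh-dawg/mmlu-redux", "elementary_mathematics", split="test")

example = mmlu[0]
print(example["question"], example["choices"], example["answer"])
```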
