Author(s) This pre-registration is currently anonymous to enable blind peer-review. It has 3 authors.
Pre-registered on 2025/02/07 15:15 (PT)
1) Have any data been collected for this study already? No, no data have been collected for this study yet.
2) What's the main question being asked or hypothesis being tested in this study? How do evaluation conclusions change when we measure LLM performance based on their interactions with users, as opposed to on their own, as standard LLM benchmarks do? We explore this question across LLMs and question domains.
3) Describe the key dependent variable(s) specifying how they will be measured. The key dependent variable is question-answering accuracy, which will be measured by comparing the user's selected answer to the ground-truth correct answer.
4) How many and which conditions will participants be assigned to? Participants will engage in a two-phase experiment. In Phase 1, participants answer questions on their own without the help of AI. In Phase 2, participants answer questions with the help of an "AI chatbot" (powered by an LLM). There are two conditions, which only affect Phase 2: in "answer-first", participants try to answer each question on their own first before answering the question with AI; in "direct-to-AI", participants directly use AI. Participants are randomly assigned to one of the two conditions.
We test three domains (math, physics, and moral scenarios) and two LLMs (GPT-4o and Llama-3.1-8b). When a participant starts the task, they are randomly assigned to one of the LLMs, one of the domains, and a batch of questions for that domain (described below). They are also randomly assigned to either "answer-first" or "direct-to-AI" for Phase 2. Within the batch, questions are randomly assigned to Phase 1 and Phase 2.
We use questions from five MMLU datasets: elementary mathematics, high school mathematics, and college mathematics for the math domain; conceptual physics for the physics domain; and moral scenarios for the final domain. To select questions for the study, we included two quality checks: (1) we used annotations from MMLU-Redux, where the authors manually reviewed a random sample of 100 questions from each dataset and annotated whether they contained errors; (2) we ran OpenAI's advanced o1 model over the questions and manually reviewed the ones where o1's answer disagreed with MMLU's. We kept all questions that passed MMLU-Redux's and our own inspection (aided by o1), resulting in 96 questions in elementary math, 98 in high school math, 95 in college math, 90 in conceptual physics, and 95 in moral scenarios.
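To illustrate quality check (2), here is a minimal sketch of how disagreements between o1 and the MMLU labels could be flagged for manual review; the `ask_o1` helper and the data fields are illustrative placeholders, not the exact code we will run.

```python
# Quality check (2): flag MMLU questions where o1 disagrees with the dataset
# label, so that only the disagreements need manual review.
# `ask_o1` is a hypothetical helper wrapping whatever API client is used.

def flag_disagreements(questions, ask_o1):
    """questions: list of dicts with 'question', 'choices', and gold 'answer' letter."""
    needs_review = []
    for q in questions:
        options = "\n".join(f"{letter}. {text}"
                            for letter, text in zip("ABCD", q["choices"]))
        prompt = f"{q['question']}\n{options}\nAnswer with the letter only."
        o1_answer = ask_o1(prompt).strip().upper()[:1]
        if o1_answer != q["answer"]:
            needs_review.append(q)  # inspect these manually before keeping/dropping
    return needs_review
```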
We organized the questions into batches of size 12. For the math domain, each batch consists of 5 elementary, 5 high school, and 2 college questions. Based on the total number of valid questions, we can construct 19 math batches, resulting in 95 elementary, 95 high school, and 38 college questions in our final dataset. For conceptual physics and moral scenarios, each batch consists of 12 questions from that domain, so we are able to make 7 batches for each of those domains, resulting in 84 questions per domain in our final dataset.
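The batching logic can be sketched as follows, assuming the filtered questions are available as per-dataset lists; the function and variable names are illustrative.

```python
import random

def make_math_batches(elem, hs, college, n_batches=19, seed=0):
    """Each math batch: 5 elementary + 5 high school + 2 college questions."""
    rng = random.Random(seed)
    for qs in (elem, hs, college):
        rng.shuffle(qs)
    return [elem[5 * i:5 * i + 5] + hs[5 * i:5 * i + 5] + college[2 * i:2 * i + 2]
            for i in range(n_batches)]

def make_single_domain_batches(questions, batch_size=12, seed=0):
    """Conceptual physics / moral scenarios: batches of 12 from a single dataset."""
    rng = random.Random(seed)
    rng.shuffle(questions)
    n_batches = len(questions) // batch_size  # 7 batches of 12 -> 84 questions kept
    return [questions[batch_size * i:batch_size * (i + 1)] for i in range(n_batches)]
```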
5) Specify exactly which analyses you will conduct to examine the main question/hypothesis. Our main question is whether evaluation conclusions change significantly between AI-alone vs. user-AI (for the same LLM and dataset), and whether they change significantly between user-alone vs. user-AI (for the same dataset and for each LLM).
AI-alone vs. User-AI. We estimate AI-alone performance by testing the LLM in isolation on the selected MMLU questions. We test AI-alone performance in two ways: (1) "letter-only", which requires the LLM to answer with only the letter of the answer option (corresponding to standard benchmark practices, e.g., HELM); (2) "free-text", which simply copies and pastes the question, allows the LLM to generate an open-ended response, and then extracts an answer from the response. For "letter-only", we try zero-shot and few-shot, using the 5 MMLU questions from the "dev" split for that dataset as in-context examples.
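For concreteness, a sketch of how the two AI-alone prompt formats could be constructed; the exact instruction wording is an assumption for illustration, not the registered prompt text.

```python
LETTERS = "ABCD"

def format_question(q):
    options = "\n".join(f"{l}. {c}" for l, c in zip(LETTERS, q["choices"]))
    return f"{q['question']}\n{options}"

def letter_only_prompt(q, dev_examples=()):
    """Zero-shot if dev_examples is empty; few-shot passes the 5 MMLU dev questions."""
    shots = "".join(f"{format_question(d)}\nAnswer: {d['answer']}\n\n"
                    for d in dev_examples)
    return (f"{shots}{format_question(q)}\n"
            "Answer with a single letter (A, B, C, or D) only.\nAnswer:")

def free_text_prompt(q):
    """Copy and paste the question verbatim and let the model respond freely."""
    return format_question(q)
```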
For each AI-alone method and MMLU question, we sample 50 answers per LLM (with temperature=0.7) and compute the accuracy of each AI-alone method for each LLM.
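A sketch of the per-question accuracy estimate, with `query_llm` and `extract_answer` as hypothetical stand-ins for the model call and the answer-extraction step:

```python
N_SAMPLES = 50

def question_accuracy(q, prompt, query_llm, extract_answer, temperature=0.7):
    """Fraction of sampled responses whose extracted answer matches the gold letter."""
    correct = 0
    for _ in range(N_SAMPLES):
        response = query_llm(prompt, temperature=temperature)
        if extract_answer(response) == q["answer"]:
            correct += 1
    return correct / N_SAMPLES

# Dataset-level accuracy for one AI-alone method and one LLM is the mean of
# question_accuracy over all questions in that dataset.
```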
Within each dataset and LLM, we will compare the accuracy of AI-alone methods to User-AI methods across all questions in that dataset.
User-alone vs. User-AI. We have user-alone data from all users in Phase 1 and from "answer-first" users in Phase 2, and we have user-AI data from all Phase 2 users in both the "answer-first" and "direct-to-AI" conditions.
Within each dataset and LLM, we will compare:
1. The accuracy of user-alone responses to user-AI "direct-to-AI" responses
2. The accuracy of user-alone responses to user-AI "answer-first" responses
All of our tests will use a significance level of 0.05.
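As one possible instantiation of these comparisons (shown only for illustration; we do not commit here to this specific test statistic), a two-proportion z-test over correct/incorrect answer counts could look like:

```python
from statsmodels.stats.proportion import proportions_ztest

def compare_accuracies(correct_a, n_a, correct_b, n_b, alpha=0.05):
    """Illustrative two-proportion z-test comparing two evaluation modes' accuracies."""
    stat, p_value = proportions_ztest([correct_a, correct_b], [n_a, n_b])
    return p_value, p_value < alpha
```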
6) Describe exactly how outliers will be defined and handled, and your precise rule(s) for excluding observations. We will exclude data from participants who fail to correctly answer the "attention check" question, which is a simple math problem ("What is 5+2?"). The attention check appears at the end of Phase 1, in the same place for all participants. We will only analyze data from participants who complete the entire experiment and will exclude data from participants who complete the experiment more than once. We will also exclude any participants whose prompts to the LLM consistently indicate low effort (e.g., short prompts like "hi" or "next" that have nothing to do with the question at hand).
7) How many observations will be collected or what will determine sample size? No need to justify decision, but be precise about exactly how the number will be determined. We aim to recruit 650 participants from Prolific within two weeks after launching the experiment. We will randomize to target 60% of participants for math, 20% for conceptual physics, and 20% for moral scenarios. If we are unable to recruit enough participants within 5 days after initially launching the experiment, we will terminate the experiment at whatever sample size we have collected at that time.
8) Anything else you would like to pre-register? (e.g., secondary analyses, variables collected for exploratory purposes, unusual analyses planned?) As a secondary analysis, we will also explore correlations in accuracy. To compute the AI-alone vs. user-AI correlation, we will compute each one's average accuracy per question in the dataset, then compute the Pearson correlation over the question-level accuracies. Correlations serve as a complementary analysis to marginal means, revealing potentially different information: for example, even if the means are similar, the two approaches may differ on which questions they make mistakes on; conversely, even if the means are different, the two approaches may be correlated on where they make more mistakes, with one just shifted down from the other.
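A sketch of this correlation computation, assuming per-question average accuracies have already been computed for each evaluation mode:

```python
from scipy.stats import pearsonr

def question_level_correlation(acc_a, acc_b):
    """Pearson correlation between two modes' per-question average accuracies.

    acc_a and acc_b map question id -> average accuracy for that question;
    only questions present in both are used.
    """
    shared = sorted(set(acc_a) & set(acc_b))
    xs = [acc_a[q] for q in shared]
    ys = [acc_b[q] for q in shared]
    r, p_value = pearsonr(xs, ys)
    return r, p_value
```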
We will also explore details of the user interactions and AI's responses, to better understand the nature of users' initial prompts, the exchange between the user and the AI, and the AI's final response.
We may also do exploratory analyses of: self-reported levels of confidence before answering the question (collected before all user-alone and user-AI answers), characteristics of the user-AI conversations and how they vary across conditions, and free-text feedback provided by participants at the end of the study on whether they found the AI helpful and whether they noticed any mistakes that it made.