Beyond Metrics: Variability in LLM Evaluation Frameworks
M42
arXiv:2407.21072v1 [cs.AI] 29 Jul 2024
Abstract
As large language models (LLMs) continue to evolve, the need for robust and standardized evaluation
benchmarks becomes paramount. Evaluating the performance of these models is a complex challenge
that requires careful consideration of various linguistic tasks, model architectures, and benchmarking
methodologies. In recent years, various frameworks have emerged as noteworthy contributions to the
field, offering comprehensive evaluation tests and benchmarks for assessing the capabilities of LLMs
across diverse domains. This paper provides an exploration and critical analysis of some of these
evaluation methodologies, shedding light on their strengths, limitations, and impact on advancing the
state-of-the-art in natural language processing.
1 Introduction
The rapid advancements in the field of natural language processing (NLP) have been fueled by the development of
increasingly sophisticated Large Language Models (LLMs). From OpenAI’s GPT series [1] and Google’s BARD [2, 3]
to a plethora of open-source models such as Llama [4, 5], the Falcon series [6], Jais [7] and others [8, 9, 10], these
models showcase remarkable language understanding and generation capabilities. As the abilities of LLMs continue to
expand, so does the demand for robust evaluation frameworks that can assess their linguistic aptitude and generalization
capabilities [11, 12, 13]. A thorough evaluation is essential not only to benchmark the progress in model development
but also to identify and address potential biases, handle ethical concerns, and ensure the responsible deployment of
these models in real-world applications [14, 15, 16, 17, 18].
Figure 1: Performance disparities (δ) of Llama2 (7B, 13B and 70B) and Mistral-7B on various benchmark datasets (in 0-shot setting). δ values are shown using lighter colors and they represent the variation observed in the accuracy metric across the benchmarks due to different evaluation frameworks.
However, evaluating LLMs poses a unique set of challenges due to their parameter space, diverse applications, and
sensitivity to the choice of evaluation tasks. Traditional metrics often fall short in capturing the nuances of language
understanding, fluency, and context coherence exhibited by these models [13, 19]. In addition, model performance is
often affected by minor implementation details. It is exceedingly challenging to anticipate that results obtained from
one codebase will seamlessly transfer to another [20]. Frequently, research papers fail to provide the necessary code
and/or sufficient details to replicate their evaluations fully. In response to this, the research community has witnessed
the emergence of multiple evaluation frameworks for tracking the progress on specific NLP tasks across different
benchmarks and to encourage multi-task learning with diverse datasets. To compare the performance of different
LLMs, various leaderboards have also been created to rank LLMs according to their performance metrics (or scores) on
existing or new evaluation benchmarks. For example, the GLUE [21] and SuperGlue [22] benchmarks are widely used
by language model developers. In addition, BIG-bench [23], HELM [19], EleutherAI’s language model evaluation
harness [24] and OpenCompass [25] frameworks have been introduced to study the capabilities of language models.
As the demand for rigorous evaluation benchmarks intensifies, it becomes increasingly evident that the metrics employed
are pivotal components. While these benchmarks provide invaluable insights into the capabilities of language models
across diverse tasks, it is important to delve into the intricacies of metric calculation (or computation) to discern the
subtleties of model behaviour. The significance of this aspect arises from the fact that metrics, often treated as objective
measures, are subject to specific assumptions and methodologies during their calculation. These assumptions can
significantly influence the interpretation of model performance and, consequently, the reliability of benchmark results.
As the field progresses, the need for transparent and standardized reporting of metric calculation procedures becomes
paramount to foster reproducibility and comparability across different studies.
This paper aims to delve into the different metric calculation methodologies employed by some of the most popular frameworks, providing some details on how these nuances can affect the measured performance of LLMs and the
interpretability of evaluation results. Specifically, we describe an investigation into the accuracy metrics employed by
prominent evaluation frameworks, detailing the methodology for their calculation. Our focus extends to evaluating four
widely recognized open-source language models across question-answering datasets structured as multiple-choice scenarios, with the aim of providing a better understanding of their performance on these tasks.
2 Background
In this investigation, we focus on multiple-choice question tests, in which for each question only one of the provided
choices is the correct answer (see Box 1). This represents a rather simple task, in contrast to open-ended questions, for
example. However, our examination reveals that even within this seemingly simple problem formulation, significant
variability exists, stemming from subtle implementation details and disparities. Notably, the majority of these disparities
arise from differences in the methods used to select the final answer option from the model’s output. To help us
formulate the problem, we focus on evaluating multiple-choice tasks using autoregressive language models, such as GPT (Generative Pre-trained Transformer)-like architectures, which underpin the models mentioned above.
Box 1: Example of a multiple-choice question (from the MMLU dataset's Microeconomics task).
One of the reasons that the government discourages and regulates monopolies is that
Choices:
A. producer surplus is lost and consumer surplus is gained.
B. monopoly prices ensure productive efficiency but cost society allocative efficiency.
C. monopoly firms do not engage in significant research and development.
D. consumer surplus is lost with higher prices and lower levels of output.
Correct answer: D
We frame the evaluation of multiple-choice questions (MCQs) using LLMs mathematically as follows. Let $Q$ be the set of multiple-choice questions, each denoted as $q_i$, where $i$ ranges from 1 to $N$ and $N$ represents the total number of questions. For each question $q_i$, there exists a set of answer choices $A_i = \{a_{i1}, a_{i2}, \ldots, a_{ik}\}$, where $k$ is the number of choices for question $q_i$, with $2 \le k \le 5$. The goal is to assess the model's performance in selecting the correct answer from the provided choices. Let $c_i$ be the correct answer for question $q_i$, where $c_i \in A_i$. The model's predicted answer for $q_i$ is denoted as $\hat{c}_i$. We note that ideally $\hat{c}_i \in A_i$, but this may not always be the case; i.e., the model may generate an answer that is not part of the set of choices $A_i$. From this point onward, to simplify our notation, we refer to a single question of a dataset as $q$ and to its set of answer choices as $A = \{a_1, a_2, \ldots, a_k\}$; i.e., we omit the index $i$.
LLMs operate by accepting a textual input, i.e., a question/instruction (also called a 'prompt'), which is segmented into tokens (typically, sub-words). Let the input prompt be represented as $q_{0:m}$, where $m + 1$ is the number of tokens of the input prompt. From this tokenized input, the language model generates a conditional probability distribution $P(q_{m+1} \mid q_{0:m})$ over the vocabulary of tokens of the model for predicting the next token. This allows LLMs to estimate the likelihood of any token as a continuation of the input prompt. By appending the selected token to the prompt and iterating this process, the model generates subsequent tokens, enabling the creation of entire sentences as continuations of the initial input prompt. Let $q_{m+1:n_k}$ be the $k$-th possible continuation sequence generated by the LLM, with $n_k - m$ tokens.
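The paper does not provide code, but the two quantities above (the next-token distribution and the log-likelihood of a candidate continuation) can be sketched as follows with a Hugging Face causal language model; the model name and helper functions are illustrative assumptions, not part of the original work.

```python
# A minimal sketch (not from the paper): obtaining the next-token distribution
# P(q_{m+1} | q_{0:m}) and the log-likelihood of a candidate continuation with a
# Hugging Face causal LM. Model name and helper names are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "mistralai/Mistral-7B-v0.1"  # any causal LM; used here only for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def next_token_log_probs(prompt: str) -> torch.Tensor:
    """Log-probabilities of every vocabulary token as the next token after the prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # shape: (1, m + 1, vocab_size)
    return torch.log_softmax(logits[0, -1], dim=-1)

def continuation_log_likelihood(prompt: str, continuation: str) -> float:
    """Sum of log P(q_j | q_{0:j-1}) over the continuation tokens only.

    Assumes the prompt tokens form a prefix of the tokenization of
    prompt + continuation (true up to tokenizer boundary effects).
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position j - 1 of the logits predicts token j of the input.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += log_probs[pos - 1, token_id].item()
    return total
```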
Hence, for model evaluation, two primary approaches emerge:
• Token probability comparison: this involves obtaining the probabilities $P(q_{m+1} \mid q_{0:m})$ for given sets of tokens and comparing these probabilities for the predefined choices ($A$); i.e., the evaluation metric can be constructed by assessing the relative likelihoods of various token groups as continuations of the prompt.
• Text generation comparison: one can obtain a text generation from the model using the iterative process described above and then compare the generated text to the various predefined possible choices; this provides a holistic assessment of the model's ability to generate coherent and contextually relevant continuations.
Both token-level assessments and holistic text generation analyses underpin popular evaluation frameworks. In
the following sections, we detail three of these frameworks and their implementations: OpenCompass, EleutherAI’s
language model evaluation harness, and Stanford University’s HELM.
2.1 OpenCompass
OpenCompass, an open-source evaluation framework for language models (the framework also supports multimodal models, which are not within the scope of this investigation), uses the token probability comparison approach for extracting the prediction of the model [25]. Specifically, the probabilities predicted by the model for all possible answers are compared, such that the probability for the option $a_k$ is given by $P(q_{m+1} = a_k \mid q_{0:m})$; in the example above, $a_k$ ∈ {"A", "B", "C", "D"}, i.e., the first letter corresponding to each choice. We note that the framework uses perplexity as the key metric, rather than relying solely on (log) likelihood.
In order to reduce the likelihood of the model generating responses that fall outside the intended range of answers, a
“few shots” approach is typically used, in which the prompt is augmented with one or more instances/examples with their
correct answers as well (see Appendix B). In instances where the model might have otherwise generated an unrelated
word (or token), the inclusion of a few shots attempts to ensure that the model is guided by known examples and by
better understanding of the expected behaviour. The approach of incorporating a handful of examples in the prompt
typically improves the model’s performance and is a standard evaluation method, as demonstrated in assessments such
as MMLU, where five shots (i.e., five examples) are prepended to each prompt for evaluation across various benchmarks
[26].
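To make the option-letter comparison concrete, the following sketch (our illustration, not the OpenCompass implementation) scores each answer letter by the log probability the model assigns to it as the next token and picks the argmax; it reuses the model, tokenizer and next_token_log_probs helper from the sketch earlier in this section.

```python
# Illustrative sketch of letter-level scoring (not the OpenCompass implementation):
# compare log P(next token = "A"/"B"/"C"/"D" | prompt) and return the argmax letter.
# Reuses `tokenizer` and `next_token_log_probs` from the earlier sketch.
def predict_by_option_letter(prompt: str, letters=("A", "B", "C", "D")) -> str:
    log_probs = next_token_log_probs(prompt)
    scores = {}
    for letter in letters:
        # Score the first token of " A", " B", ...; the leading space matters for many BPE vocabularies.
        token_id = tokenizer(" " + letter, add_special_tokens=False).input_ids[0]
        scores[letter] = log_probs[token_id].item()
    return max(scores, key=scores.get)
```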
2.2 EleutherAI's Language Model Evaluation Harness
The evaluation harness framework from EleutherAI [24] also uses the token probability comparison methodology. In this case, however, it computes the likelihood of the entire continuation sequence using the process described above (i.e., it uses the full answer sequence, which contains the letter followed by the text of the answer). In its simplest method, for each choice $a_k \in A$, the likelihood is determined using
$$\sum_{j=m+1}^{n_k} \log P(q_j \mid q_{0:j-1}),$$
where the aggregation of the probabilities is achieved by summing the log of the individual probabilities for numerical stability (this approach is used in [24] in all multiple-choice tasks and is tagged as "acc"). Conceptually, this entails determining the probability of a generated sequence, sampled from the given
prompt, incorporating the specific continuation (or choice) under consideration (in this case, of the entire answer).
The few-shot prompt is generally similar to the approach described in the previous section. Although straightforward, this method does not take into account (possible) differences in length between the predefined choices, which may be substantial (i.e., $n_k$ can vary substantially for each $k$). Because token log probabilities are negative, summing over more tokens tends to lower the total, which can bias the model toward favoring shorter choices.
In order to tackle this problem, the framework’s authors introduced a normalization step in which the overall likelihood
is divided by a measure of the length of the answer sequence. This can be accomplished in two ways: either using
token-length normalization or byte-length normalization. In the former, the normalized likelihood of the k-th option is
determined using the average log probability per token:
$$\sum_{j=m+1}^{n_k} \log P(q_j \mid q_{0:j-1}) \, / \, (n_k - m).$$
However, the authors noted that this approach is not tokenization agnostic, which means that two models with distinct
tokenization procedures (and/or vocabulary sizes), despite assigning the same log likelihood to every single input string,
may yield different token-length normalized log likelihoods.
The byte-length normalization approach attempts to normalize the likelihood by computing the average log probability
per character, which ensures that it is tokenization agnostic (this approach is also used in [24] in all multiple-choice tasks and is tagged as "acc_norm"). In this case, the normalized likelihood of the $k$-th option is determined using
$$\sum_{j=m+1}^{n_k} \log P(q_j \mid q_{0:j-1}) \Big/ \sum_{j=m+1}^{n_k} L_{q_j},$$
where $L_{q_j}$ is the number of bytes represented by the token $q_j$. In practice, rather than using the number of bytes, the number of characters in $a_k$ is used for normalization.
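The following sketch (again our illustration, not the eval-harness code) scores each full answer sequence with the raw summed log-likelihood, its token-length-normalized variant, and a character-length normalization as a stand-in for byte-length normalization; it reuses the continuation_log_likelihood helper sketched earlier in this section.

```python
# Illustrative sketch (not the eval-harness code) of scoring full answer sequences:
# raw summed log-likelihood, token-length normalization, and character-length
# normalization as a proxy for byte-length normalization.
# Reuses `tokenizer` and `continuation_log_likelihood` from the earlier sketch.
def score_choices(prompt: str, choices: list[str]) -> dict:
    results = {}
    for choice in choices:
        continuation = " " + choice
        loglik = continuation_log_likelihood(prompt, continuation)
        n_tokens = len(tokenizer(continuation, add_special_tokens=False).input_ids)
        results[choice] = {
            "raw": loglik,                           # sum of token log-probabilities
            "token_norm": loglik / n_tokens,         # average log-probability per token
            "byte_norm": loglik / len(continuation)  # characters used as a proxy for bytes
        }
    return results

def pick(results: dict, metric: str) -> str:
    """Return the choice with the highest score under the given metric."""
    return max(results, key=lambda choice: results[choice][metric])
```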
2.3 HELM
Another popular framework is the Holistic Evaluation of Language Models (or HELM) project [19]. The method
used for evaluating the model using MCQs in HELM is different from the implementations described in the previous
two approaches. Instead of comparing the token probabilities from the given answer choices, HELM utilizes the
model’s next token output probabilities to generate text. This generated text is then compared to the expected answer.
Specifically for MCQ tasks, HELM implements metric functions such as exact match to assess the correspondence
between the generated text and the correct answer. This approach allows for a more natural evaluation of the model’s
understanding and ability to generate relevant responses, rather than simply selecting the most probable option from a
predefined set of choices.
While the evaluation methodology takes a distinct approach, the few-shot prompt remains generally similar. However, for a given instance, if the model assigns the highest probability to a token that deviates from the intended range of answers, i.e., one that is not part of the set of choices, the model's response is deemed incorrect and that particular instance receives no score. In other words, the evaluation process hinges on the model's ability to prioritize the correct tokens within the specified answer choices. This is substantially different from the methods described above, in which only the probabilities associated with the given set of answers are included for computing the model's performance.
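To illustrate the generation-based route, the sketch below (our illustration, not HELM's code) greedily generates a short continuation with the model loaded in the Section 2 sketch and applies a simplified, prefix-style comparison against the reference answer; HELM's own exact-match metrics compare the (possibly normalized) generated and reference strings.

```python
# Illustrative sketch of generation-based scoring in the spirit of HELM's exact-match
# evaluation (not HELM's code): greedily generate a short continuation and apply a
# simplified, prefix-style comparison against the reference answer.
# Reuses `tokenizer` and `model` from the sketch in Section 2.
import torch

def generation_matches(prompt: str, reference: str, max_new_tokens: int = 8) -> bool:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    generated = tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    # A generated answer outside the intended range simply fails to match and scores zero.
    return generated.strip().lower().startswith(reference.strip().lower())
```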
We also note that frameworks such as HELM [19] and Langtest [27] attempt to enhance the evaluation process of LLMs
by offering a broader array of tasks and metrics (in addition to accuracy). These frameworks offer a comprehensive set
of evaluation criteria that delves into various aspects of language understanding and generation, providing additional
metrics such as calibration, robustness, fairness, bias, toxicity, and efficiency. Such metrics are out of the scope of this
study.
3 Methods
In this study, we focus on two prominent evaluation frameworks, OpenCompass and Eval harness. As described above,
these frameworks utilize a token-probability comparison method to assess LLMs in the context of multiple-choice
question answering benchmarks. Our investigation centers on providing a detailed account of the accuracy metrics
obtained through these frameworks, with particular attention to four widely recognized LLMs, for quantifying the
performance of such models. We analyse the results across four popular benchmarks, with the goal of offering a better
understanding of the performance and variability in performance of the selected models within the defined evaluation
paradigms.
In this section, we describe the test datasets used to evaluate the models on different tasks. When selecting these
datasets, preference was given to widely acknowledged sources used across various domains, such as general and
medical contexts. Additionally, these encompass a range of context lengths and types of reasoning. Table 1 includes a
summary of the datasets used in this study.
HellaSwag. This dataset was introduced in 2019 to test commonsense natural language inference about physical situations [28]. HellaSwag (short for Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations) employed "Adversarial Filtering", in which the idea is to (machine) generate challenging incorrect answers for a multiple-choice test setting (see Box A1). These incorrect answers are dubbed 'adversarial endings'. Though humans score above 95.6% on HellaSwag, initial state-of-the-art models struggled (with accuracies lower than 48% back in 2019).
MedQA. We included a domain-specific dataset, MedQA, originally designed for addressing medical problems [29].
The dataset comprises free-form multiple-choice question-answers which were sourced from professional medical
board exams (Box A2).
MMLU. The Massive Multitask Language Understanding (MMLU) benchmark [26] aimed to introduce a comprehensive assessment of LLMs across 57 subjects, including elementary mathematics, US history, computer science, law, and others (Box A3).
OpenBookQA. Mihaylov et al. (2018) [30] presented a question-answering dataset that attempts to probe a deeper
understanding of a topic (Box A4). Modeled after open book exams for assessing human understanding of a subject, it
contains questions that require multi-step reasoning, use of additional common and commonsense knowledge, and rich
text comprehension.
We employed two distinct families of language models, namely Llama2 [5] and Mistral [31], to evaluate their performance within the context of the OpenCompass and Eval harness frameworks. These pretrained generative text models have been shown to perform well across various benchmarks.
Llama2 models, part of the family of language models developed by Meta AI, are a set of LLMs of varying size, ranging from 7 billion to 70 billion parameters. They are auto-regressive language models based on the transformer decoder architecture, with some notable differences from models like GPT-3. For example, Llama2 employs the SwiGLU activation function rather than ReLU and uses rotary positional embeddings in lieu of absolute learnable positional embeddings [5]. The latest release of Llama2 also introduces architectural refinements geared towards enhanced performance, extending the context length up to 4,096 tokens. The bigger models (70B) use Grouped-Query Attention (GQA) to better leverage long sequences and improve inference scalability.
Mistral-7B (v0.1), introduced by Mistral AI, is a 7.3-billion-parameter model with an architecture similar to that of Llama2 [31]. It also uses grouped-query attention, which enhances the inference process by caching key and value vectors for previously decoded tokens in the sequence, thereby reducing processing time. In addition, it uses a sliding-window attention mechanism that replaces full attention, which is characterized by quadratic compute cost. In this mechanism, each token can attend to at most 4,096 tokens from the preceding layer, resulting in a linear compute cost.
This implementation enhances Mistral-7B's capability to handle long sequences (up to 32k tokens), allowing higher layers to access historical information beyond the 4,096-token context window size.
In this investigation, from the family of Llama2 models we use the 7-billion (Llama2-7B), 13-billion (Llama2-13B) and
70-billion-parameter models (Llama2-70B) along with Mistral-7B.
The assessment of the four models on the aforementioned datasets involves the utilization of the accuracy metrics derived from OpenCompass and Eval harness. The final accuracy for a dataset is determined by the percentage of questions answered correctly. To determine this accuracy, we can follow different methods to identify the option selected by a model as correct (as described in the previous section). Our focus centers on the accuracy metrics computed using the different methods employed by different evaluation frameworks. We consider the following methods for this purpose:
• OpenCompass' accuracy (denoted OC accuracy): this approach involves extracting the model's prediction by assessing the next-token probabilities and determining whether the selected choice (the one with the highest likelihood) is the correct answer (as described above and used in [25]);
• Raw (unnormalized) accuracy (Raw accuracy): similarly to the previous method, it involves comparing token
probabilities, but in this case, the probabilities of the full answers’ sequences are used to determine the correct
option, using the sum of the log likelihoods of all tokens; this corresponds to the method used and reported in
the eval harness’ framework (see section 2.2);
• Token-normalized accuracy (T-norm accuracy): this method involves normalizing the likelihood for each answer (as obtained for the metric above) by dividing the sum by the number of tokens, so that longer answers are not penalized simply for containing more tokens (section 2.2);
• Byte-normalized accuracy (B-norm accuracy): this method also attempts to avoid penalizing longer choices, by normalizing the likelihood using the average log probability per character (using the number of characters in the full answer sequence); as mentioned in section 2.2, it is also used and reported in the eval harness' framework. A sketch combining these four accuracy variants is given after this list.
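As referenced above, the sketch below (our own, hedged illustration) loops over a toy dataset of (prompt, choices, correct_index) triples and computes all four accuracy variants using the helpers sketched in Section 2; the data format and field names are assumptions.

```python
# Our own, illustrative sketch tying the four accuracy variants together over a toy
# dataset of (prompt, choices, correct_index) triples; the data format is an assumption.
# Prompts are assumed to already contain the question, the lettered choices and a
# trailing "Answer:" cue. Uses the helpers sketched in Section 2.
def evaluate(dataset) -> dict:
    hits = {"OC": 0, "Raw": 0, "T-norm": 0, "B-norm": 0}
    for prompt, choices, correct_idx in dataset:
        letters = ["A", "B", "C", "D", "E"][: len(choices)]
        # OC accuracy: next-token probability of the option letter.
        hits["OC"] += predict_by_option_letter(prompt, letters) == letters[correct_idx]
        # Raw / T-norm / B-norm: likelihood of the full answer sequence (letter + text).
        scored = score_choices(prompt, [f"{let}. {text}" for let, text in zip(letters, choices)])
        keys = list(scored)  # insertion order matches the order of `choices`
        for name, metric in (("Raw", "raw"), ("T-norm", "token_norm"), ("B-norm", "byte_norm")):
            hits[name] += pick(scored, metric) == keys[correct_idx]
    return {name: 100.0 * count / len(dataset) for name, count in hits.items()}
```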
The evaluation is carried out in a zero-shot setting, meaning that no examples are added to the prompt. Also, to ensure
consistency and mitigate the impact of variations in prompts on model results, a standardized prompt design was
employed across the evaluation frameworks and methods. Refer to Appendix B for further details on the adopted
prompt design for each dataset.
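As an illustration of such a standardized prompt, the helper below approximates the layout shown in Box B2 of Appendix B (question, lettered choices, and a trailing "Answer:" cue); the exact formatting used in the study may differ.

```python
# Our approximation of the standardized prompt layout shown in Box B2 (Appendix B):
# the question, lettered choices, and a trailing "Answer:" cue. The exact formatting
# used in the study may differ.
def build_prompt(question: str, choices: list[str]) -> str:
    letters = "ABCDE"
    lines = [question]
    lines += [f"({letters[i]}) {choice}" for i, choice in enumerate(choices)]
    lines.append("Answer:")
    return "\n".join(lines)
```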
4 Results
The performance of the models across the selected benchmark datasets is summarized in Table 2, highlighting the accuracy metrics obtained with each of the four evaluation methods.
For each evaluation scenario, we note that Llama2-70B consistently outperforms other Llama2 (smaller) variants and
Mistral-7B across all benchmark datasets. However, within each benchmark dataset, a substantial variability in the
performance of the different models across the four methods is observed (Figure 1). For example, for MMLU, the performance of Mistral-7B ranges from 61.4% to 65.8%, while for HellaSwag, the performance of Llama2-70B fluctuates between 64.8% and 83.8%.
We also note that the effect of the normalization methods (e.g., B-norm accuracy) is not consistent across the benchmark datasets: for HellaSwag, the normalization-based accuracy metrics are higher than the raw accuracy metric, which does not normalize the log likelihoods of the responses according to their length, whereas the opposite behaviour is observed for other datasets (Table 2). To delve deeper into the factors influencing correct option selection, we investigated the impact of normalizing response likelihoods. For each question-answer pair, we assessed the length of the options and compared the length of the correct option with that of the wrong options (set to the median length of the wrong options), as sketched below.
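The per-question length difference can be computed as follows; this is our reading of the analysis (character lengths, median over the wrong options), and the helper name is our own.

```python
# Our reading of the length analysis: character length of the correct option minus
# the median character length of the wrong options; the helper name is our own.
from statistics import median

def length_delta(choices: list[str], correct_idx: int) -> float:
    correct_len = len(choices[correct_idx])
    wrong_lens = [len(c) for i, c in enumerate(choices) if i != correct_idx]
    return correct_len - median(wrong_lens)
```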
The Bland-Altman plot (depicted in black in Figure 2) illustrates the length difference between right and wrong options
for the (whole) HellaSwag dataset. No significant inherent bias in the length of the correct options compared to the
length of the wrong options is observed (and the same is observed for the other datasets too; figure not included in the
manuscript). That is, the lengths of correct and wrong answer options of the records included in the benchmark dataset are
similar. Notably, the figure also shows the length difference (in red) for those instances in which Mistral-7B incorrectly
selected the wrong option according to the unnormalized likelihood method (top panel) and B-norm likelihood method
(bottom panel).
Table 2: 0-shot performance comparison of the different models on the selected benchmark datasets. Accuracy results (in %) are shown according to each of the four evaluation methods used for determining the correct answer.
Figure 3 shows the results of a similar analysis for the MedQA benchmark dataset.
5 Discussion
In this paper, we aimed to explore the details of metric calculation methodologies employed by prominent evaluation
frameworks and their implications on the performance assessment of LLMs and the subsequent interpretability of
evaluation results. Specifically, we scrutinized the accuracy metrics, describing the intricacies of their calculation
methodologies. Extending our examination beyond theoretical considerations, we directed our efforts towards the
evaluation of widely recognized open-access language models. This evaluation encompassed different question-
answering datasets, including those structured as multiple-choice scenarios.
The results of our analyses shed light on some critical aspects of the evaluated LLMs and the associated methodologies.
Notably, considering each method individually, Llama2-70B consistently demonstrates superior performance across all
benchmark datasets when compared to both its smaller variants and Mistral-7B (Table 2). This has been observed in
other studies, in which the performance of Llama2-70B has been deemed to be superior in diverse evaluation benchmark
tasks, including question-answering tasks [5, 31].
Nevertheless, despite this overall superiority, the results from our analysis demonstrate a substantial variability within
each benchmark dataset across the four evaluation methods (Figure 1). The observed fluctuations in performance,
ranging between 5% and 26%, highlight the sensitivity of the reported accuracy metrics to the method-specific
implementations.
In evaluation frameworks such as eval-harness by EleutherAI [24], for calculating the accuracy of a model, the likelihood
of the (complete) response (for each option) is determined, and the option with the highest likelihood is deemed to be
the correct one. In order to avoid the introduction of a bias toward favoring shorter answers, eval-harness has introduced
a normalization step, in which the overall likelihood for each choice is divided by a measure of the length of the choice’s
sequence. These different approaches, as expected, produce different results.
In addition, Figure 2 highlights the impact of normalization on the selection bias of a model (shown in red) on HellaSwag. In the top panel (without normalization of the response's likelihood), a substantial bias is evident, with
the model preferring shorter answers; whereas, if normalization of the response’s likelihood (bottom panel) is employed,
the bias is noticeably reduced, underscoring the effect of normalization on enhancing the reliability of the response
selection, and ultimately the model’s accuracy. However, it is important to note that the impact of normalization on the
bias is not universally consistent across all datasets. In contrast to the observed mitigation of bias in the HellaSwag
dataset, the same analysis for other datasets reveals that the normalization process seems to introduce a bias rather than
reducing it in the selection of the correct option. This can be observed, for example, in the MedQA dataset (Figure 3).
[Figure 2 panels: Len Delta (length difference) on the vertical axis, ranging approximately from −200 to 200.]
Figure 2: Bland-Altman plots (left) and frequency plots or histograms (right) of the length difference between correct and wrong options for the HellaSwag benchmark dataset. The length difference for the entire dataset is shown in black (in both top and bottom panels). In the top panel, the length differences for the instances in which Mistral-7B incorrectly selected the wrong option in the unnormalized likelihood method (raw-based accuracy) are overlaid in red. In the bottom panel, the length differences for the instances in which Mistral-7B incorrectly selected the wrong option in the B-norm accuracy method are overlaid in red.
In this case, the normalization of response likelihoods appears to contribute to a discernible bias in the selection of the
correct option.
The impact of the response-likelihood normalization methods on the accuracy metrics reveals a lack of consistent behaviour across benchmark datasets. While for some datasets the normalization step appears to reduce the bias in the choice selection and produce better performance results, for others the normalization step seems to introduce a bias in the choice selection and, hence, yield less accurate results.
This variation in performance across frameworks highlights the importance of comprehensive benchmarking. The
performance of an LLM is typically a function of its architecture, the model’s training data and the extent to which
it has been fine-tuned to a particular domain of knowledge; i.e., one model may be exceedingly good at answering
medical questions, but it barely exceeds “Hello World” in a Python programming test. Notwithstanding this, the results
obtained in this investigation, as also observed in [32], suggest that the performance of an LLM is also highly dependent
on the methodology used and implementation details employed to evaluate them (even when using the same benchmark
dataset).
To conclude, evaluation frameworks play an important role in the assessment and enhancement of LLMs. They are
pivotal in addressing key challenges related to performance evaluation, offering valuable insights for model development,
and enhancing transparency. By providing a multi-dimensional perspective on LLMs, existing frameworks make
substantial contributions to the advancement and comprehension of these intricate AI systems. As the demand for
rigorous evaluation of LLMs intensifies, it becomes imperative not only to scrutinize the metrics themselves but
also to expose the methodologies of their calculation. The diversity in evaluation frameworks also introduces a
considerable challenge, as each framework may employ distinct metrics and computation methodologies tailored
to its specific objectives. Understanding the underlying assumptions and approaches during metric calculation is
[Figure 3 panels: Len Delta (length difference) on the vertical axis, ranging approximately from −40 to 40.]
Figure 3: Bland-Altman plots (left) and frequency plots or histograms (right) of the length difference between correct and wrong options for the MedQA benchmark dataset. The length difference for the entire dataset is shown in black (in both top and bottom panels). In the top panel, the length differences for the instances in which Llama2-70B incorrectly selected the wrong option in the unnormalized likelihood scenario (raw-based accuracy) are overlaid in red. In the bottom panel, the length differences for the instances in which Llama2-70B incorrectly selected the wrong option in the B-norm accuracy scenario are overlaid in red.
crucial for interpreting and comparing the results effectively. Variability in metric definitions across frameworks can
introduce ambiguity, making it difficult to draw meaningful cross-model and cross-study comparisons. Additionally,
certain metrics may inadvertently favor specific model characteristics or exhibit sensitivity to dataset idiosyncrasies.
Consequently, in the absence of thorough evaluation frameworks, researchers must delve into the nuances of metric
definitions and computation procedures to grasp the full context of reported results, ensuring an informed understanding
of LLM performance. This shift towards a more transparent and methodologically explicit evaluation paradigm is
pivotal for fostering reproducibility and advancing our collective understanding of the capabilities and limitations of
large language models.
Acknowledgments
This study has been supported by M42.
References
[1] OpenAI. GPT-4 technical report, 2023.
[2] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts,
Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha
Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran,
Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob Austin, Michael Isard, Guy Gur-Ari,
Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier
Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim,
Barret Zoph, Alexander Spiridonov, Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M.
Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Olek-
sandr Polozov, Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele
Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel. Palm: Scaling
language modeling with pathways, 2022.
[3] Danny Driess, Fei Xia, Mehdi S. M. Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid,
Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, Yevgen Chebotar, Pierre Sermanet, Daniel
Duckworth, Sergey Levine, Vincent Vanhoucke, Karol Hausman, Marc Toussaint, Klaus Greff, Andy Zeng, Igor
Mordatch, and Pete Florence. Palm-e: An embodied multimodal language model, 2023.
[4] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language
models. arXiv preprint arXiv:2302.13971, 2023.
[5] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov,
Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya
Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao,
Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas,
Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux,
Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar
Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan
Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor,
Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie
Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2:
Open foundation and fine-tuned chat models, 2023.
[6] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Maitha
Alhammadi, Mazzotta Daniele, Daniel Heslow, Julien Launay, Quentin Malartic, Badreddine Noune, Baptiste
Pannier, and Guilherme Penedo. The falcon series of language models: Towards open frontier models, 2023.
[7] Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall,
Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal,
Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri
Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan
Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, and Eric Xing. Jais and jais-chat:
Arabic-centric foundation and instruction-tuned open generative large language models, 2023.
[8] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Moham-
mad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and
Oskar van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023.
[9] Nolan Dey, Gurpreet Gosal, Zhiming, Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom,
and Joel Hestness. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale
cluster, 2023.
[10] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei
Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of
Machine Learning Research, 21(140):1–67, 2020.
[11] Yew Ken Chia, Pengfei Hong, Lidong Bing, and Soujanya Poria. Instructeval: Towards holistic evaluation of
instruction-tuned large language models, 2023.
[12] Joshua Maynez, Priyanka Agrawal, and Sebastian Gehrmann. Benchmarking large language model capabilities
for conditional generation, 2023.
[13] Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. Repairing the cracked foundation: A survey of
obstacles in evaluation practices for generated text, 2022.
[14] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Nee-
lakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances
in neural information processing systems, 33:1877–1901, 2020.
[15] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang,
Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan
Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. A survey of large language models, 2023.
[16] Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, and
Ji-Rong Wen. Large language models for information retrieval: A survey, 2023.
[17] Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq Joty, and
Jimmy Xiangji Huang. A systematic study and comprehensive evaluation of chatgpt on benchmark datasets, 2023.
[18] Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung, Daniel Kokotajlo,
Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth, Shahar Avin, Will Hawkins, Been
Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul Christiano, and Allan Dafoe. Model evaluation
for extreme risks, 2023.
[19] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang,
Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang,
Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric
Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam,
Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar
Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori
Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui
Zhang, and Yuta Koreeda. Holistic evaluation of language models, 2023.
[20] Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, Hongyi Wang, Bowen Tan, Tianhua Tao, Junbo Li, Yuqi Wang,
Suqi Sun, Omkar Pangarkar, Richard Fan, Yi Gu, Victor Miller, Yonghao Zhuang, Guowei He, Haonan Li, Fajri
Koto, Liping Tang, Nikhil Ranjan, Zhiqiang Shen, Xuguang Ren, Roberto Iriondo, Cun Mu, Zhiting Hu, Mark
Schulze, Preslav Nakov, Tim Baldwin, and Eric P Xing. LLM360: Towards fully transparent Open-Source LLMs,
December 2023.
[21] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. GLUE: A multi-task
benchmark and analysis platform for natural language understanding. In Tal Linzen, Grzegorz Chrupała, and
Afra Alishahi, editors, Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting
Neural Networks for NLP, pages 353–355, Brussels, Belgium, November 2018. Association for Computational
Linguistics.
[22] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature
matching with graph neural networks. In CVPR, 2020.
[23] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R.
Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat
Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang,
Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, Ambrose Slone, Ameet Rahane,
Anantharaman S. Iyer, Anders Andreassen, Andrea Madotto, Andrea Santilli, Andreas Stuhlmüller, Andrew
Dai, Andrew La, Andrew Lampinen, Andy Zou, Angela Jiang, Angelica Chen, Anh Vuong, Animesh Gupta,
Anna Gottardi, Antonio Norelli, Anu Venkatesh, Arash Gholamidavoodi, Arfa Tabassum, Arul Menezes, Arun
Kirubarajan, Asher Mullokandov, Ashish Sabharwal, Austin Herrick, Avia Efrat, Aykut Erdem, Ayla Karakaş,
B. Ryan Roberts, Bao Sheng Loe, Barret Zoph, Bartłomiej Bojanowski, Batuhan Özyurt, Behnam Hedayatnia,
Behnam Neyshabur, Benjamin Inden, Benno Stein, Berk Ekmekci, Bill Yuchen Lin, Blake Howald, Bryan
Orinion, Cameron Diao, Cameron Dour, Catherine Stinson, Cedrick Argueta, César Ferri Ramírez, Chandan
Singh, Charles Rathkopf, Chenlin Meng, Chitta Baral, Chiyu Wu, Chris Callison-Burch, Chris Waites, Christian
Voigt, Christopher D. Manning, Christopher Potts, Cindy Ramirez, Clara E. Rivera, Clemencia Siro, Colin Raffel,
Courtney Ashcraft, Cristina Garbacea, Damien Sileo, Dan Garrette, Dan Hendrycks, Dan Kilman, Dan Roth,
Daniel Freeman, Daniel Khashabi, Daniel Levy, Daniel Moseguí González, Danielle Perszyk, Danny Hernandez,
Danqi Chen, Daphne Ippolito, Dar Gilboa, David Dohan, David Drakard, David Jurgens, Debajyoti Datta, Deep
Ganguli, Denis Emelin, Denis Kleyko, Deniz Yuret, Derek Chen, Derek Tam, Dieuwke Hupkes, Diganta Misra,
Dilyar Buzan, Dimitri Coelho Mollo, Diyi Yang, Dong-Ho Lee, Dylan Schrader, Ekaterina Shutova, Ekin Dogus
Cubuk, Elad Segal, Eleanor Hagerman, Elizabeth Barnes, Elizabeth Donoway, Ellie Pavlick, Emanuele Rodola,
Emma Lam, Eric Chu, Eric Tang, Erkut Erdem, Ernie Chang, Ethan A. Chi, Ethan Dyer, Ethan Jerzak, Ethan
Kim, Eunice Engefu Manyasi, Evgenii Zheltonozhskii, Fanyue Xia, Fatemeh Siar, Fernando Martínez-Plumed,
Francesca Happé, Francois Chollet, Frieda Rong, Gaurav Mishra, Genta Indra Winata, Gerard de Melo, Germán
Kruszewski, Giambattista Parascandolo, Giorgio Mariani, Gloria Wang, Gonzalo Jaimovitch-López, Gregor Betz,
Guy Gur-Ari, Hana Galijasevic, Hannah Kim, Hannah Rashkin, Hannaneh Hajishirzi, Harsh Mehta, Hayden Bogar,
Henry Shevlin, Hinrich Schütze, Hiromu Yakura, Hongming Zhang, Hugh Mee Wong, Ian Ng, Isaac Noble, Jaap
Jumelet, Jack Geissinger, Jackson Kernion, Jacob Hilton, Jaehoon Lee, Jaime Fernández Fisac, James B. Simon,
James Koppel, James Zheng, James Zou, Jan Kocoń, Jana Thompson, Janelle Wingfield, Jared Kaplan, Jarema
Radom, Jascha Sohl-Dickstein, Jason Phang, Jason Wei, Jason Yosinski, Jekaterina Novikova, Jelle Bosscher,
Jennifer Marsh, Jeremy Kim, Jeroen Taal, Jesse Engel, Jesujoba Alabi, Jiacheng Xu, Jiaming Song, Jillian Tang,
Joan Waweru, John Burden, John Miller, John U. Balis, Jonathan Batchelder, Jonathan Berant, Jörg Frohberg, Jos
Rozen, Jose Hernandez-Orallo, Joseph Boudeman, Joseph Guerr, Joseph Jones, Joshua B. Tenenbaum, Joshua S.
Rule, Joyce Chua, Kamil Kanclerz, Karen Livescu, Karl Krauth, Karthik Gopalakrishnan, Katerina Ignatyeva,
Katja Markert, Kaustubh D. Dhole, Kevin Gimpel, Kevin Omondi, Kory Mathewson, Kristen Chiafullo, Ksenia
Shkaruta, Kumar Shridhar, Kyle McDonell, Kyle Richardson, Laria Reynolds, Leo Gao, Li Zhang, Liam Dugan,
Lianhui Qin, Lidia Contreras-Ochando, Louis-Philippe Morency, Luca Moschella, Lucas Lam, Lucy Noble,
Ludwig Schmidt, Luheng He, Luis Oliveros Colón, Luke Metz, Lütfi Kerem Şenel, Maarten Bosma, Maarten
Sap, Maartje ter Hoeve, Maheen Farooqi, Manaal Faruqui, Mantas Mazeika, Marco Baturan, Marco Marelli,
Marco Maru, Maria Jose Ramírez Quintana, Marie Tolkiehn, Mario Giulianelli, Martha Lewis, Martin Potthast,
Matthew L. Leavitt, Matthias Hagen, Mátyás Schubert, Medina Orduna Baitemirova, Melody Arnaud, Melvin
McElrath, Michael A. Yee, Michael Cohen, Michael Gu, Michael Ivanitskiy, Michael Starritt, Michael Strube,
Michał Swędrowski, Michele Bevilacqua, Michihiro Yasunaga, Mihir Kale, Mike Cain, Mimee Xu, Mirac
Suzgun, Mitch Walker, Mo Tiwari, Mohit Bansal, Moin Aminnaseri, Mor Geva, Mozhdeh Gheini, Mukund Varma
T, Nanyun Peng, Nathan A. Chi, Nayeon Lee, Neta Gur-Ari Krakover, Nicholas Cameron, Nicholas Roberts,
Nick Doiron, Nicole Martinez, Nikita Nangia, Niklas Deckers, Niklas Muennighoff, Nitish Shirish Keskar,
Niveditha S. Iyer, Noah Constant, Noah Fiedel, Nuan Wen, Oliver Zhang, Omar Agha, Omar Elbaghdadi, Omer
Levy, Owain Evans, Pablo Antonio Moreno Casares, Parth Doshi, Pascale Fung, Paul Pu Liang, Paul Vicol,
Pegah Alipoormolabashi, Peiyuan Liao, Percy Liang, Peter Chang, Peter Eckersley, Phu Mon Htut, Pinyu Hwang,
Piotr Miłkowski, Piyush Patil, Pouya Pezeshkpour, Priti Oli, Qiaozhu Mei, Qing Lyu, Qinlang Chen, Rabin
Banjade, Rachel Etta Rudolph, Raefer Gabriel, Rahel Habacker, Ramon Risco, Raphaël Millière, Rhythm Garg,
Richard Barnes, Rif A. Saurous, Riku Arakawa, Robbe Raymaekers, Robert Frank, Rohan Sikand, Roman Novak,
Roman Sitelew, Ronan LeBras, Rosanne Liu, Rowan Jacobs, Rui Zhang, Ruslan Salakhutdinov, Ryan Chi, Ryan
Lee, Ryan Stovall, Ryan Teehan, Rylan Yang, Sahib Singh, Saif M. Mohammad, Sajant Anand, Sam Dillavou,
Sam Shleifer, Sam Wiseman, Samuel Gruetter, Samuel R. Bowman, Samuel S. Schoenholz, Sanghyun Han,
Sanjeev Kwatra, Sarah A. Rous, Sarik Ghazarian, Sayan Ghosh, Sean Casey, Sebastian Bischoff, Sebastian
Gehrmann, Sebastian Schuster, Sepideh Sadeghi, Shadi Hamdan, Sharon Zhou, Shashank Srivastava, Sherry Shi,
Shikhar Singh, Shima Asaadi, Shixiang Shane Gu, Shubh Pachchigar, Shubham Toshniwal, Shyam Upadhyay,
Shyamolima, Debnath, Siamak Shakeri, Simon Thormeyer, Simone Melzi, Siva Reddy, Sneha Priscilla Makini,
Soo-Hwan Lee, Spencer Torene, Sriharsha Hatwar, Stanislas Dehaene, Stefan Divic, Stefano Ermon, Stella
Biderman, Stephanie Lin, Stephen Prasad, Steven T. Piantadosi, Stuart M. Shieber, Summer Misherghi, Svetlana
Kiritchenko, Swaroop Mishra, Tal Linzen, Tal Schuster, Tao Li, Tao Yu, Tariq Ali, Tatsu Hashimoto, Te-Lin Wu,
Théo Desbordes, Theodore Rothschild, Thomas Phan, Tianle Wang, Tiberius Nkinyili, Timo Schick, Timofei
Kornev, Titus Tunduny, Tobias Gerstenberg, Trenton Chang, Trishala Neeraj, Tushar Khot, Tyler Shultz, Uri
Shaham, Vedant Misra, Vera Demberg, Victoria Nyamai, Vikas Raunak, Vinay Ramasesh, Vinay Uday Prabhu,
Vishakh Padmakumar, Vivek Srikumar, William Fedus, William Saunders, William Zhang, Wout Vossen, Xiang
Ren, Xiaoyu Tong, Xinran Zhao, Xinyi Wu, Xudong Shen, Yadollah Yaghoobzadeh, Yair Lakretz, Yangqiu Song,
Yasaman Bahri, Yejin Choi, Yichi Yang, Yiding Hao, Yifu Chen, Yonatan Belinkov, Yu Hou, Yufang Hou, Yuntao
Bai, Zachary Seid, Zhuoye Zhao, Zijian Wang, Zijie J. Wang, Zirui Wang, and Ziyi Wu. Beyond the imitation
game: Quantifying and extrapolating the capabilities of language models, 2023.
[24] Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey
Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang,
Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021.
[25] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https:
//github.com/open-compass/opencompass, 2023.
[26] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt.
Measuring massive multitask language understanding. Proceedings of the International Conference on Learning
Representations (ICLR), 2021.
[27] John Snow Labs Inc. Langtest: Deliver safe and effective language models, 2023. Accessed: November, 2023.
[28] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish
your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
2019.
[29] Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. What disease does
this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences,
11(14):6421, 2021.
[30] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new
dataset for open book question answering. In EMNLP, 2018.
[31] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las
Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne
Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed.
Mistral 7b, 2023.
[32] Kun Zhou, Yutao Zhu, Zhipeng Chen, Wentong Chen, Wayne Xin Zhao, Xu Chen, Yankai Lin, Ji-Rong Wen, and
Jiawei Han. Don’t make your llm an evaluation benchmark cheater, 2023.
Box A1: Example from the HellaSwag dataset (context followed by the correct and incorrect endings).
A bearded man is seen speaking to the camera and making several faces. The man
Correct answer:
- then holds up a razor and begins shaving his face.
Incorrect answers:
- then switches off and shows himself via the washer and dryer rolling down a towel and scrubbing the floor.
- then rubs and wipes down an individual’s face and leads into another man playing another person’s flute.
- is then seen eating food on a ladder while still speaking.
Box A2: Example question/answer from the MedQA dataset (the correct answer is option (E), Nitrofurantoin).
A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started
1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise
feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is
122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical
exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. Which of the following is
the best treatment for this patient?
(A) Ampicillin. (B) Ceftriaxone (C) Ciprofloxacin (D) Doxycycline (E) Nitrofurantoin
Box A3: Example test from the MMLU dataset's Microeconomics task (the correct answer is option (D)).
One of the reasons that the government discourages and regulates monopolies is that
(A) producer surplus is lost and consumer surplus is gained.
(B) monopoly prices ensure productive efficiency but cost society allocative efficiency.
(C) monopoly firms do not engage in significant research and development.
(D) consumer surplus is lost with higher prices and lower levels of output.
Box A4: Example question/answer from the OpenBookQA dataset (the option in bold is deemed to be the
correct answer).
Box B2: Example prompt input and outputs from MMLU dataset’s Microeconomics task in OpenCom-
pass.
————————————————————————————————————————————
One of the reasons that the government discourages and regulates monopolies is that
(A) producer surplus is lost and consumer surplus is gained.
(B) monopoly prices ensure productive efficiency but cost society allocative efficiency.
(C) monopoly firms do not engage in significant research and development.
(D) consumer surplus is lost with higher prices and lower levels of output.
Answer:
————————————————————————————————————————————
Possible answers:
A.
B.
C.
D.