
Challenges in Evaluating LLMs

UPDATE: I’ve uploaded a (slightly modified version of) my actual workshop talk. (PDF)

Last week I gave a talk on Challenges in Evaluating LLMs, where I basically said that good evaluations of LLMs are possible but not easy. In particular, they require the experimenter to be familiar with the challenges (see below) and to carefully design and execute the experiment. Unfortunately, many NLP researchers do not do this, which means that a lot of published LLM evaluations are seriously flawed.

Anyway, a few people asked me about this talk, so I thought I’d write it up as a blog post.

Challenge 1: Data contamination

People evaluating LLMs need to ensure that the model has not seen the test data before; if it has, this is called data contamination and the evaluation is not valid. I’ve written a blog on data contamination, so I won’t go into detail here, except to say that the problem is widespread and subtle (Balloccu et al 2024).

One consequence of data contamination is that LLMs should not be tested on public benchmark data sets where test data is published in a GitHub repository. Publishing benchmark data on GitHub was good practice two years ago, but it should not be done in 2024.
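One crude check (it cannot prove the absence of contamination, but it can reveal its presence) is to prompt the model with the start of a held-out test instance and see whether it reproduces the rest verbatim. Here is a minimal sketch; `query_llm` is a placeholder for whichever model or API is being evaluated:

```python
# Sketch of a crude contamination probe: give the model the first part of a
# held-out test instance and check whether it completes the rest verbatim.
# `query_llm` is a placeholder for whatever model/API is being evaluated;
# passing this check does NOT prove that the test data is uncontaminated.

def looks_contaminated(test_text: str, query_llm, prefix_fraction: float = 0.5) -> bool:
    cut = int(len(test_text) * prefix_fraction)
    prefix, expected_rest = test_text[:cut], test_text[cut:]
    completion = query_llm(prefix)
    # Verbatim (or near-verbatim) reproduction of unseen text is a red flag.
    return completion.strip().startswith(expected_rest.strip()[:50])
```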

Unfortunately, many researchers seem to ignore this problem. For example, I am very interested in using LLMs to communicate with patients in a health setting, so I was excited by a recent Google paper that claimed to do this well. But I had major worries about data contamination after reading the paper (e.g., some test data came from the web), which were reinforced because there was no mention anywhere in the paper of data contamination issues. So can I believe this paper?

Challenge 2: Replicability

I hope we all agree that scientific experiments should be replicable! If I do an experiment, you should be able to repeat the experiment and get similar results.

However, it is difficult to repeat experiments with closed LLMs such as GPT. One problem is that closed models are constantly evolving and changing, which means that if I do an experiment using GPT4 in January and you repeat it in July, you may get different results because GPT4 has changed.

Even worse, some models get deprecated or eliminated. For example, I recently reviewed a paper which used the GPT3 text-davinci-003 model, which was very popular a few years ago. Unfortunately this model no longer exists (at least I couldn’t find it). So not only is it impossible for me to replicate the experiment, it’s also impossible for the authors to replicate their own experiment.

One solution to this problem is to use open models with checkpoints which are available on Huggingface or other public repositories. Unfortunately, reviewers may then complain that researchers are not using the best models…
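As a minimal sketch of what this looks like in practice, an open checkpoint can be pinned to a specific revision so that exactly the same weights can be reloaded later; the model id and commit hash below are placeholders, not real identifiers:

```python
# Sketch: pin an open model to a specific checkpoint revision so that the
# experiment can later be re-run on exactly the same weights.
# The model id and commit hash are placeholders, not real identifiers.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "some-org/some-open-llm"   # placeholder Hugging Face model id
REVISION = "abc123def4567890"         # placeholder commit hash of the checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, revision=REVISION)
```

Recording the pinned revision (along with prompts and decoding parameters) in the paper or repository is what actually makes the experiment repeatable.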

Challenge 3: Evaluating high quality texts

LLMs can produce excellent texts, which can be better than lower-quality human-written texts. For example, if I use an LLM to translate a text, the result won’t be as good as a careful translation by an excellent translator, but it may be better than a rushed translation by a novice translator who has been given a heavy workload.

One consequence of this is that it does not make sense to evaluate LLM outputs using reference-based metrics (BLEU, ROUGE, BLEURT, etc), unless the reference texts are very good. A reference-based metric essentially evaluates a text by assessing how similar it is to a human “reference” text. But this makes no sense if the LLM output is better than the reference text! And while some reference texts are very good (high-quality texts carefully written by domain experts), there are loads of low-quality reference texts produced by crowd-workers or overworked novices.
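To make concrete how a reference-based metric works, here is a minimal sketch using sacreBLEU; the sentences are invented, and the point is that the score only measures closeness to this particular reference:

```python
# Sketch: a reference-based metric scores a candidate purely by its overlap
# with the reference text, so the score is only as meaningful as the reference.
# The example sentences are invented.
import sacrebleu

candidates = ["The patient should take one tablet twice a day with food."]
references = [["Take one tablet two times daily, with meals."]]  # list of reference streams

score = sacrebleu.corpus_bleu(candidates, references)
print(score.score)  # high = similar to this reference, not necessarily "good"
```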

In short, it makes no sense to evaluate LLM output using BLEU, ROUGE, BLEURT, etc, unless the experimenter can show that the reference texts used by the metrics are very good. Despite this, I see numerous papers which evaluate LLMs using these metrics and do not say anything about the quality of their reference texts.

Challenge 4: Evaluating mixed-quality texts

Another challenge is that LLMs can produce excellent texts 99% of the time, but also produce poor texts (inappropriate, inaccurate, unsafe, etc) 1% of the time. This is not acceptable in many contexts and use cases, including medical applications.

This means that we need to evaluate worst-case as well as average-case performance of an LLM. But this is difficult, not least because LLMs (incredibly complex black boxes) fail in unpredictable contexts. Of course we can use software testers (or “red teams”) to try to find problematic cases, but there is no guarantee that the testers will find all of the problems.
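At a minimum, we can report worst-case and failure-rate numbers alongside the average, rather than averages alone. A minimal sketch, with invented scores and an invented acceptability threshold:

```python
# Sketch: report worst-case behaviour and failure rate, not just the mean.
# `scores` would come from human judgements or an automatic safety check;
# the numbers and the 0.5 acceptability threshold are invented.
scores = [0.94, 0.97, 0.91, 0.12, 0.96]

average = sum(scores) / len(scores)
worst = min(scores)
failure_rate = sum(s < 0.5 for s in scores) / len(scores)

print(f"average={average:.2f}  worst={worst:.2f}  failure rate={failure_rate:.0%}")
# A high average (here 0.78) can hide the single unacceptable output.
```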

I don’t have a good solution to this problem, which is a real “blocker” for the use of LLMs in many high-value use cases.

Challenge 5: Accurate but not appropriate

Evaluations of generated texts have generally focused on readability (fluency) and accuracy. However, in my group we see many cases where LLM-generated texts are readable and accurate but still not appropriate things to say. I’ve written several blogs about this, so I won’t go into detail here.

However, I will say that, as with data contamination, what worries me is that many researchers ignore this issue. For instance, I took an example health-related output from another Google paper (which presumably the authors thought was a good example which highlighted what their system could do) and showed it to a doctor, who was horrified and said she would never say this to a patient (blog).

Researchers who care about this issue should make an effort to find out what makes texts acceptable in their domain, and then set up an appropriate evaluation (perhaps with domain experts).

Final thoughts

I’ve listed five challenges which make evaluating LLMs more difficult than evaluating earlier NLP systems; there are of course others. Meaningful LLM evaluations should be designed (and executed) in a manner which is robust and minimizes the effects of these challenges.

Unfortunately, I see many papers which ignore these challenges. The fundamental problem is that many NLP researchers don’t seem very interested in high-quality experiments (there are of course exceptions!). In other words, while psychologists and medical researchers (for example) place huge emphasis on experimental skills, and regard experimental design and execution as core research competences, many NLP researchers focus on models, algorithms, or prompts, and regard experiments as an “afterthought” to be done as quickly and cheaply as possible, perhaps by copying a previous experiment with no thought about whether that older experiment is still valid for LLMs.

This attitude does not lead to good science! If NLP is a scientific endeavour, we need to do good experiments.
