evaluation

Benchmarks distract us from what matters

Mar 26, 2025 ehudreiter 2 Comments

I suspect that our fixation on LLM benchmarks may be driving us to optimise LLMs for capabilities that are easy to benchmark (such as math problems) even if they are not of much interest to users, and to ignore capabilities (such as emotional appropriateness) which are important to real users but hard to assess with benchmarks.

other

People do not understand how LLMs can/cannot help them

Mar 13, 2025 ehudreiter 1 Comment

People will make much better use of LLMs if they understand what the technology can and cannot do. Unfortunately, many people have little understanding of this; I make a few suggestions which could perhaps help a bit.

other

Improving Bayesian Networks

Mar 3, 2025 ehudreiter Leave a comment

Nikolay Babakov has recently published several papers on Bayesian networks (BNs), including challenges in reusing BNs, ideas for explaining BNs (work with Jaime Sevilla), and using LLMs to help build BNs. I help to supervise Nikolay, and think BNs can potentially be a useful way to reason under uncertainty which is configurable and explainable.

evaluation

I want a benchmark for emotional upset

Feb 17, 2025 ehudreiter 1 Comment

I would love to see benchmarks which assess whether generated texts are emotionally upsetting. This is a major problem which we frequently encounter in our work on using AI to support patients. It would be challenging to build such a benchmark (nothing like it exists today), but we need a broader range of benchmarks which assess complex real-world quality criteria such as emotional impact.

evaluation

NLG Evaluation 2025 vs 2015: much improved but needs to be better

Feb 4, 2025 ehudreiter Leave a comment

How has NLG evaluation changed in the past ten years? The short answer is that the technology is much better (e.g., LLM-as-judge), but practice (e.g., experimental rigour) remains poor, and commercial interests are more prominent.

other

Vision: AI personal health assistants

Jan 23, 2025 (updated Jan 29, 2025) ehudreiter 2 Comments

I think there is enormous potential in using AI personal health assistants to improve health, including things like helping patients manage chronic illness, live more healthily, make informed decisions, and communicate with clinicians. There are huge challenges (technical and non-technical), but if this could be done well, it could radically improve health and enable healthcare systems to cope with increasingly elderly populations.

evaluation

Do LLM coding benchmarks measure real-world utility?

Jan 13, 2025 (updated Jan 22, 2025) ehudreiter 2 Comments

LLM benchmarks for coding are closer to real-world use than other LLM benchmarks, but they still do not measure real-world utility. I explain this by contrasting what is measured by SWE-bench with what is measured by a recent study of real-world utility in software development.

evaluation

We need better LLM benchmarks

Jan 3, 2025 (updated Jan 31, 2025) ehudreiter 6 Comments

Current benchmarks (and benchmark suites) for evaluating LLMs are disappointing. I describe the properties that I think good benchmarks and benchmark suites should have, but often do not, such as being correct, challenging, diverse, and real-world.

evaluation

Do LLM benchmarks ignore NLG?

Dec 26, 2024 (updated Dec 27, 2024) ehudreiter 2 Comments

I was very disappointed to realise that the evaluation suite for Amazon Nova (and I assume for other LLMs) has poor coverage of NLG tasks. This is surprising, since LLMs are largely used to generate texts; shouldn't they be evaluated, at least in part, on their ability to do this well?

academics

Interesting papers in 2024

Dec 16, 2024 ehudreiter Leave a comment

There were lots of interesting papers in 2024. I describe a few of them, and also list others I have mentioned in previous blogs; all are about evaluation, experimental rigour, real-world utility, and/or healthcare applications.

Ehud Reiter's Blog

Ehud's thoughts about Natural Language Generation. Also see my book on NLG.
