evaluation

Benchmarks distract us from what matters

I suspect that our fixation on LLM benchmarks may be driving us to optimise LLMs for capabilities that are easy to benchmark (such as maths problems) even when these are of little interest to users, while ignoring capabilities (such as emotional appropriateness) that matter to real users but are hard to assess with benchmarks.

other

Improving Bayesian Networks

Nikolay Babakov has recently published several papers on Bayesian networks (BNs), including challenges in reusing BNs, ideas for explaining BNs (work with Jaime Sevilla), and using LLMs to help build BNs. I co-supervise Nikolay, and I think BNs can be a useful way to reason under uncertainty in a manner that is both configurable and explainable.
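To illustrate the kind of configurable, explainable uncertain reasoning a BN supports, here is a minimal sketch of a two-node network (Disease → Symptom). The probabilities are invented for illustration, and this is not taken from the papers mentioned above; real BN tooling scales the same idea to many variables via a graph-structured factorisation.

```python
# Minimal two-node Bayesian network sketch: Disease -> Symptom.
# All probabilities are illustrative, made-up values.

# Configurable parameters: prior and conditional probability table.
P_DISEASE = 0.01                        # P(Disease = true)
P_SYMPTOM_GIVEN_DISEASE = {
    True: 0.9,                          # P(Symptom | Disease = true)
    False: 0.05,                        # P(Symptom | Disease = false)
}

def posterior_disease(symptom_observed: bool) -> float:
    """P(Disease = true | Symptom = symptom_observed), via Bayes' rule."""
    joint = {}
    for disease in (True, False):
        prior = P_DISEASE if disease else 1 - P_DISEASE
        likelihood = (P_SYMPTOM_GIVEN_DISEASE[disease]
                      if symptom_observed
                      else 1 - P_SYMPTOM_GIVEN_DISEASE[disease])
        joint[disease] = prior * likelihood   # unnormalised posterior term
    return joint[True] / (joint[True] + joint[False])

# Observing the symptom raises the probability of disease from the
# 1% prior to roughly 15%; the explanation is fully inspectable in
# the prior and the conditional table above.
print(round(posterior_disease(True), 3))
```

The "explainable" part is that every inference can be traced back to the prior and the conditional probability table, which a domain expert can inspect and adjust; that configurability is harder to obtain from an opaque model.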

evaluation

I want a benchmark for emotional upset

I would love to see benchmarks which assess whether generated texts are emotionally upsetting. This is a major problem that we frequently encounter in our work on using AI to support patients. Building such a benchmark would be challenging (nothing like it exists today), but we need a broader range of benchmarks which assess complex real-world quality criteria such as emotional impact.

other

Vision: AI personal health assistants

I think there is enormous potential in using AI personal health assistants to improve health, including helping patients manage chronic illness, live more healthily, make informed decisions, and communicate with clinicians. There are huge challenges (technical and non-technical), but if this could be done well, it could radically improve health and enable healthcare systems to cope with increasingly elderly populations.