Benchmarks distract us from what matters
I suspect that our fixation on LLM benchmarks may be driving us to optimise LLMs for capabilities that are easy to benchmark (such as solving math problems), even if they are of little interest to users, and to ignore capabilities (such as emotional appropriateness) that matter to real users but are hard to assess with benchmarks.