Large Language Models (LLMs) have achieved remarkable progress in conversational AI, yet their robustness remains limited when handling complex multi-turn dialogues that require retrieving evidence from distant utterances and reasoning across scattered information while adapting to frequent task switches. To address gaps in existing benchmarks that focus on short conversations or provide full dialogue history upfront, we propose MARS-Bench, a multi-turn dialogue benchmark constructed from real-world play-by-play sports data. MARS-Bench emphasizes three key features: Ultra Multi-turn Dialogues with over 30 turns per instance, Cross-turn Tasks requiring reasoning over non-adjacent information, and Interactive Multi-turn Generation where LLMs must respond at every turn. The benchmark defines four core tasks: Instruction Following (IF), Context Retrieval (CR), Information Reasoning (IR), and Task Switching (TS).
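The interactive setting described above, where the model must respond at every turn rather than receive the full history upfront, can be sketched as a simple loop. This is a minimal illustration, not the benchmark's actual harness: `run_interactive_dialogue` and the stand-in `echo_model` are hypothetical names, and a real evaluation would call an LLM in place of the stub.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def run_interactive_dialogue(
    user_turns: List[str],
    model: Callable[[List[Message]], str],
) -> List[Message]:
    """Feed user turns one at a time; the model answers before seeing the next turn."""
    history: List[Message] = []
    for turn in user_turns:
        history.append({"role": "user", "content": turn})
        reply = model(history)  # the model only sees the history so far
        history.append({"role": "assistant", "content": reply})
    return history


def echo_model(history: List[Message]) -> str:
    """Hypothetical stand-in for an LLM: reports how many user turns it has seen."""
    n_user = sum(1 for m in history if m["role"] == "user")
    return f"ack turn {n_user}"


if __name__ == "__main__":
    turns = [f"Play-by-play event {i}" for i in range(1, 4)]
    for message in run_interactive_dialogue(turns, echo_model):
        print(message["role"], ":", message["content"])
```

Because every assistant reply is appended to the history, an early mistake stays in the context for all later turns, which is one reason this setting is harder than answering a single question over a complete transcript.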
Each sample in MARS-Bench includes (1) detailed play-by-play records, (2) team rosters, and (3) player statistics. The play-by-play records and team rosters serve as direct inputs for the language models, while player statistics are utilized for verifying the correctness of model-generated answers, ensuring a robust evaluation framework.
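One way to picture this split between model inputs and held-out verification data is the sketch below. The field names, the `PlayEvent` schema, and the points-only statistic are assumptions made for illustration; the actual data format is not specified here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PlayEvent:
    period: int   # game segment (quarter or period) in which the play happened
    player: str   # hypothetical player identifier
    points: int   # points scored on this play (0 for non-scoring events)


@dataclass
class GameSample:
    play_by_play: List[PlayEvent]   # given to the model
    rosters: Dict[str, List[str]]   # given to the model: team -> players
    player_stats: Dict[str, int]    # held out: player -> total points


def points_from_plays(sample: GameSample) -> Dict[str, int]:
    """Aggregate per-player points from the play-by-play record."""
    totals: Counter = Counter()
    for event in sample.play_by_play:
        totals[event.player] += event.points
    return dict(totals)


def answer_matches_stats(sample: GameSample, player: str, answer: int) -> bool:
    """Check a model's numeric answer against the held-out statistics."""
    return sample.player_stats.get(player) == answer


if __name__ == "__main__":
    sample = GameSample(
        play_by_play=[
            PlayEvent(1, "Player A", 2),
            PlayEvent(2, "Player B", 3),
            PlayEvent(4, "Player A", 3),
        ],
        rosters={"Home": ["Player A"], "Away": ["Player B"]},
        player_stats={"Player A": 5, "Player B": 3},
    )
    print(points_from_plays(sample))
    print(answer_matches_stats(sample, "Player A", 5))
```

In this toy example the statistics can be re-derived from the plays, so the held-out table acts as an independent check on both the question's reference answer and the model's response.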
MARS-Bench is constructed in a three-stage pipeline: (1) Data Collection, which gathers sports data including game events and commentary; (2) Question Construction, which generates (Question, Answer, Checklist) triplets designed to probe reasoning and dialogue capabilities; and (3) Quality Assurance, a review process that checks the correctness, clarity, and difficulty level of every benchmark task.
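To make the (Question, Answer, Checklist) triplet concrete, the sketch below models it as a small record and scores a response by checklist coverage. How checklists are actually used for grading is not described in this text, so both the `QATriplet` layout and the case-insensitive substring matching in `checklist_coverage` are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QATriplet:
    question: str
    answer: str
    checklist: List[str]  # assumed: strings a correct response should contain


def checklist_coverage(triplet: QATriplet, response: str) -> float:
    """Fraction of checklist items found (case-insensitively) in the response."""
    if not triplet.checklist:
        return 0.0
    text = response.lower()
    hits = sum(1 for item in triplet.checklist if item.lower() in text)
    return hits / len(triplet.checklist)


if __name__ == "__main__":
    triplet = QATriplet(
        question="Who scored the most points in the second quarter, and how many?",
        answer="Player A scored 12 points.",
        checklist=["Player A", "12"],
    )
    print(checklist_coverage(triplet, "Player A led the quarter with 12 points."))
    print(checklist_coverage(triplet, "Player A led the quarter with 10 points."))
```

Substring matching is brittle (for example, "12" also matches "112"), so a real grader would more plausibly rely on structured answer extraction or a judge model.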
We evaluated 23 representative models on MARS-Bench, including both closed-source and open-source LLMs with and without explicit reasoning capabilities.
Based on the evaluation results of various representative models on MARS-Bench, we summarize the following key observations.
Even the top models score only around 70 points, and performance decreases as the number of dialogue turns increases, which highlights their limitations in handling extended multi-turn interactions. Lower scores on the Instruction Following (IF), Information Reasoning (IR), and particularly Task Switching (TS) tasks further underscore deficiencies in cross-turn context management and interactive scenarios.
In challenging multi-turn dialogue tasks, closed-source models consistently outperform their open-source counterparts. For example, Google's Gemini-2.5-Pro achieves an overall score of 72.44 on MARS-Bench, while the top open-source model, DeepSeek-R1, reaches just 45.42. Open-source models often lack the scale and targeted optimization needed for intricate information reasoning and task switching.
Models with chain-of-thought reasoning tend to engage in more deliberate, System 2-style inference, exhibiting higher consistency and correctness. In contrast, models relying on System 1-style heuristic generation are more susceptible to variations in task complexity and context, leading to weaker overall performance. For instance, DeepSeek-R1 (45.42) outperforms DeepSeek-V3 (37.31) by 8.11 points overall, and by over 12 points on information reasoning (40.01 vs. 27.70).
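The gaps quoted in these observations follow directly from the reported scores. The short check below recomputes them; the dictionary layout and the `gap` helper are illustrative, while the numbers are the ones stated in the text.

```python
# Scores as reported in the observations above.
scores = {
    "Gemini-2.5-Pro": {"overall": 72.44},
    "DeepSeek-R1": {"overall": 45.42, "IR": 40.01},
    "DeepSeek-V3": {"overall": 37.31, "IR": 27.70},
}


def gap(model_a: str, model_b: str, metric: str) -> float:
    """Score difference (model_a minus model_b) on a metric, rounded to 2 decimals."""
    return round(scores[model_a][metric] - scores[model_b][metric], 2)


if __name__ == "__main__":
    print(gap("DeepSeek-R1", "DeepSeek-V3", "overall"))     # 8.11
    print(gap("DeepSeek-R1", "DeepSeek-V3", "IR"))          # 12.31
    print(gap("Gemini-2.5-Pro", "DeepSeek-R1", "overall"))  # 27.02
```

The reasoning model's advantage over its non-reasoning sibling (8.11 points overall) is therefore much smaller than the 27.02-point gap between the best closed-source and the best open-source model.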
Both reasoning-enhanced and standard models show relatively poor performance on instruction following. Current models struggle to track turn-level structures as required by system prompts, often failing to produce the correct output in the specified dialogue turn. This suggests limitations in aligning generation behavior with round-dependent instructions.
Each game segment/period refers to one of the three periods in NHL games or one of the four quarters in NBA games (excluding overtime).
Instruction Following (IF): Follow turn-specific instructions with format constraints.
Context Retrieval (CR): Locate and retrieve factual information from previous dialogue turns.
Information Reasoning (IR): Aggregate and reason over distributed contextual information.
Task Switching (TS): Handle abrupt interleaving of unrelated queries.
To better understand how Large Language Models process multi-turn dialogues, we conducted an attention visualization analysis on Qwen2.5-7B-Instruct. This helps reveal how the structure of the input (e.g., number of turns, presence of special tokens) influences where the model focuses its attention during text generation.
Multi-turn inputs contain many special tokens, such as turn separators (e.g., <|im_end|>). Our analysis shows that these special tokens can attract a disproportionately high amount of attention. This might divert the model's focus from the actual meaningful content, potentially impacting task performance.
The attention visualizations below illustrate these patterns.
One visualization highlights how special tokens, such as turn separators (e.g., <|im_end|>), can become hotspots for model attention.
Another demonstrates that as dialogues become longer, the model's attention to crucial semantic content tends to diminish.
These insights suggest that the input structure for multi-turn dialogues significantly affects model behavior and performance.
Attention visualization on Qwen2.5-7B-Instruct: multi-turn inputs introduce a higher volume of special tokens (e.g., <|im_end|>), which are observed to absorb a substantial portion of the model's attention, potentially reducing focus on semantic content.
Attention visualization on Qwen2.5-7B-Instruct: the model's attention to essential semantic information within the dialogue history is shown to decrease as the number of turns increases, suggesting a degraded focus in extended conversations.
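The quantity behind these visualizations, the share of attention that lands on special tokens, can be computed as in the sketch below. It is a toy illustration under stated assumptions: the token sequence and attention weights are made up, `special_token_attention_share` is a hypothetical helper, and a real analysis would read the weights from the model's attention outputs rather than from a hand-written list.

```python
from typing import List, Set


def special_token_attention_share(
    tokens: List[str],
    attention: List[float],
    special_tokens: Set[str],
) -> float:
    """Fraction of total attention mass assigned to special-token positions."""
    if len(tokens) != len(attention):
        raise ValueError("tokens and attention must have the same length")
    total = sum(attention)
    if total == 0:
        return 0.0
    special = sum(w for tok, w in zip(tokens, attention) if tok in special_tokens)
    return special / total


if __name__ == "__main__":
    # Made-up two-turn context and made-up attention weights for one query position.
    tokens = ["Who", "scored", "?", "<|im_end|>", "Player", "A", "<|im_end|>"]
    attention = [0.05, 0.10, 0.05, 0.30, 0.10, 0.10, 0.30]
    share = special_token_attention_share(tokens, attention, {"<|im_end|>"})
    print(f"{share:.2f}")  # 0.60
```

In this toy case two separator tokens out of seven positions receive 60% of the attention, which mirrors the pattern described above: as turns accumulate, so do separators, leaving less attention mass for semantic content.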
@article{MARS-Bench,
title={MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation},
author={Chenghao Yang and Yinbo Luo and Zhoufutu Wen and Qi Chu and Tao Gong and Longxiang Liu and Kaiyuan Zhang and Jianpeng Jiao and Ge Zhang and Wenhao Huang and Nenghai Yu},
year={2025},
eprint={2505.23810},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.23810},
}