MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

Chenghao Yang1,2*, Yinbo Luo2*, Zhoufutu Wen1, Qi Chu2, Tao Gong2, Longxiang Liu1, Kaiyuan Zhang1, Jianpeng Jiao1, Ge Zhang1, Wenhao Huang1, Nenghai Yu2

1ByteDance Seed, 2University of Science and Technology of China

*Equal Contribution
Corresponding Authors
Work done at ByteDance Seed

Figure: MARS-Bench overview.

Introduction

Large Language Models (LLMs) have achieved remarkable progress in conversational AI, yet their robustness remains limited when handling complex multi-turn dialogues that require retrieving evidence from distant utterances and reasoning across scattered information while adapting to frequent task switches. To address gaps in existing benchmarks that focus on short conversations or provide full dialogue history upfront, we propose MARS-Bench, a multi-turn dialogue benchmark constructed from real-world play-by-play sports data. MARS-Bench emphasizes three key features: Ultra Multi-turn Dialogues with over 30 turns per instance, Cross-turn Tasks requiring reasoning over non-adjacent information, and Interactive Multi-turn Generation where LLMs must respond at every turn. The benchmark defines four core tasks: Instruction Following (IF), Context Retrieval (CR), Information Reasoning (IR), and Task Switching (TS).
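
The interactive setting described above, in which the model must answer at every turn and its own earlier replies become part of the history, can be sketched as follows. This is a minimal illustration rather than the benchmark's evaluation harness: `run_dialogue`, the message layout, the example turns, and the `echo_model` stand-in are all hypothetical names introduced here.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # assumed layout: {"role": ..., "content": ...}


def run_dialogue(
    system_prompt: str,
    user_turns: List[str],
    model: Callable[[List[Message]], str],
) -> List[Message]:
    """Feed user turns one at a time; the model replies after every turn."""
    history: List[Message] = [{"role": "system", "content": system_prompt}]
    for user_text in user_turns:
        history.append({"role": "user", "content": user_text})
        # The model sees only what has happened so far, including its own
        # earlier replies; later turns are never revealed upfront.
        reply = model(list(history))
        history.append({"role": "assistant", "content": reply})
    return history


def echo_model(messages: List[Message]) -> str:
    """Stand-in for an LLM call: reports how many user turns it has seen."""
    seen = sum(1 for m in messages if m["role"] == "user")
    return f"ack turn {seen}"


# Toy turns, invented for illustration only.
transcript = run_dialogue(
    "You are a sports commentary assistant.",
    ["Q1 09:12 Smith scores.", "Q1 08:40 Jones fouls.", "Who scored first?"],
    echo_model,
)
```

Because each reply is appended before the next user turn arrives, an error made early in the dialogue stays in the context for every later turn, which is what distinguishes this setting from supplying the full history upfront.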

📖 Dataset Process Pipeline

MARS-Bench Data Format

Figure: Overview of the data format.

Each sample in MARS-Bench includes (1) detailed play-by-play records, (2) team rosters, and (3) player statistics. The play-by-play records and team rosters are the direct inputs to the language model, while the player statistics are used to verify the correctness of model-generated answers.
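
As a rough sketch of how such a sample could be represented, the snippet below separates the model-facing inputs from the statistics used for checking. All class names, field names, and values are hypothetical; the actual MARS-Bench schema is not specified in this section.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PlayEvent:
    period: int       # 1-based period or quarter index
    clock: str        # game clock as text, e.g. "09:12" (assumed format)
    team: str
    description: str


@dataclass
class GameSample:
    play_by_play: List[PlayEvent]            # shown to the model
    rosters: Dict[str, List[str]]            # shown to the model: team -> players
    player_stats: Dict[str, Dict[str, int]]  # used only for answer checking here

    def model_inputs(self) -> dict:
        """Return only the parts that are passed to the model."""
        return {"play_by_play": self.play_by_play, "rosters": self.rosters}


# Toy values, invented for illustration only.
sample = GameSample(
    play_by_play=[PlayEvent(1, "09:12", "Home", "Smith makes a layup")],
    rosters={"Home": ["Smith"], "Away": ["Jones"]},
    player_stats={"Smith": {"points": 2}},
)
inputs = sample.model_inputs()
```

Keeping the statistics out of `model_inputs` mirrors the split described above: the model has to derive answers from the play-by-play text, and the statistics act as the reference.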

MARS-Bench Data Construction Pipeline

Figure: Overview of the data construction pipeline.

MARS-Bench is constructed in a three-stage pipeline: (1) Data Collection, which gathers sports data, including game events and commentary; (2) Question Construction, which generates (Question, Answer, Checklist) triplets designed to probe various reasoning and dialogue capabilities; and (3) Quality Assurance, a review process that checks the correctness, clarity, and difficulty level of all benchmark tasks.
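
To make the (Question, Answer, Checklist) triplet concrete, here is a minimal sketch of one possible representation together with a toy scoring rule. Both are assumptions made for illustration: the section does not say how checklists are scored, and the substring matching used in `checklist_score` is a simple stand-in, not the benchmark's grading procedure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QATriplet:
    question: str
    answer: str
    checklist: List[str]  # points a correct response is expected to contain


def checklist_score(response: str, triplet: QATriplet) -> float:
    """Fraction of checklist items found in the response (assumed rule)."""
    if not triplet.checklist:
        return 0.0
    text = response.lower()
    hits = sum(1 for item in triplet.checklist if item.lower() in text)
    return hits / len(triplet.checklist)


# Toy triplet, invented for illustration only.
triplet = QATriplet(
    question="How many lead changes occurred in the first period, and when?",
    answer="2 lead changes, at 08:15 and 03:02.",
    checklist=["2 lead changes", "08:15", "03:02"],
)
score = checklist_score("There were 2 lead changes, at 08:15 and 03:40.", triplet)
```

In this toy case the response satisfies two of the three checklist items, so partial credit is possible even when the final answer is not fully correct.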

🏅Leaderboard

We evaluated 23 representative models on MARS-Bench, including both closed-source and open-source LLMs with and without explicit reasoning capabilities.



Annotations:

  • 🥇: Top 1 on Overall Score
  • 🥈: Top 2 on Overall Score
  • 🥉: Top 3 on Overall Score
  • Bold: Top 1 on task-specific scores
  • Underlined: Top 2 on task-specific scores

💡 Experiment Results

Based on the evaluation results of various representative models on MARS-Bench, we summarize the following key observations.

LLMs Struggle in Complex Multi-turn Dialogues

Even the top models score only around 70 points overall, and performance decreases as the number of dialogue turns increases. This highlights limitations in handling extended multi-turn interactions. Lower scores on the Instruction Following (IF), Information Reasoning (IR), and particularly Task Switching (TS) tasks further underscore deficiencies in cross-turn context management and interactive scenarios.

Closed-Source Models Lead in Complex Multi-Turn Scenarios

In challenging multi-turn dialogue tasks, closed-source models consistently outperform their open-source counterparts. For example, Google's Gemini-2.5-Pro achieves an overall score of 72.44 on MARS-Bench, while the top open-source model, DeepSeek-R1, reaches just 45.42. This gap suggests that open-source models often lack the scale and targeted optimization needed for intricate information reasoning and task switching.

Reasoning Models Demonstrate Greater Performance

Models with chain-of-thought reasoning tend to engage in more deliberate, System 2-style inference, exhibiting higher consistency and correctness. In contrast, models relying on System 1-style heuristic generation are more susceptible to variations in task complexity and context, leading to weaker overall performance. For instance, DeepSeek-R1 (45.42) outperforms DeepSeek-V3 (37.31) by 8.11 points overall, and by over 12 points on information reasoning (40.01 vs. 27.70).
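
The score gaps quoted in these observations can be re-derived directly from the reported numbers. The short check below uses only figures stated in the text; the variable names are introduced here for illustration.

```python
# Scores quoted in the observations above.
overall = {"Gemini-2.5-Pro": 72.44, "DeepSeek-R1": 45.42, "DeepSeek-V3": 37.31}
info_reasoning = {"DeepSeek-R1": 40.01, "DeepSeek-V3": 27.70}

# Closed-source vs. top open-source model, overall.
closed_vs_open_gap = round(overall["Gemini-2.5-Pro"] - overall["DeepSeek-R1"], 2)

# Reasoning vs. standard model from the same family.
overall_gap = round(overall["DeepSeek-R1"] - overall["DeepSeek-V3"], 2)
ir_gap = round(info_reasoning["DeepSeek-R1"] - info_reasoning["DeepSeek-V3"], 2)

print(closed_vs_open_gap)  # 27.02
print(overall_gap)         # 8.11
print(ir_gap)              # 12.31
```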

Models Perform Worse on the Instruction Following Task

Both reasoning-enhanced and standard models perform relatively poorly on instruction following. Current models struggle to track the turn-level structure required by system prompts, often failing to produce the correct output at the specified dialogue turn. This suggests limitations in aligning generation behavior with turn-dependent instructions.

📊 Task Categories

Each game segment/period refers to one of the three periods in NHL games or one of the four quarters in NBA games (excluding overtime).

Instruction Following

Follow turn-specific instructions with format constraints.

  • Fixed-format Single-Turn Response (1 / period): Follow the format specified for the current dialogue turn.
  • Turn-conditioned Prompted Formatting (8 / period): Adapt the response format according to system instructions at each turn.
  • Turn-conditioned Inferred Formatting (1 / period): Adjust the response format based on instructions inferred from prior dialogue turns.
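
One way to picture a turn-conditioned format check is shown below: a rule table maps specific turn indices to a required output pattern, and each reply is checked only at its designated turn. The rule table, the patterns, and `check_turn_formats` are hypothetical and are not taken from the benchmark.

```python
import re
from typing import Dict, List

# Hypothetical rule set: 1-based turn index -> regex the reply must fully match.
FORMAT_RULES: Dict[int, str] = {
    3: r"\[SUMMARY\] .+",
    5: r"\{.*\}",
}


def check_turn_formats(replies: List[str], rules: Dict[int, str]) -> Dict[int, bool]:
    """For each constrained turn, report whether the reply matches its format."""
    results: Dict[int, bool] = {}
    for turn, pattern in rules.items():
        # A missing reply counts as a failure for that turn.
        reply = replies[turn - 1] if turn <= len(replies) else ""
        results[turn] = re.fullmatch(pattern, reply, flags=re.DOTALL) is not None
    return results


# Toy replies, invented for illustration only.
replies = [
    "Noted.",
    "Noted.",
    "[SUMMARY] Home team leads by four.",
    "Noted.",
    "The score is 12-8.",  # turn 5 was supposed to be a brace-delimited object
]
report = check_turn_formats(replies, FORMAT_RULES)
```

Here the model applies the right format at turn 3 but misses the constraint at turn 5, which is the kind of turn-tracking failure this task is meant to expose.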

Context Retrieval

Locate and retrieve factual information from previous dialogue turns.

  • Anchored Event Retrieval (2 / period): Given a time anchor and interval, retrieve a specific event.
  • Interval-based Event Retrieval (1 / period): Given a start and end time, retrieve events of a specific type.
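
The two retrieval subtasks can be sketched over a list of play-by-play events. This is an illustrative reading only: it assumes times are stored as seconds elapsed within a period, and `Event`, `events_in_interval`, and `first_event_after` are hypothetical names.

```python
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    period: int
    elapsed_s: int  # seconds elapsed in the period (assumed representation)
    kind: str       # e.g. "shot", "foul"
    text: str


def events_in_interval(
    events: List[Event], period: int, start_s: int, end_s: int, kind: str
) -> List[Event]:
    """Interval-based retrieval: events of one type between two times."""
    return [
        e for e in events
        if e.period == period and start_s <= e.elapsed_s <= end_s and e.kind == kind
    ]


def first_event_after(
    events: List[Event], period: int, anchor_s: int, within_s: int
) -> Optional[Event]:
    """Anchored retrieval: earliest event strictly after the anchor, within a window."""
    candidates = [
        e for e in events
        if e.period == period and anchor_s < e.elapsed_s <= anchor_s + within_s
    ]
    return min(candidates, key=lambda e: e.elapsed_s) if candidates else None


# Toy events, invented for illustration only.
events = [
    Event(1, 45, "shot", "Lee misses a three-pointer"),
    Event(1, 80, "foul", "Park commits a personal foul"),
    Event(1, 130, "shot", "Lee makes a layup"),
    Event(2, 20, "shot", "Kim makes a jump shot"),
]
shots = events_in_interval(events, period=1, start_s=0, end_s=120, kind="shot")
nxt = first_event_after(events, period=1, anchor_s=45, within_s=60)
```

A model has no such index to query; it must locate the same events in text spread across earlier dialogue turns.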

Information Reasoning

Aggregate and reason over distributed contextual information.

  • Current Score Tracking (1, last period): Provide the current score for both teams.
  • Score Lead Fluctuation Detection (1 / period): Identify the number and timing of score lead changes between the two teams within a given time period.
  • Player Performance Impact Analysis (2 / period): Given a time span, analyze how a change in a player's performance affected the game situation.
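
Score tracking and lead-change detection both reduce to a running aggregation over scoring plays. The sketch below is one possible formulation under stated assumptions: a lead change is counted when the team holding a strict lead differs from the previous team that held one, and ties are not counted as changes. The function name and the tuple layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

ScoringPlay = Tuple[int, str, int]  # (elapsed seconds, team, points), assumed layout


def track_score_and_lead_changes(
    plays: List[ScoringPlay], home: str, away: str
) -> Tuple[Dict[str, int], List[int]]:
    """Return the final score and the times at which the lead switched teams."""
    score = {home: 0, away: 0}
    last_leader: Optional[str] = None  # most recent team to hold a strict lead
    lead_changes: List[int] = []
    for t, team, pts in sorted(plays):
        score[team] += pts
        if score[home] == score[away]:
            continue  # assumption: a tie is not counted as a lead change
        leader = home if score[home] > score[away] else away
        if last_leader is not None and leader != last_leader:
            lead_changes.append(t)
        last_leader = leader
    return score, lead_changes


# Toy scoring plays, invented for illustration only.
plays = [(30, "A", 2), (75, "B", 3), (120, "A", 2), (200, "B", 2), (260, "B", 3)]
final_score, changes = track_score_and_lead_changes(plays, home="A", away="B")
```

In this toy sequence the lead switches at 75, 120, and 200 seconds and the final score is 4 to 8. Answering the same question in a dialogue requires carrying this running state across many turns rather than reading it from a single utterance.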

Task Switching

Handle abrupt interleaving of unrelated queries.

  • In-context Reasoning Query (3 / period): Ask questions related to the match.
  • Out-of-context Math Query (3 / period): Ask unrelated mathematical questions from MathQA.
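
The interleaving of match-related and unrelated queries can be illustrated with a simple alternating schedule. How MARS-Bench actually orders these queries is not stated in this section, so the strict alternation below, the `interleave_queries` helper, and the example questions are assumptions.

```python
from itertools import chain, zip_longest
from typing import List, Tuple

TaggedQuery = Tuple[str, str]  # (query type, query text)


def interleave_queries(
    match_queries: List[str], math_queries: List[str]
) -> List[TaggedQuery]:
    """Alternate in-context and out-of-context queries, keeping any leftovers."""
    tagged_match = [("in_context", q) for q in match_queries]
    tagged_math = [("out_of_context", q) for q in math_queries]
    pairs = zip_longest(tagged_match, tagged_math)
    return [item for item in chain.from_iterable(pairs) if item is not None]


# Toy queries, invented for illustration only.
schedule = interleave_queries(
    ["Who leads right now?", "Who scored last?", "How many fouls so far?"],
    ["What is 15% of 240?", "Solve 3x + 5 = 20.", "What is 7 squared?"],
)
kinds = [kind for kind, _ in schedule]
switches = sum(1 for a, b in zip(kinds, kinds[1:]) if a != b)
```

With three queries of each type, strict alternation produces five task switches within the period, so the model repeatedly has to leave and re-enter the match context.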

🔬 Attention Visualization Analysis

To better understand how Large Language Models process multi-turn dialogues, we conducted an attention visualization analysis on Qwen2.5-7B-Instruct. This helps reveal how the structure of the input (e.g., number of turns, presence of special tokens) influences where the model focuses its attention during text generation.

Key Findings:

  • Impact of Dialogue Turns: As the number of dialogue turns increases (from 1 to 10, then to 20 turns), the model's attention towards key semantic content within the dialogue history tends to decrease. This observation aligns with the performance degradation in longer dialogues, suggesting that an increased number of turns might hinder effective context integration.
  • Attention to Special Tokens: Multi-turn dialogues inherently require more special tokens (e.g., to denote the end of a turn like <|im_end|>). Our analysis shows that these special tokens can attract a disproportionately high amount of attention. This might divert the model's focus from the actual meaningful content, potentially impacting task performance.

The attention visualizations below illustrate these patterns. One visualization highlights how special tokens, such as turn separators (e.g., <|im_end|>), can become hotspots for model attention. Another demonstrates that as dialogues become longer, the model's attention to crucial semantic content tends to diminish. These insights suggest that the input structure for multi-turn dialogues significantly affects model behavior and performance.

Figure 1: Attention on Special Tokens in Qwen2.5-7B-Instruct. Multi-turn inputs introduce a higher volume of special tokens (e.g., <|im_end|>), which are observed to absorb a substantial portion of the model's attention, potentially reducing focus on semantic content.
Figure 2: Attention Decay on Key Content in Qwen2.5-7B-Instruct. The model's attention to essential semantic information within the dialogue history is shown to decrease as the number of turns increases, suggesting a degraded focus in extended conversations.
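
The quantity behind the special-token observation can be expressed as the share of attention mass that lands on special-token positions. The sketch below works on a single row of attention weights (one query position over all key positions) and uses toy values; it is not the analysis code used for the figures. The `SPECIAL_TOKENS` set and the function name are assumptions, with `<|im_start|>` added alongside the `<|im_end|>` token mentioned above.

```python
from typing import Sequence

# Assumed set of chat-template special tokens.
SPECIAL_TOKENS = {"<|im_start|>", "<|im_end|>"}


def special_token_attention_share(
    tokens: Sequence[str], attn_row: Sequence[float]
) -> float:
    """Fraction of one query position's attention that falls on special tokens."""
    if len(tokens) != len(attn_row):
        raise ValueError("tokens and attn_row must have the same length")
    total = sum(attn_row)
    if total == 0:
        return 0.0
    special = sum(w for tok, w in zip(tokens, attn_row) if tok in SPECIAL_TOKENS)
    return special / total


# Toy tokens and weights, invented for illustration only.
tokens = ["<|im_start|>", "user", "Who", "scored", "?", "<|im_end|>"]
attn_row = [0.30, 0.05, 0.10, 0.20, 0.05, 0.30]
share = special_token_attention_share(tokens, attn_row)
```

In this toy row, 60% of the attention goes to the two special tokens. Since each added turn brings more such tokens, tracking this share as the turn count grows is one way to quantify the effect described in Figure 1.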

BibTeX


          @article{MARS-Bench,
          title={MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation}, 
          author={Chenghao Yang and Yinbo Luo and Zhoufutu Wen and Qi Chu and Tao Gong and Longxiang Liu and Kaiyuan Zhang and Jianpeng Jiao and Ge Zhang and Wenhao Huang and Nenghai Yu}, 
          year={2025},
          eprint={2505.23810},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2505.23810}, 
      }