Support reasoning_content and tool calling in openai message format dataset #1640

@akoumpa

Description

Is your feature request related to a problem? Please describe.

When fine-tuning models that use extended thinking (e.g., DeepSeek-R1, QwQ) or tool calling, the current ChatDataset pipeline has gaps in how it handles these OpenAI message format fields:

  1. reasoning_content — The _normalize_messages() function (chat_dataset.py:176) copies all message fields via out = dict(m), so a reasoning_content field would survive into the dict passed to
    apply_chat_template(). However, there is no explicit validation, documentation, or testing for this field. Whether it actually renders into the tokenized output depends entirely on the Jinja chat template —
    and most templates silently ignore it. Users training on reasoning traces have no way to know if their data is being used correctly or silently dropped.
  2. tool_calls on assistant messages — Partially supported. The field is preserved through normalization and passed to apply_chat_template(). However:
    - There is no validation that tool_calls entries have the required id, type, and function subfields.
    - The fallback loss-masking path (non-{% generation %} templates, formatting_utils.py:347-363) pops the last message and asserts it is role=assistant. This breaks for multi-turn tool-calling conversations
    where the last assistant message follows a tool response — the mask is only computed for the final assistant turn, not intermediate assistant turns that emit tool_calls.
    - content being empty string ("") or None on a tool-calling assistant message is not explicitly handled — str(None) produces "None" (chat_dataset.py:190).
  3. tool role messages — The role is accepted (chat_dataset.py:191), but there is no validation that tool_call_id is present, which is required by the OpenAI spec to correlate tool responses with their calls.
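The content-coercion issue in items 2 and 3 is easy to demonstrate. The sketch below uses hypothetical function names (`coerce_content_naive` mirrors the plain `str()` cast at chat_dataset.py:190; it is not the repo's actual code):

```python
def coerce_content_naive(msg: dict) -> str:
    # A plain str() cast over msg["content"]: a None/absent field
    # becomes the literal four-character string "None".
    return str(msg.get("content"))

def coerce_content_fixed(msg: dict) -> str:
    # Proposed behavior: a missing or None content normalizes to "".
    content = msg.get("content")
    return "" if content is None else str(content)

# Typical tool-calling assistant message with no content field at all:
tool_call_msg = {"role": "assistant", "tool_calls": [{"id": "call_1"}]}
print(repr(coerce_content_naive(tool_call_msg)))  # prints 'None'
print(repr(coerce_content_fixed(tool_call_msg)))  # prints ''
```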

Describe the solution you'd like

A. reasoning_content support

  1. In _normalize_messages(): Explicitly preserve reasoning_content (string) on assistant messages. If present, validate it is a string. Convert None to empty string rather than "None".
  2. Loss masking: When reasoning_content is present, the reasoning tokens should be supervised (included in loss) by default, with an option to mask them. This is critical — if users are training on reasoning
    traces, they expect the model to learn to produce them.
  3. Chat template awareness: Document that the user's Jinja chat template must handle the reasoning_content field (e.g., render it inside `<think>...</think>` tags). Provide an example template snippet.
  4. Validation/warning: If reasoning_content is present in the data but the chat template doesn't reference it, log a warning so users know their reasoning traces are being silently dropped.
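Item 4's warning could look roughly like the sketch below. All names here are hypothetical, not the repo's API, and the Jinja fragment assumes a `<think>`-tag rendering convention:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical Jinja fragment (assumed <think>-tag convention) showing an
# assistant turn that renders reasoning_content ahead of the answer.
REASONING_TEMPLATE_SNIPPET = (
    "{% if message.reasoning_content %}"
    "<think>{{ message.reasoning_content }}</think>"
    "{% endif %}{{ message.content }}"
)

def reasoning_would_be_dropped(chat_template: str, messages: list) -> bool:
    """Return True (and log a warning) when the data carries reasoning_content
    but the chat template never references the field."""
    has_reasoning = any(
        m.get("role") == "assistant" and m.get("reasoning_content")
        for m in messages
    )
    dropped = has_reasoning and "reasoning_content" not in chat_template
    if dropped:
        logger.warning(
            "reasoning_content found in data but not referenced by the chat "
            "template; reasoning traces will be silently dropped."
        )
    return dropped
```

A plain substring check is coarse (a template comment would defeat it), but it is cheap and catches the common case of a stock template that never mentions the field.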

B. Tool calling robustness

  1. Validate tool_calls structure: In _normalize_messages(), when an assistant message has tool_calls, validate each entry has id (str), type (str, typically "function"), and function (dict with name and
    arguments). Raise a clear error on malformed entries rather than letting them fail deep inside the tokenizer.
  2. Validate tool_call_id on tool messages: When role == "tool", validate that tool_call_id is present and is a string.
  3. Fix content: None handling: In _normalize_messages(), str(None) produces the literal string "None". For assistant messages with tool_calls, content is commonly None or absent — this should normalize to ""
    (empty string), not "None".
  4. Multi-turn loss masking: The fallback (non-{% generation %}) loss-masking path in format_chat_template() only masks the final assistant turn. For multi-turn tool-calling conversations, all assistant turns
    should be supervised. Recommend that tool-calling templates use {% generation %} blocks and document this requirement.
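The structural checks from items 1 and 2 might be sketched as follows (`validate_tool_fields` is an assumed name; the field requirements come from the OpenAI message spec described above):

```python
def validate_tool_fields(msg: dict) -> None:
    """Hypothetical validation sketch: raise a clear error on malformed
    tool-calling fields instead of failing deep inside the tokenizer."""
    role = msg.get("role")
    if role == "assistant":
        for i, call in enumerate(msg.get("tool_calls", [])):
            if not isinstance(call.get("id"), str):
                raise ValueError(f"tool_calls[{i}]: missing string 'id'")
            if not isinstance(call.get("type"), str):
                raise ValueError(f"tool_calls[{i}]: missing string 'type'")
            fn = call.get("function")
            if not (isinstance(fn, dict) and "name" in fn and "arguments" in fn):
                raise ValueError(
                    f"tool_calls[{i}]: 'function' must be a dict with "
                    "'name' and 'arguments'"
                )
    elif role == "tool" and not isinstance(msg.get("tool_call_id"), str):
        raise ValueError("tool message missing string 'tool_call_id'")
```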

C. Testing

Add unit tests covering:

  • Round-trip of reasoning_content through _normalize_messages() and format_chat_template()
  • tool_calls with valid and malformed structures
  • tool role with and without tool_call_id
  • content: None vs content: "" on tool-calling assistant messages
  • Multi-turn conversations: user -> assistant(tool_calls) -> tool -> assistant
  • Loss mask correctness for reasoning + tool-calling conversations
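The multi-turn masking expectation in the list above can be pinned down with a small sketch (`supervised_assistant_turns` is a hypothetical name): the current fallback path would supervise only the final assistant turn, while the desired behavior covers every assistant turn, including the intermediate one that only emits tool_calls.

```python
def supervised_assistant_turns(messages: list) -> list:
    """Hypothetical sketch: indices of turns whose tokens should be included
    in the loss. Every assistant turn is supervised, not just the last."""
    return [i for i, m in enumerate(messages) if m.get("role") == "assistant"]

conversation = [
    {"role": "user", "content": "Weather in Seattle?"},
    {"role": "assistant", "content": "", "tool_calls": [{"id": "call_1"}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "{\"condition\": \"rain\"}"},
    {"role": "assistant", "content": "It's raining."},
]
print(supervised_assistant_turns(conversation))  # prints [1, 3]
```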

D. Documentation

  • Update docs/guides/dataset-overview.md with reasoning_content example format
  • Add example dataset rows for reasoning + tool-calling combination
  • Document chat template requirements for both features

Example data format (target)

  {
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What's the weather in Seattle and should I bring an umbrella?"},
      {
        "role": "assistant",
        "reasoning_content": "The user wants weather info and advice. I should call get_weather first, then reason about umbrella need based on conditions.",
        "content": "",
        "tool_calls": [
          {
            "id": "call_1",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"city\": \"Seattle\"}"
            }
          }
        ]
      },
      {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": "{\"temperature\": 55, \"condition\": \"rain\", \"precipitation_chance\": 0.85}"
      },
      {
        "role": "assistant",
        "reasoning_content": "It's raining with 85% precipitation chance. Definitely should recommend an umbrella.",
        "content": "It's currently 55°F and raining in Seattle with an 85% chance of continued precipitation. Yes, definitely bring an umbrella!"
      }
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"]
          }
        }
      }
    ]
  }

Describe alternatives you've considered

  • Rely entirely on `<think>` tags in content: This is the current workaround for reasoning traces — users embed `<think>...</think>` in the content field. It works but is fragile (templates must not strip the tags),
    not interoperable with APIs that use the structured reasoning_content field, and makes separate loss masking of reasoning vs. answer tokens impossible.
  • Separate preprocessing script: Users could preprocess their data to flatten reasoning_content into content and validate tool fields before training. This works but pushes complexity onto users and is
    error-prone.

Metadata

Labels

enhancement (New feature or request)