Paged generate() emits a stale warning for num_return_sequences #45563

@oleksii-tumanov

System Info

Transformers version: 5.6.0.dev0
Platform: macOS-26.2-arm64-arm-64bit-Mach-O
Python version: 3.13.5 (v3.13.5:6cb20a219a8, Jun 11 2025, 12:23:45) [Clang 16.0.0 (clang-1600.0.26.6)]
PyTorch version: 2.11.0
CUDA available: False
MPS available: True

Who can help?

@Cyrilvallez @remi-or

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Summary

generate(..., cache_implementation="paged") still warns that num_return_sequences is unsupported for continuous batching.

That warning looks stale: generate_batch() already uses generation_config.num_return_sequences to expand the number of requests.
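To make "expand the number of requests" concrete, here is a toy sketch of the idea. It is not the library's implementation: `expand_requests` and its request-id scheme are hypothetical names used only for illustration.

```python
def expand_requests(prompts: list[list[int]], num_return_sequences: int = 1) -> dict[str, list[int]]:
    """Toy illustration: create one request per (prompt, returned sequence) pair."""
    requests = {}
    for prompt_idx, prompt_ids in enumerate(prompts):
        for seq_idx in range(num_return_sequences):
            # Hypothetical id scheme; the real request ids may look different.
            requests[f"req_{prompt_idx}_{seq_idx}"] = list(prompt_ids)
    return requests


# One prompt with num_return_sequences=2 becomes two independent requests.
expanded = expand_requests([[1, 2, 3]], num_return_sequences=2)
```

If requests are expanded this way, each returned sequence is just another entry in the batch, which is why the warning no longer seems to apply.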

Minimal reproduction

import torch
from transformers import GenerationConfig, PretrainedConfig
from transformers.generation.utils import GenerationMixin
from transformers.generation.continuous_batching.requests import GenerationOutput, RequestStatus

class DummyContinuousBatchingGenerateModel(GenerationMixin):
    # Minimal stand-in: just enough attributes for generate() to reach the paged path.
    def __init__(self):
        self.config = PretrainedConfig()
        self.generation_config = GenerationConfig()
        self.device = torch.device("cpu")

    # Stub for generate_batch(): returns one finished request per requested sequence.
    def generate_batch(self, inputs, generation_config=None, **kwargs):
        num_return_sequences = generation_config.num_return_sequences or 1
        return {
            f"req_{i}": GenerationOutput(
                request_id=f"req_{i}",
                prompt_ids=inputs[0],
                generated_tokens=[10 + i],
                status=RequestStatus.FINISHED,
            )
            for i in range(num_return_sequences)
        }

model = DummyContinuousBatchingGenerateModel()
model.generate(
    inputs=torch.tensor([[1, 2, 3]]),
    cache_implementation="paged",
    do_sample=True,
    num_return_sequences=2,
)

On current main, this still emits:
num_return_sequences and num_beams are not supported for continuous batching yet.

I have a draft fix here:
oleksii-tumanov@f7a939d

Expected behavior

For valid generate(..., cache_implementation="paged") calls:

  • keep the warning when num_beams > 1
  • stop warning when only num_return_sequences is set (a valid case, since generate_batch() already handles it)
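The expected behavior can be sketched as a small check. This is only an illustration of the proposed condition, not the code in the draft fix: `unsupported_paged_options` is a hypothetical helper, and where the real check lives inside generate() is not shown here.

```python
from typing import Optional


def unsupported_paged_options(
    num_beams: Optional[int], num_return_sequences: Optional[int]
) -> list[str]:
    """Hypothetical helper: names of options that should still warn on the paged path."""
    unsupported = []
    # The existing warning lists num_beams as unsupported for continuous batching,
    # so num_beams > 1 keeps warning.
    if num_beams is not None and num_beams > 1:
        unsupported.append("num_beams")
    # num_return_sequences is deliberately not checked: per this report,
    # generate_batch() already uses it to expand the number of requests.
    return unsupported
```

With this condition, `unsupported_paged_options(1, 2)` returns an empty list (no warning), while `unsupported_paged_options(4, 1)` returns `["num_beams"]`.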
