System Info
Transformers version: 5.6.0.dev0
Platform: macOS-26.2-arm64-arm-64bit-Mach-O
Python version: 3.13.5 (v3.13.5:6cb20a219a8, Jun 11 2025, 12:23:45) [Clang 16.0.0 (clang-1600.0.26.6)]
PyTorch version: 2.11.0
CUDA available: False
MPS available: True
Who can help?
@Cyrilvallez @remi-or
Reproduction
Summary
generate(..., cache_implementation="paged") still warns that num_return_sequences is unsupported for continuous batching.
That warning looks stale: generate_batch() already uses generation_config.num_return_sequences to expand the number of requests.
Minimal reproduction
import torch
from transformers import GenerationConfig, PretrainedConfig
from transformers.generation.utils import GenerationMixin
from transformers.generation.continuous_batching.requests import GenerationOutput, RequestStatus


class DummyContinuousBatchingGenerateModel(GenerationMixin):
    def __init__(self):
        self.config = PretrainedConfig()
        self.generation_config = GenerationConfig()
        self.device = torch.device("cpu")

    def generate_batch(self, inputs, generation_config=None, **kwargs):
        num_return_sequences = generation_config.num_return_sequences or 1
        return {
            f"req_{i}": GenerationOutput(
                request_id=f"req_{i}",
                prompt_ids=inputs[0],
                generated_tokens=[10 + i],
                status=RequestStatus.FINISHED,
            )
            for i in range(num_return_sequences)
        }


model = DummyContinuousBatchingGenerateModel()
model.generate(
    inputs=torch.tensor([[1, 2, 3]]),
    cache_implementation="paged",
    do_sample=True,
    num_return_sequences=2,
)
On current main, this still emits:
num_return_sequences and num_beams are not supported for continuous batching yet.
I have a draft fix here:
oleksii-tumanov@f7a939d
Expected behavior
For valid generate(..., cache_implementation="paged") calls:
- keep warning for num_beams > 1
- stop warning for valid num_return_sequences cases
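
To make the two bullets above concrete, here is a minimal sketch of the check I have in mind. The helper name unsupported_paged_options is hypothetical and is not an existing transformers function; the real check lives inside generate() and may be structured differently. The sketch only illustrates which option should still trigger the warning.

```python
from typing import List, Optional


def unsupported_paged_options(
    num_beams: Optional[int] = None,
    num_return_sequences: Optional[int] = None,
) -> List[str]:
    """Hypothetical helper: list the options that should still warn under paged generation.

    Assumption: beam search (num_beams > 1) is the only one of the two options
    that continuous batching does not handle. num_return_sequences is accepted
    here only to show that it no longer contributes to the warning.
    """
    unsupported = []
    # Treat None as the default of 1 (no beam search).
    if (num_beams or 1) > 1:
        unsupported.append("num_beams")
    return unsupported


# No warning expected: sampling with several return sequences.
print(unsupported_paged_options(num_beams=1, num_return_sequences=2))
# Warning expected: beam search is requested.
print(unsupported_paged_options(num_beams=4, num_return_sequences=1))
```

With a check like this, the warning text could also name only the offending option instead of mentioning num_return_sequences.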