
Fixing vLLM: Incorrect Generation Results #66

@a43992899

Description


vLLM accelerates generation by 5× on an H800, but output quality degrades significantly.

Observed Issues

  • Stage 1: As the sequence length increases, the generated audio gradually turns into noise (e.g., after ~30s).
  • Stage 2: More invalid token IDs are observed when using vLLM.

Expected Behavior

  • The generated audio should maintain quality matching the default Hugging Face Transformers implementation, regardless of sequence length.
  • No increase in invalid token IDs in Stage 2.

Possible Causes

The issue is likely in the LM part, not the audio tokenizer or GAN.
Potential causes:

  • Positional encoding misalignment?
  • PagedAttention numerical inaccuracy?
  • Decoding hyperparameter misalignment?
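The third cause is the cheapest to rule out first: dump the decoding hyperparameters actually used by each backend and diff them. A minimal sketch (parameter names and values below are illustrative, not YuE's real settings; in practice, read them from the HF GenerationConfig and the vLLM SamplingParams of each run):

```python
def diff_decoding_configs(hf_cfg: dict, vllm_cfg: dict) -> dict:
    """Return {param: (hf_value, vllm_value)} for every parameter that differs."""
    keys = set(hf_cfg) | set(vllm_cfg)
    return {k: (hf_cfg.get(k), vllm_cfg.get(k))
            for k in sorted(keys)
            if hf_cfg.get(k) != vllm_cfg.get(k)}

# Illustrative values only.
hf_cfg = {"temperature": 1.0, "top_p": 0.93, "repetition_penalty": 1.2}
vllm_cfg = {"temperature": 1.0, "top_p": 0.95, "repetition_penalty": 1.0}
print(diff_decoding_configs(hf_cfg, vllm_cfg))
# {'repetition_penalty': (1.2, 1.0), 'top_p': (0.93, 0.95)}
```

An empty diff means the remaining suspects are positional encoding and attention-kernel numerics rather than sampling settings.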

Steps to Reproduce

  1. A vllm branch has been created. @hf-lin will adapt the Hugging Face inference code into a reproducible vLLM implementation.
  2. A command to compare the vLLM CoT (vllm branch) and HF CoT (main branch) implementations will be added here. @hf-lin
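Once both implementations run on the same prompt with greedy decoding, comparing the raw token IDs is the fastest way to localize the bug: a divergence near the ~30s mark would point at positional encoding or KV-cache handling rather than sampling. A hypothetical sketch (token values are made up):

```python
def first_divergence(hf_ids, vllm_ids):
    """Index of the first differing token between two greedy decodes,
    or None if the compared prefixes are identical."""
    for i, (a, b) in enumerate(zip(hf_ids, vllm_ids)):
        if a != b:
            return i
    return None

hf_out = [5, 17, 902, 344, 81]
vllm_out = [5, 17, 902, 350, 12]
print(first_divergence(hf_out, vllm_out))  # 3
```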

Additional Context

  • YuE System Overview: We generate lyrics-to-song sequences with interleaved text conditions and audio tokens.
  • Dual-Token Strategy:
    • One token represents vocal at the current frame.
    • One token represents instrumental accompaniment at the current frame.
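The dual-token layout above can be sketched as follows (token values are made up for illustration; this is not the actual YuE tokenizer code):

```python
def interleave_frames(vocal_ids, inst_ids):
    """Interleave per-frame tokens into [v0, i0, v1, i1, ...]:
    one vocal token and one accompaniment token per audio frame."""
    assert len(vocal_ids) == len(inst_ids), "one token of each type per frame"
    out = []
    for v, a in zip(vocal_ids, inst_ids):
        out.append(v)
        out.append(a)
    return out

def deinterleave_frames(ids):
    """Split an interleaved sequence back into (vocal, accompaniment) tracks."""
    return ids[0::2], ids[1::2]

seq = interleave_frames([101, 102], [201, 202])
print(seq)                       # [101, 201, 102, 202]
print(deinterleave_frames(seq))  # ([101, 102], [201, 202])
```

One consequence of this layout: a single invalid token ID corrupts a whole frame pair on deinterleaving, which is consistent with Stage 2 degrading noticeably when invalid IDs appear.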

See system diagram.


Metadata



Labels

bug (Something isn't working), enhancement (New feature or request), help wanted (Extra attention is needed)
