Description
vLLM accelerates generation by 5× on H800, but the output quality degrades significantly.
Observed Issues
- Stage 1: As the sequence length increases, the generated audio gradually turns into noise (e.g., after ~30s).
- Stage 2: More invalid token IDs are observed when using vLLM.
Expected Behavior
- The generated audio should maintain quality matching the default Hugging Face Transformers implementation, regardless of sequence length.
- No increase in invalid token IDs in Stage 2.
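To make the Stage 2 regression measurable rather than anecdotal, a simple check is to count generated IDs that fall outside the audio codebook range for both backends. This is a minimal sketch; `CODEBOOK_SIZE` and the token lists are placeholders, not values from the YuE repo.

```python
# Hypothetical sketch: count token IDs outside the valid codebook range.
# CODEBOOK_SIZE is an assumed placeholder, not the real YuE vocabulary size.
CODEBOOK_SIZE = 1024

def count_invalid(token_ids, codebook_size=CODEBOOK_SIZE):
    """Return how many generated IDs fall outside [0, codebook_size)."""
    return sum(1 for t in token_ids if t < 0 or t >= codebook_size)

hf_ids = [3, 17, 1023, 512]         # toy HF output, all valid
vllm_ids = [3, 17, 2048, -1, 512]   # toy vLLM output with two invalid IDs

print(count_invalid(hf_ids))    # 0
print(count_invalid(vllm_ids))  # 2
```

Running this over matched prompts on both branches would show whether the invalid-ID rate grows with sequence length, which would line up with the Stage 1 noise onset around ~30s.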
Possible Causes
The issue is likely in the LM part, not the audio tokenizer or GAN.
Potential causes:
- Positional encoding misalignment?
- PagedAttention numerical inaccuracy?
- Decoding hyperparameter misalignment?
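The decoding-hyperparameter hypothesis is the cheapest to rule out: dump both backends' sampling settings and diff them. A minimal sketch, with placeholder values rather than the actual YuE configs:

```python
# Hypothetical sketch: diff decoding hyperparameters between the HF and
# vLLM runs. The values below are placeholders, not the real configs.
hf_cfg = {"temperature": 1.0, "top_p": 0.93, "top_k": 50,
          "repetition_penalty": 1.2}
vllm_cfg = {"temperature": 1.0, "top_p": 0.93, "top_k": -1,
            "repetition_penalty": 1.0}

def diff_cfg(a, b):
    """Return {param: (a_value, b_value)} for every mismatched parameter."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

for name, (hf_v, vl_v) in sorted(diff_cfg(hf_cfg, vllm_cfg).items()):
    print(f"{name}: HF={hf_v} vLLM={vl_v}")
```

Note that the two libraries use different defaults and sentinel values (e.g. vLLM's `SamplingParams` uses `top_k=-1` to disable top-k), so identical-looking configs can still decode differently.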
Steps to Reproduce
- A vllm branch has been created. @hf-lin will adapt reproducible vLLM inference code based on Hugging Face.
- A command to compare vLLM COT (vllm branch) vs HF COT (main branch) implementations will be added here. @hf-lin
Additional Context
- YuE System Overview: We generate lyrics-to-song sequences with interleaved text conditions and audio tokens.
- Dual-Token Strategy:
- One token represents the vocal track at the current frame.
- One token represents the instrumental accompaniment at the current frame.
See system diagram.
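For reviewers unfamiliar with the dual-token layout, the interleaving described above can be sketched as follows. This is a toy illustration of the frame-wise vocal/instrumental alternation; the helper names and token values are hypothetical, not from the YuE codebase.

```python
# Hypothetical sketch of the dual-token stream: each frame contributes
# one vocal token followed by one instrumental token.
def interleave(vocal, inst):
    """[v0, i0, v1, i1, ...] — one vocal + one instrumental token per frame."""
    assert len(vocal) == len(inst)
    out = []
    for v, i in zip(vocal, inst):
        out.extend([v, i])
    return out

def deinterleave(stream):
    """Split an interleaved stream back into (vocal, instrumental) tracks."""
    return stream[0::2], stream[1::2]

seq = interleave([10, 11, 12], [900, 901, 902])
print(seq)  # [10, 900, 11, 901, 12, 902]
assert deinterleave(seq) == ([10, 11, 12], [900, 901, 902])
```

Because the two tracks alternate at a fixed stride, any off-by-one in position handling (e.g. in the KV cache or positional encoding) would swap vocal and instrumental slots, which could plausibly surface as the noise observed in Stage 1.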
