SAMUeL-GEN: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion
This repository contains the official implementation of SAMUeL-GEN, a lightweight latent diffusion model for vocal-conditioned musical accompaniment generation. Our approach achieves 220× parameter reduction and 52× faster inference compared to state-of-the-art systems while maintaining competitive performance.
Paper: Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion, by Hei Shing Cheung and Boya Zhang.
arXiv:2507.19991 [cs.SD]
This project is licensed under the MIT License - see the LICENSE file for details.
- Ultra-lightweight architecture: Only 15M parameters vs. billions in existing models
- Novel soft alignment attention: Adaptively combines local and global temporal dependencies
- V-parameterization: More stable training compared to ε-prediction
- Real-time capable: Designed for consumer hardware deployment
- Latent space operation: Operates in compressed VAE latent space for efficiency
Our model introduces several key innovations:
- Soft Alignment Attention Mechanism: The core innovation is a soft alignment attention mechanism that dynamically balances local and global attention patterns through time-dependent weighting, so that the balance between local detail and global structure shifts over the course of the diffusion process.

- Dual-mode Attention (a minimal code sketch of both modes follows this list):

  - Local attention operates within sliding windows of size 16 to capture fine-grained temporal dependencies. The local attention for position $i$ is computed as:

    $$\text{LocalAttn}(i) = \text{softmax}\left(\frac{Q_i K_{j \in W(i)}^T}{\sqrt{d}}\right)V$$

    where $W(i)$ is the local window around position $i$, and $Q, K, V$ are derived from convolutional projections.

  - Global attention computes relationships across the entire sequence using rotary position embeddings (RoPE) for improved positional encoding. The global attention formula is:

    $$\text{GlobalAttn} = \text{softmax}\left(\frac{\text{RoPE}(Q)\,\text{RoPE}(K)^T}{\sqrt{d}}\right)V$$

- FiLM Conditioning: The architecture uses Feature-wise Linear Modulation (FiLM) layers to provide fine-grained temporal control and integrate timestep embeddings.

- V-objective Training: The model is trained to predict the velocity $v_t$, defined as:

  $$v_t = \alpha_t \epsilon - \sigma_t x_0$$
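Below is a minimal, self-contained sketch of the dual-mode attention with timestep-dependent mixing described above. It is illustrative only: the helper names, masking, and projections are simplified stand-ins for the actual implementation in `src/model.py` and `src/modules.py`, and RoPE is omitted for brevity. The timestep-dependent weight `alpha_t` is taken as an input here; one plausible schedule is sketched later in this README.

```python
import torch
import torch.nn.functional as F

def local_attention(q, k, v, window: int = 16):
    """Windowed attention: each position attends only to a local neighborhood W(i)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5         # (B, T, T)
    idx = torch.arange(q.shape[-2], device=q.device)
    outside = (idx[None, :] - idx[:, None]).abs() > window // 2   # True outside the window
    scores = scores.masked_fill(outside, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

def global_attention(q, k, v):
    """Full-sequence attention (RoPE on q and k is omitted in this sketch)."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def soft_alignment_attention(q, k, v, alpha_t: float, window: int = 16):
    """Blend local and global context with a timestep-dependent weight alpha_t in [0, 1]."""
    return alpha_t * local_attention(q, k, v, window) + (1 - alpha_t) * global_attention(q, k, v)

# Toy usage: batch of 2, sequence length 128, feature dimension 64.
q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
out = soft_alignment_attention(q, k, v, alpha_t=0.7)  # shape (2, 128, 64)
```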
```
SAMUeL-GEN/
├── src/
│   ├── config.py        # Model configuration and hyperparameters
│   ├── model.py         # Main diffusion model with soft alignment attention
│   ├── diffusion.py     # Diffusion process implementation
│   ├── modules.py       # Building blocks (UNet, attention, etc.)
│   ├── train.py         # Training script
│   ├── inference.py     # Inference and sampling
│   ├── sample.py        # Sampling utilities
│   ├── data_utils.py    # Data preprocessing and loading
│   ├── utils.py         # General utilities
│   └── check_utils.py   # Validation and testing utilities
├── results/             # Training results and visualizations
│   ├── loss_curve.png
│   └── mse_curve.png
├── README.md
└── LICENSE
```
- Python: 3.11
- PyTorch: 2.7
- CUDA: Optional, for GPU acceleration (recommended)
- Clone the repository:

```bash
git clone https://github.com/HaysonC/SAMUeL-GEN.git
cd SAMUeL-GEN
```

- Install dependencies:

```bash
# Install from the requirements file (or cd into src/ and run without the src/ prefix):
python -m venv venv
source venv/bin/activate  # On Windows use `venv\Scripts\activate`
pip install -r src/requirements.txt

# Or install core dependencies manually (quote the specifiers so the shell does not interpret >=):
pip install "torch>=2.0.0" "torchvision>=0.15.0" "torchaudio>=2.0.0"
pip install numpy matplotlib tqdm scipy soundfile
```

- Download the dataset:
The model expects encoded audio data. The default configuration uses the Kaggle dataset
boyazhangnb/encodedsongs.
- Training: Train the model from scratch

```bash
cd src
python train.py
```

- Inference: Generate music with a trained model

```bash
cd src
python inference.py
```

- Sampling: Run sampling utilities

```bash
cd src
python sample.py
```

To train the model from scratch:

```bash
cd src
python train.py
```

Key training parameters can be modified in `config.py`:

- `TRAINING_TIMESTEPS`: Number of diffusion steps (default: 800)
- `TRAINING_LR`: Learning rate (default: 3.5e-4)
- `TRAINING_EPOCHS`: Training epochs (default: 100)
- `MODEL_WINDOW_SIZE`: Local attention window size (default: 16)
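As a rough illustration of how these settings might be consumed, the sketch below builds an optimizer and a noise schedule from hypothetical constants with the same names; the actual values and wiring live in `src/config.py` and `src/train.py` and may differ.

```python
import torch

# Hypothetical constants mirroring the documented names (see src/config.py for the real values).
TRAINING_TIMESTEPS = 800
TRAINING_LR = 3.5e-4
TRAINING_EPOCHS = 100

def make_training_objects(model: torch.nn.Module):
    # Optimizer at the documented learning rate.
    optimizer = torch.optim.AdamW(model.parameters(), lr=TRAINING_LR)
    # A simple linear beta schedule over the documented number of diffusion steps
    # (the repository's actual schedule may differ).
    betas = torch.linspace(1e-4, 2e-2, TRAINING_TIMESTEPS)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
    return optimizer, alphas_cumprod

# optimizer, alphas_cumprod = make_training_objects(model)  # then train for TRAINING_EPOCHS epochs
```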
To generate music using a trained model:
```bash
cd src
python inference.py
```

The inference script will:
- Load a trained model checkpoint
- Allow you to specify conditioning input
- Generate musical accompaniment using the diffusion sampling process
- Save the output as audio files
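For orientation, here is a minimal sketch of what a v-parameterization sampler can look like. The model signature `model(x_t, t, cond)`, the schedule, and the deterministic DDIM-style update are assumptions for illustration, not the exact procedure implemented in `src/inference.py` or `src/sample.py`.

```python
import torch

@torch.no_grad()
def sample(model, cond, alphas_cumprod, shape, device="cpu"):
    """Deterministic DDIM-style sampling from a v-prediction model (illustrative only)."""
    x = torch.randn(shape, device=device)                       # start from pure noise
    for t in reversed(range(alphas_cumprod.shape[0])):
        a_t = alphas_cumprod[t].sqrt()
        s_t = (1.0 - alphas_cumprod[t]).sqrt()
        t_batch = torch.full((shape[0],), t, device=device)
        v = model(x, t_batch, cond)                             # predicted velocity
        x0_hat = a_t * x - s_t * v                              # recover x_0 from v
        eps_hat = s_t * x + a_t * v                             # recover noise from v
        if t > 0:
            a_prev = alphas_cumprod[t - 1].sqrt()
            s_prev = (1.0 - alphas_cumprod[t - 1]).sqrt()
            x = a_prev * x0_hat + s_prev * eps_hat              # step to the previous timestep
        else:
            x = x0_hat
    return x                                                    # latent, to be decoded by the VAE
```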
Model architecture and training parameters are centralized in config.py:
```python
MODEL_UNET_SHAPES = [
    (64, 1024),   # Input: Feature=64, Temporal=1024
    (128, 512),   # Downsample
    (256, 256),   # Downsample
    (512, 128),   # Bottleneck (with attention)
    (256, 256),   # Upsample
    (128, 512),   # Upsample
    (64, 1024),   # Output
]
```

You can listen to a sample of the model's generated accompaniment below:
Our model achieves competitive results with significantly reduced computational requirements:
- Parameters: 15M (vs. ~3.3B in OpenAI Jukebox)
- Inference Speed: 52× faster than comparable models
- Memory Efficiency: 220× parameter reduction
- Quality: Competitive performance in production quality and content unity metrics
The model demonstrates stable training convergence and effective learning dynamics:
Training loss progression showing stable convergence over epochs
Mean Squared Error curve demonstrating model learning efficiency
V-parameterization sampling statistics showing prediction quality
- Training Stability: V-parameterization provides more stable training compared to ε-prediction
- Convergence Speed: Faster convergence due to soft alignment attention mechanism
- Generation Quality: Maintains high-quality outputs with significantly reduced parameters
The model was evaluated on standard music generation benchmarks, demonstrating:
- Superior efficiency compared to autoregressive models
- Competitive musical coherence and quality
- Real-time generation capabilities on consumer hardware
Training curves and detailed results are available in the results/ directory.
The core innovation is our adaptive attention mechanism that balances local and global context:
```python
# Timestep-dependent mixing
alpha_t = get_timestep_alpha(t)
context = alpha_t * context_local + (1 - alpha_t) * context_global
```

This allows the model to focus on fine-grained local patterns during early diffusion steps and global structure during later steps.
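`get_timestep_alpha` is shown schematically above; the real schedule is defined in the source. Purely as an illustration, and assuming `t` is the forward-process timestep (t = 0 corresponds to clean data), a linear schedule matching this description, with local emphasis at low-noise steps and global emphasis at high-noise steps, could look like:

```python
import torch

def get_timestep_alpha(t: torch.Tensor, num_timesteps: int = 800) -> torch.Tensor:
    """Hypothetical linear mixing schedule (not the repository's implementation):
    alpha is close to 1 at low-noise timesteps (favoring local attention) and
    close to 0 at high-noise timesteps (favoring global attention)."""
    return 1.0 - t.float() / (num_timesteps - 1)
```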
Instead of predicting noise ε, our model predicts velocity v:
v_t = α_t * ε - σ_t * x_0
This parameterization provides more stable training dynamics and better convergence properties.
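The loss target and the recovery of $x_0$ and $\epsilon$ from a predicted $v$ follow directly from this definition, assuming a variance-preserving schedule with $\alpha_t^2 + \sigma_t^2 = 1$. The snippet below is a sketch of that standard algebra, not the repository's exact training code.

```python
def v_target(x0, eps, alpha_t, sigma_t):
    """Training target for a v-prediction model: v = alpha_t * eps - sigma_t * x0."""
    return alpha_t * eps - sigma_t * x0

def from_v(x_t, v_pred, alpha_t, sigma_t):
    """Invert the parameterization using alpha_t**2 + sigma_t**2 = 1:
    x0 = alpha_t * x_t - sigma_t * v,  eps = sigma_t * x_t + alpha_t * v."""
    return alpha_t * x_t - sigma_t * v_pred, sigma_t * x_t + alpha_t * v_pred

# Typical v-objective loss: mse(model(x_t, t, cond), v_target(x0, eps, alpha_t, sigma_t))
```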
We welcome contributions to improve SAMUeL-GEN! Please feel free to:
- Report bugs and issues
- Suggest new features
- Submit pull requests
- Improve documentation
If you use this code in your research, please cite our paper:
```bibtex
@article{cheung2025samuel,
  title={Efficient Vocal-Conditioned Music Generation via Soft Alignment Attention and Latent Diffusion},
  author={Cheung, Hei Shing and Zhang, Boya},
  journal={arXiv preprint arXiv:2507.19991},
  year={2025}
}
```

For questions and collaborations:
We thank the research community for their foundational work in diffusion models and music generation. Special thanks to the creators of the datasets and tools that made this research possible.
Note: This is research code intended for academic and educational purposes. For production use, additional optimization and testing may be required.