Closed
Labels: question (Further information is requested)
Description
I'm mostly interested in using Spleeter's pretrained models directly from my C++ application, skipping the provided Python scripts as much as possible.
My understanding of how to use those pretrained models is as follows (based on the 4stems base_config.json):
1. Convert the audio data to 44100 Hz stereo float32.
2. Transform it into two complex spectrograms (one per channel; FFT size 4096 samples, Hann window, FFT step 1024 samples).
3. Convert the two complex spectrograms into two magnitude spectrograms.
4. Take the first 512 frames / lowest 1024 bins of those two magnitude spectrograms, giving a 512x1024x2 float32 data block.
5. Feed that 512x1024x2 float32 data block to the pretrained model (using TensorFlow) and get four 512x1024x2 float32 prediction data blocks back.
6. Use the resulting four predictions, the original magnitude spectrogram and the separation_exponent to compute four instrument masks: instrument_masks = (predictions^separation_exponent) / (original_magnitude_spectrogram^separation_exponent).
7. Apply each instrument mask to the original complex spectrograms and compute the inverse transforms to get four audio stems.
8. Advance by 512 frames and repeat from step 4 until the end of the file.
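The analysis half of the steps above (2–4) can be sketched in NumPy as follows — a minimal sketch, not Spleeter's actual implementation, assuming the parameters quoted from base_config.json (FFT size 4096, step 1024, 512 frames and 1024 bins per model input); the network call itself is not shown:

```python
import numpy as np

FFT_SIZE = 4096   # FFT window length, per base_config.json
HOP = 1024        # FFT step
T = 512           # frames per model input block
F = 1024          # lowest frequency bins kept

def stft(channel: np.ndarray) -> np.ndarray:
    """Hann-windowed STFT of one float32 channel (step 2).
    Returns a complex array of shape (n_frames, FFT_SIZE // 2 + 1)."""
    win = np.hanning(FFT_SIZE)
    n_frames = 1 + (len(channel) - FFT_SIZE) // HOP
    frames = np.stack([channel[i * HOP : i * HOP + FFT_SIZE] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def model_input(stft_l: np.ndarray, stft_r: np.ndarray, start: int) -> np.ndarray:
    """Steps 3-4: magnitudes of the lowest F bins for T frames of both
    channels, shaped (T, F, 2) float32 -- the block fed to the network."""
    block = np.stack([stft_l[start:start + T, :F],
                      stft_r[start:start + T, :F]], axis=-1)
    return np.abs(block).astype(np.float32)
```

For a ~12 s stereo file you would call `stft` once per channel, then slide `start` in steps of 512 frames to produce successive model inputs.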
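The masking/resynthesis half (steps 6–8) might then look like the sketch below. It implements the mask formula exactly as written above (predictions^e over the original magnitude^e, with a small epsilon to avoid division by zero), and pads the mask with ones over the discarded high bins before applying it to the full complex spectrogram — the padding choice is my assumption, not something stated in the config:

```python
import numpy as np

FFT_SIZE = 4096   # must match the analysis parameters
HOP = 1024

def masks(predictions, orig_mag, separation_exponent=2.0, eps=1e-10):
    """Step 6 as written above: mask_i = predictions_i**e / orig_mag**e.
    predictions: (n_instruments, T, F, 2), orig_mag: (T, F, 2)."""
    return (predictions ** separation_exponent) / (orig_mag ** separation_exponent + eps)

def apply_mask(stft_block, mask):
    """Extend a (T, F) mask with ones over the discarded high bins (an
    assumption), then apply it to the (T, n_bins) complex STFT of one channel."""
    full = np.ones(stft_block.shape, dtype=mask.dtype)
    full[:, :mask.shape[1]] = mask
    return stft_block * full

def istft(spec, length):
    """Inverse of the Hann-windowed STFT, by windowed overlap-add (step 7)."""
    win = np.hanning(FFT_SIZE)
    out = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(np.fft.irfft(spec, n=FFT_SIZE, axis=1)):
        s = i * HOP
        out[s:s + FFT_SIZE] += frame * win
        norm[s:s + FFT_SIZE] += win ** 2
    return out / np.maximum(norm, 1e-10)
```

Each stem would be resynthesized by masking both channels' complex spectrograms with that instrument's mask and running `istft` on each.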
Am I correct, or have I misunderstood or missed something? Does the TensorFlow model take float32 inputs and produce float32 outputs?
Thanks!