ICASSP 2024 T-Foley
ABSTRACT
Foley sound, audio content inserted synchronously with videos in post-production, plays a crucial role in the user experience of multimedia content. Recently, Foley sound synthesis has been actively studied, leveraging the advances of deep generative models. However, existing models mainly focus on mimicking a particular sound class, either as a single event or as a holistic context, without temporal information about the individual sources. We present T-Foley, a Temporal-event-guided waveform generation model for Foley sound synthesis. T-Foley generates high-quality audio under two conditions: the sound class and a temporal event condition. The temporal event condition is implemented with Block-FiLM, a novel conditioning method derived from Temporal FiLM. We show that T-Foley achieves superior performance on both objective and subjective evaluation metrics and generates Foley sounds that are well synchronized with the temporal event condition. We particularly use vocal mimicking datasets paired with Foley sounds for the temporal event control, considering their intuitive usage in real-world application scenarios.

Index Terms— Foley Sound Synthesis, Controllable Sound Generation, General Audio Synthesis, Waveform Domain Diffusion

1. INTRODUCTION

Foley sound refers to human-created sound effects, such as footsteps or gunshots, that accentuate visual media. The significance of Foley sound lies in its ability to enhance the overall immersive experience for various forms of media [1]. Foley sounds are usually created by Foley artists who record and produce the required sounds manually, synchronizing them with the visual elements.

The advent of neural audio generation has presented an opportunity to automate and streamline the Foley sound creation process, reducing the time and effort required for sound production. To synthesize proper sounds for specific categories, early studies usually focused on single sound sources such as footsteps [2], laughter [3], and drums [4, 5]. Subsequent research further improved models to generate multiple sound categories using auto-regressive models [6, 7, 8], Generative Adversarial Networks (GANs) [9, 10], or diffusion models [11]. Recently, it has become possible to generate holistic scene sounds based solely on a text description, even without pre-defined sound categories [12, 13, 14].

While previous work showed that neural audio models can faithfully synthesize the target sound, few of them pay attention to the timing of the sound when it includes multiple sound events. Locating sound events at the right time is critical, considering the practical use of Foley sound synthesis. Some studies generated implicit temporal features from the input video during synthesis [6, 9, 10, 15]. However, they do not offer explicit temporal event conditions for controllability and did not provide quantitative analysis due to the absence of an adequate metric.

This research aims to address this challenge and produce realistic, timing-aligned Foley sound effects given a specific sound category. To the best of our knowledge, this is the first attempt to generate audio with explicit temporal event conditions. Our contributions are the following. First, we propose T-Foley, a Temporal-event-guided diffusion model conditioned on a sound class to generate high-quality Foley sound. For the temporal guidance, we define the temporal event feature to represent the timing of sounds. To devise a conditioning method that reflects this temporally informative condition, we introduce Block-FiLM, a modification of FiLM [16] for block-wise affine transformation. Second, we conduct extensive experiments to validate the performance and provide a comparative analysis of temporal conditioning methods and temporal event features. The evaluation includes both objective and subjective metrics. Lastly, we show potential applications of neural Foley sound synthesis by demonstrating its performance with a human voice that mimics the target sound events as an intuitive way to capture temporal event features.

∗ These authors contributed equally. This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00222383).

Fig. 1. A description of temporal-event-guided Foley sound synthesis. The model generates waveforms with the timbre of the sound class and loudness that follows the temporal event condition.

2. T-FOLEY

As shown in Figure 1, the model is designed to take the sound class category and the temporal event condition as input, representing "what" and "when" the sound should be generated, respectively.
Fig. 2. (a) Overall architecture of the proposed model. (c: sound class, σ: diffusion timestep, T: temporal event feature) (b) A detailed structure of a Down/Up sampling block. Each Down block has a strided convolutional layer at the front, while Up blocks use a transposed one. (h_in/h_out: latent features) (c) Comparison of FiLM, TFiLM, and the proposed BFiLM. (Y: conditioning input, X: input activation)
2.1. Overall Architecture

The architecture of T-Foley is depicted in Figure 2. The U-Net structure, including an auto-regressive module in the bottleneck, is borrowed from DAG [11], a state-of-the-art Foley sound synthesis model operating at high resolution (44.1 kHz) and based on a waveform-domain diffusion model that does not rely on pretrained modules. Compared with CRASH [5], the first diffusion model to generate waveforms from scratch, DAG adds a sequential module at the bottleneck of a deeper U-Net architecture to address the problem of inconsistent timbre within a generated sample. To predict the noise at each diffusion timestep, the model first downsamples the noised signal x into a latent vector and passes it to a bidirectional LSTM to maintain timbre consistency within a sample. The output of the bottleneck layer is resized by linear projections and finally upsampled into the noise prediction ϵ̂. Within each downsampling and upsampling block, convolution layers are conditioned on the diffusion timestep embedding σ (as in [5]) and the class embedding c through FiLM, and on the temporal event feature T through Block-FiLM. The model is trained end-to-end to minimize the loss function proposed in CRASH [5].

2.2. Temporal Event Feature

As the primary objective of T-Foley is to generate audio upon a temporal event condition, it is essential for the model to learn temporal information regarding the occurrence and transition of sound events in time. This necessitates an appropriate conditioning feature for the sound events. We use the frame-level root-mean-square (RMS) of the waveform, a widely used amplitude envelope feature computed over successive frames of the signal. In contrast, event features based on onsets convey timing without power, and some categories (e.g., rain) do not have definite onsets and offsets by the nature of the sounds.

2.3. Block FiLM

We propose Block FiLM (BFiLM) as a neural module to condition the generation model on the temporal event feature. FiLM is a Conditional Batch Normalization (CBN) technique that modulates the activations of individual feature maps or channels with an affine transformation based on a conditioning input. It has been widely used in various tasks, including image synthesis, style transfer, and diffusion models for waveform generation [17, 5, 11]. Mathematically, with a conditioning input y ∈ R^(Cin×Lin), where Cin is the input channel size and Lin is the input length, the FiLM modulation can be expressed as FiLM(x, y, γ, β) = γ ⊙ x + β, where x represents the input activations, y is the conditioning input, and γ, β ∈ R^Cout are normalizing parameters obtained as γ, β = MLP(y). The ⊙ symbol denotes the Hadamard product (element-wise multiplication).

Temporal FiLM (TFiLM) [18] was proposed to overcome the limitation of FiLM with respect to time-varying information in conditioning signals, which is crucial for processing audio. Given a sequential conditioning input y ∈ R^(Cin×Lin), where Cin is the input channel size and Lin is the sequence length in the time domain, TFiLM first splits y into N blocks. The i-th block Y^b_i ∈ R^(Cin×Lin/N) (i = 1, ..., N) is max-pooled along the time dimension as Y^b_pool,i = Max-Pool(Y^b_i) ∈ R^Cin and passed to an RNN for sequential modeling to obtain the normalizing parameters: (γi, βi), hi = RNN(Y^b_pool,i; hi−1). Finally, a linear modulation is applied channel-wise to each block of the input activations according to its normalizing parameters.
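To make the temporal event feature of Section 2.2 concrete, the following is a minimal sketch that computes a frame-level RMS envelope from a mono waveform. The frame and hop lengths are illustrative assumptions; the exact framing parameters used in T-Foley are not given in this excerpt.

import numpy as np

def temporal_event_feature(x: np.ndarray, frame_length: int = 1024,
                           hop_length: int = 256) -> np.ndarray:
    """Frame-level RMS envelope of a mono waveform x (cf. Section 2.2).

    frame_length and hop_length are illustrative defaults, not values
    taken from the paper.
    """
    n_frames = 1 + max(0, len(x) - frame_length) // hop_length
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop_length: i * hop_length + frame_length]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms

For a 4-second clip at 22,050 Hz (the audio format of the dataset described below), these illustrative settings yield a conditioning sequence of roughly 340 frames.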
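The comparison of FiLM, TFiLM, and BFiLM in Figure 2(c) can also be sketched as a small module. This is only an illustration under stated assumptions: the excerpt describes Block-FiLM as a block-wise affine transformation that is lighter than TFiLM, but does not spell out its exact parameterization, so the per-block (γ, β) below come from a shared linear layer. Replacing that layer with an RNN over the pooled blocks gives a TFiLM-style variant, and using a single block reduces to a global FiLM-style modulation.

import torch
import torch.nn as nn

class BlockwiseFiLM(nn.Module):
    """Sketch of block-wise feature modulation in the spirit of Section 2.3.

    The conditioning input y (B, C_y, L_y) is split into n_blocks along time,
    each block is max-pooled, and per-block (gamma, beta) are predicted and
    applied channel-wise to the matching time block of x. The shared linear
    projection is an assumption; the paper's exact Block-FiLM layout is not
    reproduced here.
    """

    def __init__(self, cond_channels: int, feat_channels: int, n_blocks: int):
        super().__init__()
        self.n_blocks = n_blocks
        self.to_gamma_beta = nn.Linear(cond_channels, 2 * feat_channels)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (B, C_x, L_x) activations; y: (B, C_y, L_y) conditioning input.
        # L_x and L_y are assumed to be divisible by n_blocks.
        B, Cx, Lx = x.shape
        N = self.n_blocks
        y_pool = y.reshape(B, y.shape[1], N, -1).amax(dim=-1)      # (B, C_y, N)
        gamma, beta = self.to_gamma_beta(y_pool.transpose(1, 2)).chunk(2, dim=-1)
        gamma = gamma.permute(0, 2, 1).unsqueeze(-1)               # (B, C_x, N, 1)
        beta = beta.permute(0, 2, 1).unsqueeze(-1)                 # (B, C_x, N, 1)
        x_blocks = x.reshape(B, Cx, N, -1)                         # (B, C_x, N, L_x/N)
        return (gamma * x_blocks + beta).reshape(B, Cx, Lx)

Under this reading, the efficiency gap between TFiLM and BFiLM reported in Section 4.1 would come from dropping the recurrent pass over the pooled blocks.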
We utilized the Foley sound dataset provided for the Foley sound synthesis task of the 2023 DCASE challenge [1]. The dataset comprises approximately 5k class-labeled sound samples (5.4 hours) sourced from three different datasets. The samples are labeled with seven Foley sound classes: DogBark, Footstep, GunShot, Keyboard, MovingMotorVehicle, Rain, and Sneeze Cough. All audio samples are mono, 16-bit, 22,050 Hz, and 4 seconds long. In this work, we used about 95% of the development dataset for training and 5% for validation.

While controllable Foley sound synthesis holds great potential, it is not trivial for users to express the desired temporal events effectively. For more intuitive and easier conditioning, we used human voices that mimic Foley sounds as references from which to extract temporal event conditions. In particular, we used subsets of two vocal mimicking datasets paired with the original target sounds: Vocal Imitation Set [19] and VocalSketch [20]. We adjusted the duration of each audio sample representing the 6 sound classes (excluding Sneeze Cough) to match the training data.

4.1. Temporal Event Conditioning Methods

We compare the performance of different conditioning methods for the temporal event condition. Table 2 presents the quantitative scores and the MOS (Mean Opinion Score) from the qualitative evaluation. Overall, TFiLM and BFiLM, which consider the temporal aspect of the event condition, received higher scores on most of the objective and subjective metrics. Notably, BFiLM demonstrated markedly superior performance in most of the results, achieving these improvements with approximately 0.7 times as many parameters as TFiLM and reduced inference time. These results validate that BFiLM is efficient and suitable for our task. FiLM may have a high IS value because it generates diverse low-quality audio that differs from the ground truth. In the other experiments, we only use BFiLM.

4.2. Effect of Block Numbers

The block number N in Section 2.3 is an important hyperparameter that controls the temporal resolution of the condition. Fewer blocks lead to a coarser temporal resolution of the event condition but lower computational cost.
Fig. 3. Tradeoff between performance (E-L1, FAD-P) and efficiency (inference time) for different numbers of blocks.

Fig. 4. The first row shows the control sounds used to extract the target event feature. Subsequent rows show three classes of Foley sounds generated with different conditioning blocks (FiLM, TFiLM, and BFiLM). They are all displayed as mel-spectrograms.
Block #    Vocal Imitation          VocalSketch
           E-L1↓      FAD-P↓        E-L1↓      FAD-P↓
245        0.0228     74.94         0.0186     59.33
98         0.0300     69.13         0.0274     56.51
49         0.0306     64.55         0.0299     49.62
14         0.0764     59.56         0.0635     58.93
7          0.0935     66.23         0.0914     60.41
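The table above reports E-L1 for different block numbers; the metric's definition is not included in this excerpt, but its name and the conclusion's wording ("the temporal event feature and the E-L1 metric") suggest an L1 distance between the temporal event features of generated and reference audio. The snippet below is a sketch under that assumption only, not the paper's definition.

import numpy as np

def frame_rms(x: np.ndarray, frame: int = 1024, hop: int = 256) -> np.ndarray:
    """Frame-level RMS envelope (same illustrative framing as the
    Section 2.2 sketch)."""
    n = 1 + max(0, len(x) - frame) // hop
    return np.array([np.sqrt(np.mean(x[i * hop: i * hop + frame] ** 2))
                     for i in range(n)])

def event_l1(x_gen: np.ndarray, x_ref: np.ndarray) -> float:
    """Assumed reading of E-L1: mean absolute difference between the
    frame-level RMS envelopes of generated and reference waveforms."""
    e_gen, e_ref = frame_rms(x_gen), frame_rms(x_ref)
    n = min(len(e_gen), len(e_ref))
    return float(np.mean(np.abs(e_gen[:n] - e_ref[:n])))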
4.4. Case Study

To showcase the performance and usability of T-Foley, we present two case studies. First, we compare the output of our proposed BFiLM method with that of the FiLM and TFiLM methods. In Figure 4, BFiLM exhibits the highest alignment with the mel-spectrogram of the target conditioning sound. Both FiLM and TFiLM generate unclear and undesirable sound events for the Footstep sound class. For GunShot and Sneeze Cough, only BFiLM appears to reflect the sustain and decay of the attack well.

5. CONCLUSION

This study presents T-Foley, a Foley sound generation system addressing controllability in the temporal domain. By introducing the temporal event feature and the E-L1 metric, we show that Block-FiLM, our proposed conditioning method, is effective and powerful in terms of quality and efficiency. We also demonstrate the strong performance of T-Foley on vocal mimicking datasets to support its usability.

1 [Link] [Link]
6. REFERENCES

[1] Keunwoo Choi, Jaekwon Im, Laurie Heller, Brian McFee, Keisuke Imoto, Yuki Okamoto, Mathieu Lagrange, and Shinosuke Takamichi, "Foley sound synthesis at the DCASE 2023 challenge," arXiv preprint arXiv:2304.12521, 2023.

[2] Marco Comunità, Huy Phan, and Joshua D Reiss, "Neural synthesis of footsteps sound effects with generative adversarial networks," arXiv preprint arXiv:2110.09605, 2021.

[3] M Mehdi Afsar, Eric Park, Étienne Paquette, Gauthier Gidel, Kory W Mathewson, and Eilif Muller, "Generating diverse realistic laughter for interactive art," arXiv preprint arXiv:2111.03146, 2021.

[4] Javier Nistal, Stefan Lattner, and Gael Richard, "DrumGAN: Synthesis of drum sounds with timbral feature conditioning using generative adversarial networks," arXiv preprint arXiv:2008.12073, 2020.

[5] Simon Rouard and Gaëtan Hadjeres, "CRASH: Raw audio score-based generative modeling for controllable high-resolution drum sound synthesis," arXiv preprint arXiv:2106.07431, 2021.

[6] Sanchita Ghose and John Jeffrey Prevost, "AutoFoley: Artificial synthesis of synchronized sound tracks for silent videos with deep learning," IEEE Transactions on Multimedia, vol. 23, pp. 1895–1907, 2020.

[7] Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, and Andrew Owens, "Conditional generation of audio from video via Foley analogies," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2426–2436.

[8] Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D Plumbley, and Wenwu Wang, "Conditional sound generation using neural discrete time-frequency representation learning," in 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2021, pp. 1–6.

[9] Sanchita Ghose and John J Prevost, "FoleyGAN: Visually guided generative adversarial network-based synchronous sound generation in silent videos," IEEE Transactions on Multimedia, 2022.

[10] Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, and Chuang Gan, "Generating visually aligned sound from videos," IEEE Transactions on Image Processing, vol. 29, pp. 8292–8302, 2020.

[11] Santiago Pascual, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, and Joan Serrà, "Full-band general audio synthesis with score-based diffusion," in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[12] Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu, "Diffsound: Discrete diffusion model for text-to-sound generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.

[13] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley, "AudioLDM: Text-to-audio generation with latent diffusion models," arXiv preprint arXiv:2301.12503, 2023.

[14] Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi, "AudioGen: Textually guided audio generation," arXiv preprint arXiv:2209.15352, 2022.

[15] Chenye Cui, Zhou Zhao, Yi Ren, Jinglin Liu, Rongjie Huang, Feiyang Chen, Zhefeng Wang, Baoxing Huai, and Fei Wu, "VarietySound: Timbre-controllable video to sound generation via unsupervised information disentanglement," in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[16] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville, "FiLM: Visual reasoning with a general conditioning layer," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, vol. 32.

[17] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan, "WaveGrad: Estimating gradients for waveform generation," arXiv preprint arXiv:2009.00713, 2020.

[18] Sawyer Birnbaum, Volodymyr Kuleshov, Zayd Enam, Pang Wei W Koh, and Stefano Ermon, "Temporal FiLM: Capturing long-range sequence dependencies with feature-wise modulations," Advances in Neural Information Processing Systems, vol. 32, 2019.

[19] Bongjun Kim, Madhav Ghei, Bryan Pardo, and Zhiyao Duan, "Vocal Imitation Set: A dataset of vocally imitated sound events using the AudioSet ontology," in DCASE, 2018, pp. 148–152.

[20] Mark Cartwright and Bryan Pardo, "VocalSketch: Vocally imitating audio concepts," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015, pp. 43–46.

[21] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho, "Variational diffusion models," Advances in Neural Information Processing Systems, vol. 34, pp. 21696–21707, 2021.

[22] Jonathan Ho and Tim Salimans, "Classifier-free diffusion guidance," arXiv preprint arXiv:2207.12598, 2022.

[23] Shawn Hershey, Sourish Chaudhuri, Daniel P W Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., "CNN architectures for large-scale audio classification," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.

[24] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.