ICASSP 2024 T-Foley
ABSTRACT
Foley sound, audio content inserted synchronously with videos in post-production, plays a crucial role in the user experience of multimedia content. Recently, Foley sound synthesis has been actively studied, leveraging the advances of deep generative models. However, existing models mainly focus on mimicking a particular sound class, either as a single event or as a holistic context, without temporal information about the individual sources. We present T-Foley, a Temporal-event-guided waveform generation model for Foley sound synthesis. T-Foley generates high-quality audio under two conditions: the sound class and a temporal event condition. The temporal event condition is implemented with Block-FiLM, a novel conditioning method derived from Temporal FiLM. We show that T-Foley achieves superior performance on both objective and subjective evaluation metrics and generates Foley sounds that are well synchronized with the temporal event condition. We particularly use vocal mimicking datasets paired with Foley sounds for the temporal event control, considering their intuitive usage in real-world application scenarios.

Index Terms— Foley Sound Synthesis, Controllable Sound Generation, General Audio Synthesis, Waveform Domain Diffusion

1. INTRODUCTION

Foley sound refers to human-created sound effects, such as footsteps or gunshots, that accentuate visual media. The significance of Foley sound lies in its ability to enhance the overall immersive experience for various forms of media [1]. Foley sounds are usually created by Foley artists who record and produce the required sounds manually, synchronizing them with the visual elements.

The advent of neural audio generation has presented an opportunity to automate and streamline the Foley sound creation process, reducing the time and effort required for sound production. To synthesize proper sounds for specific categories, early studies usually focused on single sound sources such as footsteps [2], laughter [3], and drums [4, 5]. Subsequent research further improved models to generate multiple sound categories using auto-regressive models [6, 7, 8], Generative Adversarial Networks (GANs) [9, 10], or diffusion models [11]. Recently, it has become possible to generate holistic scene sounds based solely on a text description, even without pre-defined sound categories [12, 13, 14].

While previous work showed that neural audio models can faithfully synthesize the target sound, few of them pay attention to the timing of the sound when it includes multiple sound events. Locating sound events at the right time is critical, considering the practical use of Foley sound synthesis. Some studies generated implicit temporal features from the input video during synthesis [6, 9, 10, 15]. However, they do not offer explicit temporal event conditions for controllability and did not provide quantitative analysis due to the absence of an adequate metric.

This research aims to address this challenge and produce realistic, timing-aligned Foley sound effects given a specific sound category. To the best of our knowledge, this is the first attempt to generate audio with explicit temporal event conditions. Our contributions are the following. First, we propose T-Foley, a Temporal-event-guided diffusion model conditioned on a sound class to generate high-quality Foley sound. For the temporal guidance, we define the temporal event feature to represent the timing of sounds. To devise a conditioning method that reflects this temporally informative condition, we introduce Block-FiLM, a modification of FiLM [16] for block-wise affine transformation. Second, we conduct extensive experiments to validate the performance and provide a comparative analysis of temporal conditioning methods and temporal event features. The evaluation includes both objective and subjective metrics. Lastly, we show potential applications of neural Foley sound synthesis by demonstrating its performance with a human voice that mimics the target sound events as an intuitive way to capture temporal event features.

∗ These authors contributed equally. This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00222383).

Fig. 1. A description of temporal-event-guided Foley sound synthesis. The model generates waveforms with the timbre of the sound class and loudness that follows the temporal event condition.

2. T-FOLEY

As shown in Figure 1, the model is designed to take the sound class category and the temporal event condition as input, representing "what" and "when" the sound should be generated, respectively.
Fig. 2. (a) Overall architecture of the proposed model. (c: sound class, σ: diffusion timestep, T: temporal event feature) (b) A detailed structure of a Down/Up sampling block. Each Down block has a strided convolutional layer at the front, while Up blocks use a transposed one. (h_in/h_out: latent features) (c) Comparison of FiLM, TFiLM, and the proposed BFiLM. (Y: conditioning input, X: input activation)
2.1. Overall Architecture

The architecture of T-Foley is depicted in Figure 2. The U-Net structure, including an auto-regressive module in the bottleneck, is borrowed from DAG [11], a state-of-the-art Foley sound synthesis model operating at high resolution (44.1 kHz) and based on a waveform-domain diffusion model that does not rely on pretrained modules. Compared with CRASH [5], the first diffusion model to generate waveforms from scratch, DAG adds a sequential module at the bottleneck of a deeper U-Net architecture to address the problem of inconsistent timbre within a generated sample. To predict the noise at each diffusion timestep, the model first downsamples the noised signal x into a latent vector and passes it to a bidirectional LSTM to maintain timbre consistency within a sample. The output of the bottleneck layer is resized by linear projections and finally upsampled into the noise prediction ϵ̂. Within each downsampling and upsampling block, convolution layers are conditioned on the diffusion timestep embedding σ (as in [5]) and the class embedding c through FiLM, and on the temporal event feature T through Block-FiLM. The model is trained end-to-end to minimize the loss function proposed in CRASH [5].

2.2. Temporal Event Feature

As the primary objective of T-Foley is to generate audio upon a temporal event condition, it is essential for the model to learn temporal information regarding the occurrence and transition of sound events in time. This necessitates an appropriate conditioning feature for the sound events. We use the frame-level root-mean-square (RMS) of the waveform, a widely used amplitude envelope feature computed over successive frames of the signal. In contrast, event features based on onsets convey timing without power, and some categories (e.g., rain) do not have definite onsets and offsets by the nature of the sounds.

2.3. Block FiLM

We propose Block FiLM (BFiLM) as a neural module to condition the generation model on the temporal event feature. FiLM is a Conditional Batch Normalization (CBN) technique that modulates the activations of individual feature maps or channels with an affine transformation based on a conditioning input. It has been widely used in various tasks, including image synthesis, style transfer, and diffusion models for waveform generation [17, 5, 11]. Mathematically, with a conditioning input y ∈ R^(Cin×Lin), where Cin is the input channel size and Lin is the input length, the FiLM modulation can be expressed as FiLM(x, y, γ, β) = γ ⊙ x + β, where x represents the input activations, y is the conditioning input, and γ, β ∈ R^Cout are normalizing parameters obtained as γ, β = MLP(y). The ⊙ symbol denotes the Hadamard product (element-wise multiplication).

Temporal FiLM (TFiLM) [18] was proposed to overcome the limitation of FiLM with respect to time-varying information in conditioning signals, which is crucial for processing audio. Given a sequential conditioning input y ∈ R^(Cin×Lin), where Cin is the input channel size and Lin is the sequence length in the time domain, TFiLM first splits y into N blocks. The i-th block Y^b_i ∈ R^(Cin×Lin/N) (i = 1, ..., N) is max-pooled along the time dimension as Y^b_pool,i = Max-Pool(Y^b_i) ∈ R^Cin and passed to an RNN for sequential modeling to obtain the normalizing parameters: (γi, βi), hi = RNN(Y^b_pool,i; hi−1). Finally, a linear modulation is applied channel-wise to each block of the input activations according to its normalizing parameters.
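To make the temporal event feature of Section 2.2 concrete, the following is a minimal sketch that computes a frame-level RMS envelope from a mono waveform. The frame and hop lengths are illustrative assumptions; the exact framing parameters used in T-Foley are not given in this excerpt.

import numpy as np

def temporal_event_feature(x: np.ndarray, frame_length: int = 1024,
                           hop_length: int = 256) -> np.ndarray:
    """Frame-level RMS envelope of a mono waveform x (cf. Section 2.2).

    frame_length and hop_length are illustrative defaults, not values
    taken from the paper.
    """
    n_frames = 1 + max(0, len(x) - frame_length) // hop_length
    rms = np.empty(n_frames)
    for i in range(n_frames):
        frame = x[i * hop_length: i * hop_length + frame_length]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms

For a 4-second clip at 22,050 Hz (the audio format of the dataset described below), these illustrative settings yield a conditioning sequence of roughly 340 frames.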
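The comparison of FiLM, TFiLM, and BFiLM in Figure 2(c) can also be sketched as a small module. This is only an illustration under stated assumptions: the excerpt describes Block-FiLM as a block-wise affine transformation that is lighter than TFiLM, but does not spell out its exact parameterization, so the per-block (γ, β) below come from a shared linear layer. Replacing that layer with an RNN over the pooled blocks gives a TFiLM-style variant, and using a single block reduces to a global FiLM-style modulation.

import torch
import torch.nn as nn

class BlockwiseFiLM(nn.Module):
    """Sketch of block-wise feature modulation in the spirit of Section 2.3.

    The conditioning input y (B, C_y, L_y) is split into n_blocks along time,
    each block is max-pooled, and per-block (gamma, beta) are predicted and
    applied channel-wise to the matching time block of x. The shared linear
    projection is an assumption; the paper's exact Block-FiLM layout is not
    reproduced here.
    """

    def __init__(self, cond_channels: int, feat_channels: int, n_blocks: int):
        super().__init__()
        self.n_blocks = n_blocks
        self.to_gamma_beta = nn.Linear(cond_channels, 2 * feat_channels)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (B, C_x, L_x) activations; y: (B, C_y, L_y) conditioning input.
        # L_x and L_y are assumed to be divisible by n_blocks.
        B, Cx, Lx = x.shape
        N = self.n_blocks
        y_pool = y.reshape(B, y.shape[1], N, -1).amax(dim=-1)      # (B, C_y, N)
        gamma, beta = self.to_gamma_beta(y_pool.transpose(1, 2)).chunk(2, dim=-1)
        gamma = gamma.permute(0, 2, 1).unsqueeze(-1)               # (B, C_x, N, 1)
        beta = beta.permute(0, 2, 1).unsqueeze(-1)                 # (B, C_x, N, 1)
        x_blocks = x.reshape(B, Cx, N, -1)                         # (B, C_x, N, L_x/N)
        return (gamma * x_blocks + beta).reshape(B, Cx, Lx)

Under this reading, the efficiency gap between TFiLM and BFiLM reported in Section 4.1 would come from dropping the recurrent pass over the pooled blocks.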
We utilized the Foley sound dataset provided for the Foley sound synthesis task of the 2023 DCASE challenge [1]. The dataset comprises approximately 5k class-labeled sound samples (5.4 hours) sourced from three different datasets. The samples are labeled with seven Foley sound classes: DogBark, Footstep, GunShot, Keyboard, MovingMotorVehicle, Rain, and Sneeze Cough. All audio samples are mono, 16-bit, 22,050 Hz, and 4 seconds long. In this work, we used about 95% of the development dataset for training and 5% for validation.

While controllable Foley sound synthesis holds great potential, it is not trivial for users to express the desired temporal events effectively. For more intuitive and easier conditioning, we used human voices that mimic Foley sounds as references from which to extract temporal event conditions. In particular, we used subsets of two vocal mimicking datasets paired with the original target sounds: Vocal Imitation Set [19] and VocalSketch [20]. We adjusted the duration of each audio sample representing the 6 sound classes (excluding Sneeze Cough) to match the training data.

4.1. Temporal Event Conditioning Methods

We compare the performance of different conditioning methods for the temporal event condition. Table 2 presents the quantitative scores and the MOS (Mean Opinion Score) from the qualitative evaluation. Overall, TFiLM and BFiLM, which consider the temporal aspect of the event condition, received higher scores on most of the objective and subjective metrics. Notably, BFiLM demonstrated markedly superior performance in most of the results, achieving these improvements with approximately 0.7 times as many parameters as TFiLM and reduced inference time. These results validate that BFiLM is efficient and suitable for our task. FiLM may have a high IS value because it generates diverse low-quality audio that differs from the ground truth. In the other experiments, we only use BFiLM.

4.2. Effect of Block Numbers

The block number N in Section 2.3 is an important hyperparameter that controls the temporal resolution of the condition. Fewer blocks lead to a coarser temporal resolution of the event condition but lower computational cost.
Fig. 3. Tradeoff between performance (E-L1, FAD-P) and efficiency (inference time) for different numbers of blocks.

Fig. 4. The first row shows the control sounds used to extract the target event feature. Subsequent rows show three classes of Foley sounds generated with different conditioning blocks (FiLM, TFiLM, and BFiLM). They are all displayed as mel-spectrograms.
Block #    Vocal Imitation          VocalSketch
           E-L1↓      FAD-P↓        E-L1↓      FAD-P↓
245        0.0228     74.94         0.0186     59.33
98         0.0300     69.13         0.0274     56.51
49         0.0306     64.55         0.0299     49.62
14         0.0764     59.56         0.0635     58.93
7          0.0935     66.23         0.0914     60.41
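The table above reports E-L1 for different block numbers; the metric's definition is not included in this excerpt, but its name and the conclusion's wording ("the temporal event feature and the E-L1 metric") suggest an L1 distance between the temporal event features of generated and reference audio. The snippet below is a sketch under that assumption only, not the paper's definition.

import numpy as np

def frame_rms(x: np.ndarray, frame: int = 1024, hop: int = 256) -> np.ndarray:
    """Frame-level RMS envelope (same illustrative framing as the
    Section 2.2 sketch)."""
    n = 1 + max(0, len(x) - frame) // hop
    return np.array([np.sqrt(np.mean(x[i * hop: i * hop + frame] ** 2))
                     for i in range(n)])

def event_l1(x_gen: np.ndarray, x_ref: np.ndarray) -> float:
    """Assumed reading of E-L1: mean absolute difference between the
    frame-level RMS envelopes of generated and reference waveforms."""
    e_gen, e_ref = frame_rms(x_gen), frame_rms(x_ref)
    n = min(len(e_gen), len(e_ref))
    return float(np.mean(np.abs(e_gen[:n] - e_ref[:n])))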
4.4. Case Study

To showcase the performance and usability of T-Foley, we present two case studies. First, we compare the output of our proposed BFiLM method with that of the FiLM and TFiLM methods. In Figure 4, BFiLM exhibits the highest alignment with the mel-spectrogram of the target conditioning sound. Both FiLM and TFiLM generate unclear and undesirable sound events for the Footstep sound class. For GunShot and Sneeze Cough, only BFiLM appears to reflect the sustain and decay of the attack well.

5. CONCLUSION

This study presents T-Foley, a Foley sound generation system addressing controllability in the temporal domain. By introducing the temporal event feature and the E-L1 metric, we show that Block-FiLM, our proposed conditioning method, is effective and powerful in terms of quality and efficiency. We also demonstrate the strong performance of T-Foley on vocal mimicking datasets to support its usability.

1 [Link] [Link]
6. REFERENCES

[1] Keunwoo Choi, Jaekwon Im, Laurie Heller, Brian McFee, Keisuke Imoto, Yuki Okamoto, Mathieu Lagrange, and Shinosuke Takamichi, "Foley sound synthesis at the DCASE 2023 challenge," arXiv preprint arXiv:2304.12521, 2023.

[2] Marco Comunità, Huy Phan, and Joshua D Reiss, "Neural synthesis of footsteps sound effects with generative adversarial networks," arXiv preprint arXiv:2110.09605, 2021.

[3] M Mehdi Afsar, Eric Park, Étienne Paquette, Gauthier Gidel, Kory W Mathewson, and Eilif Muller, "Generating diverse realistic laughter for interactive art," arXiv preprint arXiv:2111.03146, 2021.

[4] Javier Nistal, Stefan Lattner, and Gael Richard, "DrumGAN: Synthesis of drum sounds with timbral feature conditioning using generative adversarial networks," arXiv preprint arXiv:2008.12073, 2020.

[5] Simon Rouard and Gaëtan Hadjeres, "CRASH: Raw audio score-based generative modeling for controllable high-resolution drum sound synthesis," arXiv preprint arXiv:2106.07431, 2021.

[6] Sanchita Ghose and John Jeffrey Prevost, "AutoFoley: Artificial synthesis of synchronized sound tracks for silent videos with deep learning," IEEE Transactions on Multimedia, vol. 23, pp. 1895–1907, 2020.

[7] Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, and Andrew Owens, "Conditional generation of audio from video via Foley analogies," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2426–2436.

[8] Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D Plumbley, and Wenwu Wang, "Conditional sound generation using neural discrete time-frequency representation learning," in 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2021, pp. 1–6.

[9] Sanchita Ghose and John J Prevost, "FoleyGAN: Visually guided generative adversarial network-based synchronous sound generation in silent videos," IEEE Transactions on Multimedia, 2022.

[10] Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, and Chuang Gan, "Generating visually aligned sound from videos," IEEE Transactions on Image Processing, vol. 29, pp. 8292–8302, 2020.

[11] Santiago Pascual, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, and Joan Serrà, "Full-band general audio synthesis with score-based diffusion," in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[12] Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu, "Diffsound: Discrete diffusion model for text-to-sound generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.

[13] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley, "AudioLDM: Text-to-audio generation with latent diffusion models," arXiv preprint arXiv:2301.12503, 2023.

[14] Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi, "AudioGen: Textually guided audio generation," arXiv preprint arXiv:2209.15352, 2022.

[15] Chenye Cui, Zhou Zhao, Yi Ren, Jinglin Liu, Rongjie Huang, Feiyang Chen, Zhefeng Wang, Baoxing Huai, and Fei Wu, "VarietySound: Timbre-controllable video to sound generation via unsupervised information disentanglement," in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[16] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville, "FiLM: Visual reasoning with a general conditioning layer," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, vol. 32.

[17] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan, "WaveGrad: Estimating gradients for waveform generation," arXiv preprint arXiv:2009.00713, 2020.

[18] Sawyer Birnbaum, Volodymyr Kuleshov, Zayd Enam, Pang Wei W Koh, and Stefano Ermon, "Temporal FiLM: Capturing long-range sequence dependencies with feature-wise modulations," Advances in Neural Information Processing Systems, vol. 32, 2019.

[19] Bongjun Kim, Madhav Ghei, Bryan Pardo, and Zhiyao Duan, "Vocal Imitation Set: A dataset of vocally imitated sound events using the AudioSet ontology," in DCASE, 2018, pp. 148–152.

[20] Mark Cartwright and Bryan Pardo, "VocalSketch: Vocally imitating audio concepts," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015, pp. 43–46.

[21] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho, "Variational diffusion models," Advances in Neural Information Processing Systems, vol. 34, pp. 21696–21707, 2021.

[22] Jonathan Ho and Tim Salimans, "Classifier-free diffusion guidance," arXiv preprint arXiv:2207.12598, 2022.

[23] Shawn Hershey, Sourish Chaudhuri, Daniel P W Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., "CNN architectures for large-scale audio classification," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.

[24] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.