
T-FOLEY: A CONTROLLABLE WAVEFORM-DOMAIN DIFFUSION MODEL FOR TEMPORAL-EVENT-GUIDED FOLEY SOUND SYNTHESIS

Yoonjin Chung∗♯, Junwon Lee∗♯♭, Juhan Nam

♯ Graduate School of Artificial Intelligence, KAIST, Republic of Korea
♭ Graduate School of Culture Technology, KAIST, Republic of Korea

ABSTRACT

Foley sound, audio content inserted synchronously with videos in post-production, plays a crucial role in the user experience of multimedia content. Recently, Foley sound synthesis has been actively studied, leveraging advances in deep generative models. However, existing work mainly focuses on mimicking a particular sound class, either as a single event or as a holistic context, without temporal information about the individual sources. We present T-Foley, a Temporal-event-guided waveform generation model for Foley sound synthesis. T-Foley generates high-quality audio from two conditions: a sound class and a temporal event condition. The temporal event condition is implemented with Block-FiLM, a novel conditioning method derived from Temporal FiLM. We show that T-Foley achieves superior performance on both objective and subjective evaluation metrics and generates Foley sounds that are well synchronized with the temporal event condition. We particularly use vocal mimicking datasets paired with Foley sounds for temporal event control, considering their intuitive usage in real-world application scenarios.

Index Terms— Foley Sound Synthesis, Controllable Sound Generation, General Audio Synthesis, Waveform Domain Diffusion

Fig. 1. A description of temporal-event-guided Foley sound synthesis. The model generates waveforms with the timbre of the sound class and a loudness that follows the temporal event condition.

1. INTRODUCTION

Foley sound refers to human-created sound effects, such as footsteps or gunshots, that accentuate visual media. The significance of Foley sound lies in its ability to enhance the overall immersive experience across various forms of media [1]. Foley sounds are usually created by Foley artists who record and produce the required sounds manually, synchronizing them with the visual elements.

The advent of neural audio generation has presented an opportunity to automate and streamline the Foley sound creation process, reducing the time and effort required for sound production. To synthesize sounds of specific categories, early studies usually focused on single sound sources such as footsteps [2], laughter [3], and drums [4, 5]. Subsequent research improved models to generate multiple sound categories using auto-regressive models [6, 7, 8], Generative Adversarial Networks (GANs) [9, 10], or diffusion models [11]. Recently, it has become possible to generate holistic scene sounds based solely on a text description, even without pre-defined sound categories [12, 13, 14].

While previous work showed that neural audio models can faithfully synthesize the target sound, few of them pay attention to the timing of the sound when it includes multiple sound events. Locating sound events at the right time is critical, considering the practical use of Foley sound synthesis. Some studies generated implicit temporal features from the input video during synthesis [6, 9, 10, 15]. However, they do not provide explicit temporal event conditions for controllability, and they did not provide quantitative analysis due to the absence of an adequate metric.

This research aims to address this challenge and produce realistic, timing-aligned Foley sound effects for a given sound category. To the best of our knowledge, this is the first attempt to generate audio with explicit temporal event conditions. Our contributions are as follows. First, we propose T-Foley, a temporal-event-guided diffusion model conditioned on a sound class to generate high-quality Foley sound. For the temporal guidance, we define the temporal event feature to represent the timing of sounds. To devise a conditioning method that reflects this temporally informative condition, we introduce Block-FiLM, a modification of FiLM [16] for block-wise affine transformation. Second, we conduct extensive experiments to validate the performance and provide a comparative analysis of temporal conditioning methods and temporal event features. The evaluation includes both objective and subjective metrics. Lastly, we show the potential applications of neural Foley sound synthesis by demonstrating its performance with a human voice that mimics the target sound events, an intuitive way to capture temporal event features.

∗ These authors contributed equally. This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)) and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00222383).

2. T-FOLEY

As shown in Figure 1, the model is designed to take a sound class category and a temporal event condition as input, representing "what" and "when" the sound should be generated, respectively.
Fig. 2. (a) Overall architecture of the proposed model (c: sound class, σ: diffusion timestep, T: temporal event feature). (b) Detailed structure of a Down/Up sampling block; each Down block begins with a strided convolutional layer, while Up blocks use a transposed one (h_in/h_out: latent features). (c) Comparison of FiLM, TFiLM, and the proposed BFiLM (Y: conditioning input, X: input activation).

2.1. Overall Architecture

The architecture of T-Foley is depicted in Figure 2. The U-Net structure, including an auto-regressive module in the bottleneck, is borrowed from DAG [11], a state-of-the-art Foley sound synthesis model that operates at high resolution (44.1 kHz) using a waveform-domain diffusion model without leveraging pretrained modules. Compared with CRASH [5], the first diffusion model to generate waveforms from scratch, DAG adds a sequential module at the bottleneck of a deeper U-Net architecture to address the problem of inconsistent timbre within a generated sample. To predict the noise at each diffusion timestep, the model first downsamples the noised signal x into a latent vector and passes it to a bidirectional LSTM to maintain timbre consistency within a sample. The output of the bottleneck layer is resized by linear projections and finally upsampled into the noise prediction ϵ̂. Within each downsampling and upsampling block, convolution layers are conditioned on the diffusion timestep embedding σ (as in [5]) and the class embedding c through FiLM, and on the temporal event feature T through Block-FiLM. The model is trained end-to-end to minimize the loss function proposed in CRASH [5].
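To illustrate where these conditions enter the network, the following is a minimal PyTorch sketch of a single downsampling block, assuming the FiLM and Block-FiLM modules are passed in as callables. Channel sizes, kernel sizes, and the exact way σ and c are combined are hypothetical; the paper does not specify these details.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    """Hypothetical downsampling block: a strided conv followed by conditioned convs.

    `film` is assumed to modulate on the diffusion-step and class embeddings,
    `bfilm` on the temporal event feature T (see Section 2.3). This is a sketch,
    not the authors' implementation.
    """
    def __init__(self, c_in: int, c_out: int, film: nn.Module, bfilm: nn.Module):
        super().__init__()
        self.down = nn.Conv1d(c_in, c_out, kernel_size=4, stride=2, padding=1)
        self.conv = nn.Conv1d(c_out, c_out, kernel_size=3, padding=1)
        self.film = film
        self.bfilm = bfilm

    def forward(self, x, sigma_emb, class_emb, temporal_event):
        h = self.down(x)   # strided convolution first, as in Fig. 2(b)
        h = self.conv(h)
        # Condition on diffusion timestep and class via FiLM (concatenation is assumed),
        # then on the temporal event feature via Block-FiLM.
        h = self.film(h, torch.cat([sigma_emb, class_emb], dim=-1))
        h = self.bfilm(h, temporal_event)
        return torch.relu(h)
```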
2.2. Temporal Event Feature

As the primary objective of T-Foley is to generate audio following a temporal event condition, it is essential for the model to learn temporal information about the occurrence and transition of sound events over time. This necessitates an appropriate temporal conditioning feature for the sound events. We used the root-mean-square (RMS) of the waveform, a widely used frame-level amplitude envelope feature, defined as

\mathrm{RMS}(i) = \sqrt{\frac{1}{W}\sum_{t=i}^{i+W} x^2(t)}    (1)

where W is the window size and h is the hop size. In our experiments, we set W = 512 and h = 128. We also considered power (the square of RMS) and onset/offset (the start and end of a particular sound event) as candidates. After a preliminary experiment, we decided to use RMS because there was no significant difference between RMS and power, and some categories (e.g., rain) do not have definite onsets and offsets by the nature of the sounds.
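As a concrete illustration, the frame-level RMS in Eq. (1) can be computed as in the short NumPy sketch below, using the stated settings (W = 512, h = 128). The framing and edge handling are assumptions for illustration; the paper does not specify them.

```python
import numpy as np

def temporal_event_feature(x: np.ndarray, window: int = 512, hop: int = 128) -> np.ndarray:
    """Frame-level RMS envelope of a mono waveform x, following Eq. (1).

    Framing without padding is an assumption, not the authors' exact choice.
    """
    n_frames = 1 + (len(x) - window) // hop
    rms = np.empty(n_frames, dtype=np.float32)
    for i in range(n_frames):
        frame = x[i * hop: i * hop + window]
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms

# Example: a 4-second clip at 22,050 Hz yields a few hundred RMS frames.
x = np.random.randn(4 * 22050).astype(np.float32)
print(temporal_event_feature(x).shape)
```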
2.3. Block FiLM

We propose Block FiLM as a neural module to condition the generation model on the temporal event feature. FiLM is a Conditional Batch Normalization (CBN) technique that modulates the activations of individual feature maps or channels with an affine transformation based on a conditioning input. It has been widely used in various tasks, including image synthesis, style transfer, and diffusion models for waveform generation [17, 5, 11]. Mathematically, given a conditioning input y ∈ R^{C_in×L_in}, where C_in is the input channel size and L_in is the input length, the FiLM modulation can be expressed as FiLM(x, y, γ, β) = γ ⊙ x + β, where x represents the input activations, y is the conditioning input, and γ, β ∈ R^{C_out} are normalizing parameters obtained as (γ, β) = MLP(y). The ⊙ symbol denotes the Hadamard product (element-wise multiplication).
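To make the modulation concrete, here is a minimal PyTorch sketch of FiLM as described above: an MLP maps the conditioning input to a per-channel scale and shift. The single-layer MLP and the conditioning vector shape are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: gamma * x + beta, one (gamma, beta) per channel."""
    def __init__(self, cond_dim: int, num_channels: int):
        super().__init__()
        # Maps the condition to one (gamma, beta) pair per output channel.
        self.mlp = nn.Linear(cond_dim, num_channels * 2)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length); cond: (batch, cond_dim)
        gamma, beta = self.mlp(cond).chunk(2, dim=-1)
        return gamma.unsqueeze(-1) * x + beta.unsqueeze(-1)

# Usage: modulate a latent feature map with a class/timestep embedding.
film = FiLM(cond_dim=128, num_channels=64)
x = torch.randn(2, 64, 1024)
cond = torch.randn(2, 128)
print(film(x, cond).shape)  # torch.Size([2, 64, 1024])
```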
Temporal FiLM (TFiLM) was proposed to overcome the limitation of FiLM with respect to time-varying information in conditioning signals, which is crucial for processing audio. Given a sequential conditioning input y ∈ R^{C_in×L_in}, where C_in is the input channel size and L_in is the sequential length in the time domain, TFiLM first splits y into N blocks. The i-th block Ŷ_i ∈ R^{C_in×L_in/N} (i = 1, ..., N) is max-pooled in the time dimension, Ŷ_i^pool = Max-Pool_{1:L_in/N}(Ŷ_i) ∈ R^{C_in}, and then passed to an RNN for sequential modeling to obtain the normalizing parameters: (γ_i, β_i), h_i = RNN(Ŷ_i^pool; h_{i−1}). Finally, a linear modulation is applied channel-wise according to the normalizing parameters:

\hat{X}_i^{\mathrm{output}} = (\mathbf{1}_{L_{out}/N}\,\gamma_i^{T}) \odot \hat{X}_i + (\mathbf{1}_{L_{out}/N}\,\beta_i^{T}) \in \mathbb{R}^{C_{out}\times L_{out}/N}    (2)

where γ_i, β_i ∈ R^{C_out} and 1_d = [1, ..., 1]^T ∈ R^d. TFiLM was originally proposed for self-conditioning to modulate intermediate features; therefore, C_in = C_out and L_in = L_out, since the X̂_i blocks and the Ŷ_i blocks both come from the same signal (x = y). In this paper, we generalize it to the case where the conditioning signal y ∈ R^{C_in×L_in} and the modulating signal x ∈ R^{C_out×L_out} are different. However, incorporating TFiLM in every conditioning layer can lead to high computational complexity, prolonging both training and inference.

Block FiLM (BFiLM) is a simplified version of TFiLM that reduces computational cost while preserving the benefits of TFiLM. We adopt the block-wise transformation of TFiLM but replace the sequential modeling layer with a simple MLP layer:

(\gamma_i, \beta_i) = \mathrm{MLP}(\hat{Y}_i^{\mathrm{pool}}).    (3)

T-Foley is designed under the assumption that the base U-Net architecture already has a sequential module, which is capable of handling the sequence modeling among the blocks. We demonstrate the performance and efficiency of BFiLM in Section 4.1.
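The following PyTorch sketch contrasts BFiLM with TFiLM: the conditioning sequence is split into N blocks, each block is max-pooled over time, and a plain MLP (instead of an RNN) predicts the per-block (γ_i, β_i) used for block-wise affine modulation, as in Eqs. (2) and (3). Layer sizes and the exact pooling layout are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class BlockFiLM(nn.Module):
    """Block-wise FiLM (BFiLM): per-block (gamma, beta) from a max-pooled condition block.

    TFiLM's RNN is replaced by a plain MLP (here a single linear layer, an
    illustrative assumption), following Eq. (3).
    """
    def __init__(self, cond_channels: int, num_channels: int, num_blocks: int = 49):
        super().__init__()
        self.num_blocks = num_blocks
        self.mlp = nn.Linear(cond_channels, num_channels * 2)

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (B, C_out, L_out) activations; y: (B, C_in, L_in) conditioning signal.
        b, c_out, l_out = x.shape
        n = self.num_blocks
        # Split the condition into N blocks and max-pool each block over time.
        y_pool = y.reshape(b, y.shape[1], n, -1).amax(dim=-1)          # (B, C_in, N)
        gamma, beta = self.mlp(y_pool.transpose(1, 2)).chunk(2, dim=-1)  # each (B, N, C_out)
        # Apply each block's affine modulation to its L_out/N frames, as in Eq. (2).
        x_blocks = x.reshape(b, c_out, n, -1)                          # (B, C_out, N, L_out/N)
        gamma = gamma.transpose(1, 2).unsqueeze(-1)                    # (B, C_out, N, 1)
        beta = beta.transpose(1, 2).unsqueeze(-1)
        return (gamma * x_blocks + beta).reshape(b, c_out, l_out)

# Usage: condition 64-channel activations on a 1-channel RMS envelope.
bfilm = BlockFiLM(cond_channels=1, num_channels=64, num_blocks=49)
x = torch.randn(2, 64, 49 * 20)   # L_out chosen divisible by N for simplicity
y = torch.randn(2, 1, 49 * 14)    # L_in chosen divisible by N
print(bfilm(x, y).shape)          # torch.Size([2, 64, 980])
```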
3. EXPERIMENTAL SETUP

3.1. Datasets

We utilized the Foley sound dataset provided for the Foley sound synthesis task of the 2023 DCASE challenge [1]. The dataset comprises approximately 5k class-labeled sound samples (5.4 hours) sourced from three different datasets. The samples are labeled with seven Foley sound classes: DogBark, Footstep, GunShot, Keyboard, MovingMotorVehicle, Rain, and Sneeze Cough. All audio samples are mono, 16-bit, 22,050 Hz, and 4 seconds long. In this work, we used about 95% of the development dataset for training and 5% for validation.
While controllable Foley sound synthesis holds great potential, it is not trivial for users to express the desired temporal events effectively. For more intuitive and easy conditioning, we used human voices that mimic Foley sounds as references from which to extract temporal event conditions. In particular, we used subsets of two vocal mimicking datasets paired with the original target sounds: Vocal Imitation Set [19] and VocalSketch [20]. We adjusted the duration of each audio sample representing the 6 sound classes (excluding Sneeze Cough) to match the training data.

3.2. Experimental Details

We train our model to estimate the reparameterized score function of a normal transition kernel with variance-preserving cosine scheduling, as proposed in [5]. For conditional sampling, a DDPM-like discretization of the SDE [21] is chosen with classifier-free guidance [22]. During 500 epochs of training, we randomly dropped the conditions with probability p = 0.1 to also train the unconditional scenario [11].
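To illustrate the classifier-free guidance setup described above, the sketch below shows how conditions could be dropped with probability p = 0.1 during training and how conditional and unconditional predictions could be mixed at sampling time. This is a generic, hedged sketch of classifier-free guidance [22], not the authors' training loop; the `model` interface, the joint dropping of both conditions, and the guidance weight are assumptions.

```python
import torch

DROP_PROB = 0.1  # probability of replacing the conditions with "null" conditions

def maybe_drop_conditions(class_emb, event_feat, null_class, null_event):
    """Randomly drop the condition pair during training (classifier-free guidance)."""
    if torch.rand(1).item() < DROP_PROB:
        return null_class, null_event
    return class_emb, event_feat

def guided_prediction(model, x, sigma, class_emb, event_feat, null_class, null_event, w=1.5):
    """Mix conditional and unconditional predictions at sampling time.

    The guidance weight w and the model signature are illustrative assumptions.
    """
    eps_cond = model(x, sigma, class_emb, event_feat)
    eps_uncond = model(x, sigma, null_class, null_event)
    return eps_uncond + w * (eps_cond - eps_uncond)
```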
3.3. Evaluation

Objective Evaluation of the audio generation models relies on three metrics: Fréchet Audio Distance (FAD), Inception Score (IS), and Event-L1 Norm (E-L1). FAD and IS measure how well the generated sounds align with the given class condition and how diverse they are. For FAD, we use two classifiers: the VGGish model [23] (FAD-V, 16 kHz) and PANNs [24] (FAD-P, 32 kHz). IS also uses PANNs. To verify the effectiveness of the temporal condition, we propose the Event-L1 norm (E-L1), which assesses how well the generated sounds adhere to the given temporal event condition. We use the L1 distance between the event timing feature of the target sample and that of the corresponding generated sample:

\text{E-L1} = \frac{1}{k}\sum_{i=1}^{k} \lVert E_i - \hat{E}_i \rVert    (4)

where x(t) (t ∈ [0, T]) is the audio waveform, E_i is the ground-truth event feature of the i-th frame, and Ê_i is the predicted one. We average the class-wise scores in each case.
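A minimal NumPy sketch of the E-L1 metric in Eq. (4), computed between the RMS-based temporal event features of a target clip and a generated clip. It reuses the hypothetical `temporal_event_feature` helper from the sketch in Section 2.2 and is an illustration, not the official evaluation code.

```python
import numpy as np

def event_l1(target_feat: np.ndarray, generated_feat: np.ndarray) -> float:
    """Event-L1 norm (Eq. 4): mean absolute difference between frame-level event features."""
    k = min(len(target_feat), len(generated_feat))  # guard against off-by-one frame counts
    return float(np.mean(np.abs(target_feat[:k] - generated_feat[:k])))

# Usage with the RMS envelopes from the Section 2.2 sketch (hypothetical helper):
# e_l1 = event_l1(temporal_event_feature(target_wav), temporal_event_feature(generated_wav))
```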

Subjective Evaluation was conducted with a total of 23 participants. The evaluation comprised 26 questions, categorized as follows: 1) 14 questions evaluating samples generated with three different conditioning methods (FiLM, TFiLM, and BFiLM), all based on a single conditioning real data sample; 2) 12 questions assessing the generated sample corresponding to each vocal mimicking input condition. Participants rated the generated samples on three criteria: Temporal Fidelity (TF) for the alignment of the generated samples with the temporal event condition of the target sample, Category Fidelity (CF) for the suitability of the generated samples to the given category, and Audio Quality (AQ) for the overall quality of the audio. The scoring follows a numerical rating from 1 to 5 in increments of 0.5.

4. RESULTS

4.1. Temporal Event Conditioning Methods

We compare the performance of different conditioning methods for the temporal event condition. Tables 1 and 2 present the quantitative scores and the MOS (Mean Opinion Scores) from the qualitative evaluation. Overall, TFiLM and BFiLM, which consider the temporal aspect of the event condition, received higher scores on most of the objective and subjective metrics. Notably, BFiLM demonstrated markedly superior performance in most of the results, achieving these improvements with approximately 0.7 times the parameters of TFiLM and a shorter inference time. These results validate that BFiLM is efficient and suitable for our task. FiLM may have a high IS value because it generates diverse low-quality audio that differs from the ground truth. In the other experiments, we use only BFiLM.

Table 1. Results of generation without and with the event timing condition using FiLM, Temporal FiLM (TFiLM), and Block FiLM (BFiLM). (#params: number of trainable parameters; infer.t: approximate inference time for one sample; E-L1: Event L1 loss; FAD-P and FAD-V: FADs based on PANNs and VGGish; IS: Inception Score.)

Model          #params↓  infer.t↓  E-L1↓   FAD-P↓  FAD-V↓  IS↑
Real data      -         -         0.0     22.81   4.06    2.18
w/o condition  87M       12s       0.2212  53.94   36.10   1.46
FiLM [16]      83M       6.3s      0.0772  54.59   36.06   1.94
TFiLM [18]     101M      13s       0.0469  49.44   36.10   1.74
BFiLM          74M       9.5s      0.0367  41.59   36.09   1.79

Table 2. Comparison of Mean Opinion Scores (MOS). Means and 95% confidence intervals are reported.

Model       Category Fidelity↑  Temporal Fidelity↑  Audio Quality↑
FiLM [16]   3.85 (±0.12)        4.11 (±0.10)        3.28 (±0.11)
TFiLM [18]  4.02 (±0.11)        4.00 (±0.13)        3.75 (±0.11)
BFiLM       4.22 (±0.11)        4.41 (±0.09)        4.06 (±0.10)
4.2. Effect of Block Numbers

The block number N in Section 2.3 is an important hyperparameter that controls the resolution of the condition: fewer blocks lead to sparser and smoother conditioning information along the temporal axis. We compare the performance of different block numbers in Figure 3. In terms of accuracy, E-L1 decreases as the number of blocks increases, as expected. FAD-P also decreases, with no significant difference observed among 49, 98, and 245 blocks. In terms of efficiency, inference time increases with the block number, as more computation is required. Considering the tradeoff between accuracy and efficiency, we use 49 blocks in the other experiments.

Fig. 3. Tradeoff between performance (E-L1, FAD-P) and efficiency (inference time) for different numbers of blocks.
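To give a sense of what the block number means in time, the short calculation below estimates the temporal span of each block for a 4-second, 22,050 Hz clip with the RMS settings from Section 2.2 (W = 512, h = 128). The frame count depends on framing conventions, so these numbers are approximate, illustrative assumptions.

```python
# Approximate temporal resolution per conditioning block (illustrative).
sample_rate = 22050
duration_s = 4.0
hop = 128

n_frames = int(duration_s * sample_rate) // hop      # roughly 690 RMS frames
for n_blocks in (245, 98, 49, 14, 7):
    frames_per_block = n_frames / n_blocks
    seconds_per_block = frames_per_block * hop / sample_rate
    print(f"N={n_blocks:>3}: ~{frames_per_block:.1f} frames/block "
          f"(~{seconds_per_block * 1000:.0f} ms)")
# With N = 49, each block spans roughly 80 ms of audio.
```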
4.3. Evaluation on Vocal Mimicking Datasets

Table 3 summarizes the results for different block numbers, showing comparable performance on vocal sounds. E-L1 decreases for larger block numbers, which is consistent with Section 4.2. On the other hand, FAD-P is lowest around 14 and 49 blocks. This results from the discrepancy between Foley sound and voice: the two sounds have different RMS curves owing to differences in timbre and energy envelope. Choosing the right block number therefore helps smooth the conditioning RMS signal of the vocal audio. The MOS for the model with N = 49 was measured as CF = 4.41 (±0.09), TF = 4.40 (±0.10), and AQ = 4.34 (±0.08), which is comparable to the results in Table 2. To further showcase the performance of our model, we conducted experiments with additional temporal event conditions, using manually recorded clapping sounds or human voices to provide the desired timing cues.

Table 3. Evaluation on vocal mimicking datasets.

Block #   Vocal Imitation        VocalSketch
          E-L1↓     FAD-P↓       E-L1↓     FAD-P↓
245       0.0228    74.94        0.0186    59.33
98        0.0300    69.13        0.0274    56.51
49        0.0306    64.55        0.0299    49.62
14        0.0764    59.56        0.0635    58.93
7         0.0935    66.23        0.0914    60.41
4.4. Case Study

To showcase the performance and usability of T-Foley, we present two case studies. First, we compare the output of our proposed BFiLM method with that of FiLM and TFiLM. In Figure 4, BFiLM exhibits the highest alignment with the mel-spectrogram of the target conditioning sound. Both FiLM and TFiLM generate unclear and undesirable sound events for the Footstep sound class. For GunShot and Sneeze Cough, only BFiLM seems to reflect well the sustain and decay after the attack of the sound.

Fig. 4. The first row shows the control sounds used to extract the target event feature. Subsequent rows show three classes of Foley sounds generated with different conditioning blocks (FiLM, TFiLM, and BFiLM). All are displayed as mel-spectrograms.

Furthermore, generating Foley sounds with temporal event conditions yields considerably more realistic results than manual Foley sound manipulation. We exemplify two specific scenarios in Figure 5. The first scenario involves consecutive machine-gun shots: manually adjusting and copying individual gunshot sound snippets can result in an unnatural audio sequence, whereas employing T-Foley with concatenated temporal event conditions leads to a seamless and lifelike sound. The second is a key-typing sound with two contrasting examples: typing vigorously on a typewriter and softly pressing keys on a plastic keyboard. T-Foley can generate these two sounds from the same temporal event feature by adjusting its gain. This indicates that the level of the temporal event feature controls not only the amplitude envelope but also the timbre texture.

Fig. 5. (a) Comparison of manually synthesized consecutive gunshot sounds with sounds generated through the temporal event feature. (b) Sounds generated with the original temporal event features and with the gain reduced by 10.

The demo examples are accessible on the companion website.1

1 [Link]
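As a small illustration of the second case study, the sketch below scales the temporal event feature before conditioning. The scale factor of 0.1 is an assumption standing in for "gain reduced by 10"; the paper does not specify whether the reduction is linear or in decibels, and the feature values here are placeholders.

```python
import numpy as np

# event_feat: frame-level RMS condition extracted from a control recording
# (e.g., via the temporal_event_feature sketch in Section 2.2).
event_feat = np.abs(np.random.randn(689)).astype(np.float32)   # placeholder values

vigorous_cond = event_feat        # original level  -> e.g. vigorous typewriter-like sound
soft_cond = 0.1 * event_feat      # reduced gain    -> e.g. soft plastic-keyboard-like sound
```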
5. CONCLUSION

This study presents T-Foley, a Foley sound generation system that addresses controllability in the temporal domain. By introducing the temporal event feature and the E-L1 metric, we show that Block-FiLM, our proposed conditioning method, is effective and powerful in terms of quality and efficiency. We also demonstrate the strong performance of T-Foley on vocal mimicking datasets, supporting its practical usability.

6. REFERENCES

[1] Keunwoo Choi, Jaekwon Im, Laurie Heller, Brian McFee, Keisuke Imoto, Yuki Okamoto, Mathieu Lagrange, and Shinnosuke Takamichi, "Foley sound synthesis at the DCASE 2023 challenge," arXiv preprint arXiv:2304.12521, 2023.

[2] Marco Comunità, Huy Phan, and Joshua D. Reiss, "Neural synthesis of footsteps sound effects with generative adversarial networks," arXiv preprint arXiv:2110.09605, 2021.

[3] M. Mehdi Afsar, Eric Park, Étienne Paquette, Gauthier Gidel, Kory W. Mathewson, and Eilif Muller, "Generating diverse realistic laughter for interactive art," arXiv preprint arXiv:2111.03146, 2021.

[4] Javier Nistal, Stefan Lattner, and Gael Richard, "DrumGAN: Synthesis of drum sounds with timbral feature conditioning using generative adversarial networks," arXiv preprint arXiv:2008.12073, 2020.

[5] Simon Rouard and Gaëtan Hadjeres, "CRASH: Raw audio score-based generative modeling for controllable high-resolution drum sound synthesis," arXiv preprint arXiv:2106.07431, 2021.

[6] Sanchita Ghose and John Jeffrey Prevost, "AutoFoley: Artificial synthesis of synchronized sound tracks for silent videos with deep learning," IEEE Transactions on Multimedia, vol. 23, pp. 1895–1907, 2020.

[7] Yuexi Du, Ziyang Chen, Justin Salamon, Bryan Russell, and Andrew Owens, "Conditional generation of audio from video via Foley analogies," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2426–2436.

[8] Xubo Liu, Turab Iqbal, Jinzheng Zhao, Qiushi Huang, Mark D. Plumbley, and Wenwu Wang, "Conditional sound generation using neural discrete time-frequency representation learning," in 2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2021, pp. 1–6.

[9] Sanchita Ghose and John J. Prevost, "FoleyGAN: Visually guided generative adversarial network-based synchronous sound generation in silent videos," IEEE Transactions on Multimedia, 2022.

[10] Peihao Chen, Yang Zhang, Mingkui Tan, Hongdong Xiao, Deng Huang, and Chuang Gan, "Generating visually aligned sound from videos," IEEE Transactions on Image Processing, vol. 29, pp. 8292–8302, 2020.

[11] Santiago Pascual, Gautam Bhattacharya, Chunghsin Yeh, Jordi Pons, and Joan Serrà, "Full-band general audio synthesis with score-based diffusion," in ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[12] Dongchao Yang, Jianwei Yu, Helin Wang, Wen Wang, Chao Weng, Yuexian Zou, and Dong Yu, "Diffsound: Discrete diffusion model for text-to-sound generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023.

[13] Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D. Plumbley, "AudioLDM: Text-to-audio generation with latent diffusion models," arXiv preprint arXiv:2301.12503, 2023.

[14] Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, and Yossi Adi, "AudioGen: Textually guided audio generation," arXiv preprint arXiv:2209.15352, 2022.

[15] Chenye Cui, Zhou Zhao, Yi Ren, Jinglin Liu, Rongjie Huang, Feiyang Chen, Zhefeng Wang, Baoxing Huai, and Fei Wu, "VarietySound: Timbre-controllable video to sound generation via unsupervised information disentanglement," in ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.

[16] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville, "FiLM: Visual reasoning with a general conditioning layer," in Proceedings of the AAAI Conference on Artificial Intelligence, 2018, vol. 32.

[17] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan, "WaveGrad: Estimating gradients for waveform generation," arXiv preprint arXiv:2009.00713, 2020.

[18] Sawyer Birnbaum, Volodymyr Kuleshov, Zayd Enam, Pang Wei Koh, and Stefano Ermon, "Temporal FiLM: Capturing long-range sequence dependencies with feature-wise modulations," Advances in Neural Information Processing Systems, vol. 32, 2019.

[19] Bongjun Kim, Madhav Ghei, Bryan Pardo, and Zhiyao Duan, "Vocal Imitation Set: A dataset of vocally imitated sound events using the AudioSet ontology," in DCASE, 2018, pp. 148–152.

[20] Mark Cartwright and Bryan Pardo, "VocalSketch: Vocally imitating audio concepts," in Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, 2015, pp. 43–46.

[21] Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho, "Variational diffusion models," Advances in Neural Information Processing Systems, vol. 34, pp. 21696–21707, 2021.

[22] Jonathan Ho and Tim Salimans, "Classifier-free diffusion guidance," arXiv preprint arXiv:2207.12598, 2022.

[23] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, et al., "CNN architectures for large-scale audio classification," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.

[24] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.
