
Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model


Jiawen Huang and Emmanouil Benetos
Centre for Digital Music, Queen Mary University of London, London, UK
{[Link], [Link]}@[Link]

Abstract—Multilingual automatic lyrics transcription (ALT) is a challenging task due to the limited availability of labelled data and the challenges introduced by singing, compared to multilingual automatic speech recognition. Although some multilingual singing datasets have been released recently, English continues to dominate these collections. Multilingual ALT remains under-explored due to the scale of data and annotation quality. In this paper, we aim to create a multilingual ALT system with available datasets. Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario by expanding the target vocabulary set. We then evaluate the performance of the multilingual model in comparison to its monolingual counterparts. Additionally, we explore various conditioning methods to incorporate language information into the model. We analyse performance by language and relate it to the model's language classification performance. Our findings reveal that the multilingual model performs consistently better than the monolingual models trained on the language subsets. Furthermore, we demonstrate that incorporating language information significantly enhances performance.

Index Terms—automatic lyrics transcription, multilingual, singing voice, music information retrieval

I. INTRODUCTION

Automatic lyrics transcription (ALT) is the task of recognising lyrics from singing voice. Access to lyrics enriches the listening experience by arousing sympathy and building a deeper connection between the music and the listeners. Moreover, lyrics transcription can benefit other music analysis tasks as well, including lyrics alignment [4], singing pronunciation analysis [3], and cover song identification [19]. While ALT shares the same input/output format and similar objectives with automatic speech recognition (ASR), it is more challenging due to larger variations in rhythm, pitch, and pronunciation [6], [11]. In recent years, significant progress has been achieved in ALT for English songs using end-to-end models [7], [13], [16]. Stoller et al. [16] developed the first end-to-end lyrics alignment model using the Wave-U-Net architecture and connectionist temporal classification (CTC) loss [8]. Gao et al. [7] enhanced performance by fine-tuning the ALT model in conjunction with a source separation frontend. Ou et al. [13] leveraged wav2vec2 features and applied transfer learning techniques, resulting in a significant performance boost.

Multilingual lyrics transcription, however, remains under-explored due to the limited publicly available training and evaluation data. Although DALI v2 [12] is a singing dataset of moderate size with lyrics annotations, it is dominated by English songs, which comprise over 80% of the dataset. Whisper [14], a robust ASR model introduced by OpenAI and trained on numerous audio-transcript pairs collected from the Internet (with training data that remains unreleased), has demonstrated its effectiveness in multilingual ALT [24]. Building on this work, Wang et al. [21] investigated the potential of adapting it to Mandarin Chinese ALT. A multilingual ALT dataset called MulJam was created by post-processing Whisper's output [24]. With access to these datasets, we develop multilingual ALT models using publicly available data by combining DALI and MulJam.

Compared to English ALT, multilingual ALT faces several additional challenges. Firstly, unless explicitly specified, models must implicitly identify the underlying language of the singing to ensure that the predicted lyrics match the correct character set. Secondly, there is a language imbalance in the datasets: languages like English typically dominate the majority of songs, while other languages are considered low-resource in the context of ALT development. Thirdly, specific characters appear in different languages' alphabets but adhere to different pronunciation rules, adding complexity to the problem.

The multilingual ASR task shares many of the challenges mentioned above. Previous research has extensively studied and compared mono-, cross-, and multilingual models using a single model [9], [18]. These works demonstrate that multilingual models tend to perform better than their mono-/cross-lingual counterparts, particularly in low-resource settings. Furthermore, some studies observe performance gains by incorporating language information through conditioning [18] or predicting language-specific tokens [23]. More recent research takes advantage of unlabeled data through self-supervised learning, such as wav2vec2 [1], [2].

In this work, we aim to investigate the development of multilingual ALT models using publicly accessible data, building on existing work specifically designed for English ALT. We tackle the low-resource aspects of ALT model development by training jointly on data from across a wide range of languages. Additionally, we study the impact of language information via conditioning and multi-task learning. Our contributions can be summarized as follows. Firstly, as one of the first attempts towards multilingual ALT, we propose a small-scale multilingual ALT model trained on publicly available datasets. Secondly, we compare the performance of multilingual models with their monolingual counterparts, revealing that training data in additional languages benefits ALT in low-resource scenarios. Thirdly, we show that language conditioning has a positive impact on performance, while the amount of improvement varies across different languages.

JH is a research student at the UKRI Centre for Doctoral Training in Artificial Intelligence and Music, supported jointly by UK Research and Innovation [grant number EP/S022694/1] and Queen Mary University of London. EB is supported by RAEng/Leverhulme Trust Research Fellowship LTRF2223-19-106.



(a) The multilingual and monolingual models. (b) The language-informed model. (c) The language self-conditioned model.

Fig. 1: Proposed model architectures at training. Our models consist of a convolutional block CNNBlock, a transformer encoder TfmEnc, a transformer decoder TfmDec, and several fully connected layers FC_ctc, FC_s2s, and FC_lang. The dotted lines indicate feature concatenation (with mapping for 1c). In 1b, emb denotes the language embedding.

We acknowledge that previous research in multilingual ASR has explored similar approaches, as mentioned above. However, ALT diverges significantly due to the different acoustic characteristics of the singing voice, the severe shortage of resources, and language imbalances. Therefore, fundamental assumptions need to be verified carefully. Our results indicate that this work provides a solid baseline for future research, as well as an initial step in addressing these new challenges.
II. METHOD

A. Model

Our models are built upon a similar architecture to the state-of-the-art transformer models [7], utilising the hybrid CTC/Attention architecture [22]. Fig. 1a illustrates this architecture at the training stage. The input is an 80-dim Mel-spectrogram computed at a sampling rate of 16 kHz, with an FFT size of 400 and a hop size of 10 ms.
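As a concrete illustration of this input representation, the sketch below computes an equivalent 80-bin Mel-spectrogram with torchaudio; the use of torchaudio and the file name are assumptions, since the authors' features come from their SpeechBrain recipe.

```python
# Minimal sketch of the input features described above: an 80-bin
# Mel-spectrogram at 16 kHz with an FFT size of 400 and a 10 ms hop
# (160 samples). torchaudio is used for illustration only.
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,
    hop_length=160,   # 10 ms at 16 kHz
    n_mels=80,
)

# "line_segment.wav" is a hypothetical 16 kHz mono line-level segment.
waveform, sample_rate = torchaudio.load("line_segment.wav")
mel = mel_transform(waveform)   # shape: (channels, 80, num_frames)
```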
The model consists of a convolutional block, a transformer encoder, a transformer decoder, and two fully connected layers. The target dimension of the two fully connected layers is equal to the size of the target character set N. Let y represent the target lyrics token list, y^bos denote y with a <bos> token added at the beginning, and y^eos indicate y with a <eos> token appended at the end. During training, the Mel-spectrogram is processed through the convolutional block before being passed to the transformer encoder and decoder. Teacher forcing is adopted for faster convergence (the model is provided with the ground-truth tokens y^bos and predicts the next tokens y^eos):

    feat = CNNBlock(mel)                                        (1)
    h = TfmEnc(feat)                                            (2)
    o = TfmDec(h, y^bos)                                        (3)

where h and o are the outputs of the encoder and the decoder, respectively. Then h and o are passed to the two fully connected layers FC_ctc and FC_s2s, which generate the posteriorgrams for the CTC branch and the sequence-to-sequence (seq2seq) branch:

    p_ctc = FC_ctc(h)                                           (4)
    p_s2s = FC_s2s(o)                                           (5)

The loss function is a weighted sum of two components: the CTC loss for alignment-free training (computed from the CTC branch) and the Kullback-Leibler (KL) divergence loss for smooth predictions (computed from the seq2seq branch):

    Loss = α L_ctc(p_ctc, y) + (1 − α) L_s2s(p_s2s, y^eos)      (6)
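The following minimal PyTorch sketch shows how Eqs. (1)-(6) fit together. It is not the authors' SpeechBrain implementation: the convolutional frontend shapes, the projection to the model dimension, and the label-smoothed cross-entropy used in place of the KL-divergence seq2seq loss are illustrative assumptions.

```python
# Sketch of the hybrid CTC/attention model and joint loss of Eqs. (1)-(6).
# Shapes and the label-smoothing approximation of the KL loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridCTCAttention(nn.Module):
    def __init__(self, n_chars, d_model=512, n_heads=4, ff_dim=2048,
                 enc_layers=12, dec_layers=6):
        super().__init__()
        # CNNBlock: strided convolutions that downsample the Mel-spectrogram in time.
        # Positional encoding is omitted here for brevity.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=1, stride=1), nn.ReLU(),
        )
        self.proj = nn.Linear(64 * 20, d_model)  # 80 Mel bins -> 20 after striding
        enc = nn.TransformerEncoderLayer(d_model, n_heads, ff_dim, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, n_heads, ff_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, enc_layers)      # TfmEnc
        self.decoder = nn.TransformerDecoder(dec, dec_layers)      # TfmDec
        self.char_emb = nn.Embedding(n_chars, d_model)
        self.fc_ctc = nn.Linear(d_model, n_chars)                  # FC_ctc
        self.fc_s2s = nn.Linear(d_model, n_chars)                  # FC_s2s

    def forward(self, mel, y_bos):
        # mel: (batch, frames, 80); y_bos: (batch, length) starting with <bos>
        x = self.cnn(mel.unsqueeze(1))                             # Eq. (1)
        b, c, t, f = x.shape
        feat = self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        h = self.encoder(feat)                                     # Eq. (2)
        length = y_bos.size(1)
        causal = torch.triu(torch.full((length, length), float("-inf"),
                                       device=y_bos.device), diagonal=1)
        o = self.decoder(self.char_emb(y_bos), h, tgt_mask=causal) # Eq. (3)
        return self.fc_ctc(h), self.fc_s2s(o)                      # Eqs. (4)-(5)

def joint_loss(p_ctc, p_s2s, y, y_eos, feat_lens, y_lens, alpha=0.3, blank=0):
    # Eq. (6): alpha * CTC loss + (1 - alpha) * seq2seq loss. The KL loss against
    # smoothed targets is approximated here by label-smoothed cross-entropy.
    # feat_lens are the encoder frame lengths after the CNN downsampling.
    ctc = F.ctc_loss(p_ctc.log_softmax(-1).transpose(0, 1), y,
                     feat_lens, y_lens, blank=blank, zero_infinity=True)
    s2s = F.cross_entropy(p_s2s.transpose(1, 2), y_eos, label_smoothing=0.1)
    return alpha * ctc + (1 - alpha) * s2s
```

At inference time, the decoder is instead run autoregressively with beam search (see Section III-C).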
B. Multilingual model and monolingual models

Let there be M languages in the training set {L_1, ..., L_M}, where C_i represents the character set for language L_i. In the multilingual setting, the target character set C is formed by taking the union of all independent character sets, C = C_1 ∪ ... ∪ C_M. In addition, we train individual monolingual models for each language to provide a comparative analysis with the multilingual models. The corresponding training and validation sets are the language-specific subsets derived from the multilingual data.
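The union vocabulary can be sketched as follows; the helper name and the special-token ordering are illustrative, with the special tokens taken from Fig. 2.

```python
# Sketch of forming the multilingual target character set C as the union of
# the per-language character sets C_1, ..., C_M. Special tokens follow Fig. 2.
def build_multilingual_vocab(lyrics_by_language):
    special = ["<eps>", "<unk>", "<bos>", "<eos>"]   # <eps> is the CTC blank
    charset = set()
    for lines in lyrics_by_language.values():        # C = C_1 ∪ ... ∪ C_M
        for line in lines:
            charset.update(line)
    return {token: idx for idx, token in enumerate(special + sorted(charset))}

vocab = build_multilingual_vocab({
    "English": ["never gonna give you up"],
    "German":  ["über den wolken"],
    "Russian": ["миллион алых роз"],
})
print(len(vocab))   # toy example; the full 6-language union in the paper has 91 characters
```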
C. Language-informed models

We condition our multilingual model with language information to study its influence. By providing the model with knowledge of the target language, it has the potential to learn language-specific features through the encoder and to predict characters belonging to the target language's alphabet through the decoder.

To be more specific, an embedding is assigned to each language (Fig. 1b). During training, we explore three approaches: appending the language embedding to the input of the encoder (feat), to the input of the decoder (h), and to both. The three conditioned models are denoted as Enc-Cond, Dec-Cond, and EncDec-Cond, respectively.
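A minimal sketch of this conditioning mechanism is given below. The concatenation of a broadcast language embedding followed by a linear projection back to the model dimension is an assumption; the paper only specifies that a learned embedding of size 5 is appended via feature concatenation.

```python
# Sketch of language-informed conditioning (Section II-C). A learned 5-dim
# language embedding is concatenated to every frame of the conditioned input
# and projected back to the model dimension (the projection is an assumption).
import torch
import torch.nn as nn

class LanguageConditioning(nn.Module):
    def __init__(self, n_languages=6, emb_dim=5, d_model=512):
        super().__init__()
        self.lang_emb = nn.Embedding(n_languages, emb_dim)
        self.proj = nn.Linear(d_model + emb_dim, d_model)

    def forward(self, frames, lang_id):
        # frames: (batch, time, d_model); lang_id: (batch,)
        emb = self.lang_emb(lang_id)                           # (batch, emb_dim)
        emb = emb.unsqueeze(1).expand(-1, frames.size(1), -1)  # repeat over time
        return self.proj(torch.cat([frames, emb], dim=-1))

# Enc-Cond applies this to feat before TfmEnc, Dec-Cond to h before TfmDec,
# and EncDec-Cond to both (with separate conditioning modules).
```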

          Train                Valid      Test
          DALI      MulJam     MulJam     Jamendo
English   295444    85773      542        868
French    11959     33322      760        809
Spanish   10317     15146      566        881
German    26343     3208       710        871
Italian   9164      7807       616        0
Russian   0         1805       317        0

TABLE I: The number of utterances in the training, validation, and test sets in each language.

Fig. 2: The multilingual vocabulary. <bos> and <eos> denote the beginning and the end of a line. <unk> is the unknown token. Epsilon ε is included for the CTC computation.

D. Language self-conditioned model

To gain deeper insights into the model's capability to identify the correct language, we make the language identification ability measurable by taking a multi-task learning approach (Fig. 1c). The output of the encoder is averaged over time and passed to a fully connected layer FC_lang to predict the language ID. The predicted language probability p_l is used as a self-conditioning vector, mapped to the embedding dimension, and appended to the input of the decoder. In this configuration, a cross-entropy loss term for language identification is added to the overall loss function, where l is the language label:

    Loss = α L_ctc(p_ctc, y) + (1 − α) L_s2s(p_s2s, y^eos) + β L_CE(p_l, l)      (7)
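A sketch of this self-conditioning branch and of the loss in Eq. (7) follows. The softmax over the language logits and the concatenation-plus-projection step are assumptions; the paper specifies only the time-averaged encoder output, the FC_lang layer, the mapping to the embedding dimension, and the added cross-entropy term.

```python
# Sketch of the language self-conditioned branch and the loss of Eq. (7).
# Layer shapes and the softmax over language logits are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageSelfConditioning(nn.Module):
    def __init__(self, n_languages=6, emb_dim=5, d_model=512):
        super().__init__()
        self.fc_lang = nn.Linear(d_model, n_languages)   # FC_lang
        self.to_emb = nn.Linear(n_languages, emb_dim)    # posterior -> embedding dim
        self.proj = nn.Linear(d_model + emb_dim, d_model)

    def forward(self, h, dec_in):
        # h: encoder output (batch, time, d_model); dec_in: decoder input (batch, len, d_model)
        p_lang = self.fc_lang(h.mean(dim=1))             # language logits from pooled encoder
        cond = self.to_emb(p_lang.softmax(-1))           # self-conditioning vector
        cond = cond.unsqueeze(1).expand(-1, dec_in.size(1), -1)
        dec_in = self.proj(torch.cat([dec_in, cond], dim=-1))
        return dec_in, p_lang

def total_loss(ctc_loss, s2s_loss, p_lang, lang_label, alpha=0.3, beta=0.1):
    # Eq. (7): joint ALT loss plus a cross-entropy term for language identification.
    return alpha * ctc_loss + (1 - alpha) * s2s_loss + beta * F.cross_entropy(p_lang, lang_label)
```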
III. EXPERIMENTS

A. Datasets

The models are trained on the DALI v2 [12] and MulJam [24] datasets. DALI v2 contains 7756 songs in more than 30 languages, among which we take the 5 languages that have more than 200 songs each: English, French, German, Spanish, and Italian. We segment the songs to line level, with paired lyrics annotations. MulJam contains 6031 songs with line-level lyrics annotations in 6 languages: English, French, German, Spanish, Italian, and Russian. For each language, 20 songs are randomly selected for validation.

The training set is a combination of DALI-train and MulJam-train. It is important to note that the lyrics annotations in DALI do not include accented or special characters; instead, they are converted to the Latin alphabet. Therefore, the validation and test sets of DALI v2 may not represent the real multilingual problem, and we exclusively use the MulJam validation set for validation purposes. For the training and validation sets, we apply additional filtering to exclude utterances with incorrectly annotated durations, excessively long durations (>30 s), and abnormally high character rates (>37.5 Hz). All utterances are source-separated by Open-Unmix [17].

All models are evaluated on the MultiLang Jamendo dataset [5] at line level. It consists of 80 songs in 4 languages: English, French, Spanish, and German. Line-level segments are prepared according to the line-level timestamps provided by the dataset. Tab. I shows the statistics of the data for the multilingual and monolingual experiments.¹

¹ The MulJam test set is not used for evaluation because 1) it is not language-balanced, 2) it is too small for low-resource languages, and 3) lyrics annotation is provided at song level.
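The filtering rules described above amount to a simple per-utterance predicate; the field names and the strictness of the comparisons in this sketch are assumptions.

```python
# Sketch of the training/validation filtering: drop utterances with invalid
# annotated durations, durations over 30 s, or character rates above 37.5 Hz.
def keep_utterance(lyrics: str, start: float, end: float,
                   max_dur: float = 30.0, max_char_rate: float = 37.5) -> bool:
    duration = end - start
    if duration <= 0:                             # incorrectly annotated duration
        return False
    if duration > max_dur:                        # excessively long segment
        return False
    if len(lyrics) / duration > max_char_rate:    # abnormally high character rate
        return False
    return True

print(keep_utterance("never gonna give you up", 12.3, 18.9))   # True
print(keep_utterance("x" * 500, 0.0, 5.0))                     # False (100 chars/s)
```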
B. Model Configuration

The convolutional block contains 3 CNN blocks with 64 channels. The first two layers have a kernel size of 5 and a stride of 2, while the last layer has both the kernel size and stride set to 1. Positional encoding [20] is added to the transformer input before it passes through the encoder. The transformer encoder has 12 layers and the transformer decoder has 6 layers. Each encoder layer consists of a multi-head attention layer and a position-wise feed-forward layer; each decoder layer contains the same, except that the attention is causal. The attention dimension is set to 512, the number of heads is 4, and the position-wise feed-forward layer dimension is 2048.

The loss weighting parameter α is set to 0.3, and β is set to 0.1. The language embedding for the language-informed and language self-conditioned models has a fixed size of 5 for all 6 languages. The union character set, encompassing all 6 languages, has a size of 91. This includes the Latin alphabet, accented and special characters, and the Cyrillic alphabet for Russian.

C. Training and Inference

Our models are built upon the SpeechBrain [15] transformer recipe for ASR.²,³ The number of languages M is 6. The models are trained using the Adam optimizer [10] and the Noam learning rate scheduler [20]. The initial learning rate is 0.001 and the number of warm-up steps is 25000. The number of epochs is 50 for all models, except for the non-English monolingual ones, which are trained for 70 epochs. The checkpoint with the lowest word error rate on the validation set is selected. During validation and testing, beam search is employed on the transformer decoder to select the best prediction autoregressively. The beam size is 10 at validation and 66 at testing. We use Word Error Rate (WER) to assess the performance of ALT models.

² [Link]LibriSpeech/ASR/transformer/hparams/[Link]
³ Our code is available at: [Link]
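For reference, WER is the word-level edit distance between the reference and the predicted lyrics, divided by the number of reference words. A minimal sketch is shown below; it is not the scoring code of the authors' SpeechBrain-based recipe.

```python
# Minimal WER sketch: Levenshtein distance over words, divided by the number
# of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("hello darkness my old friend",
                      "hello darkness my friend"))   # 0.2 (one deletion)
```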

          Transformer     W2V2       Whisper
          Multilingual    XLSR-53    large-v3
English   51.45           42.67      36.80
French    68.40           54.74      49.33
Spanish   68.02           45.02      41.15
German    70.18           49.29      44.52
All       64.31           47.95      42.95

TABLE II: WER (%) of the multilingual transformer, the wav2vec2-based multilingual model, and Whisper.

          Monolingual    Multilingual    Self-condition
English   53.19          51.45           50.99
French    76.06          68.40           70.26
Spanish   79.35          68.02           68.17
German    82.32          70.18           67.48
All       72.37          64.31           64.07

TABLE III: WER (%) of monolingual, multilingual, and language self-conditioned models.

          Multilingual    Enc-Cond    Dec-Cond    EncDec-Cond
English   51.45           51.19       50.61       50.80
French    68.40           65.01       67.33       65.22
Spanish   68.02           62.27       65.44       61.38
German    70.18           63.23       63.95       62.07
All       64.31           60.32       61.71       59.79

TABLE IV: WER (%) of multilingual and language-informed models.

D. Models for comparison

For our core experiments, we intentionally avoid using ALT models or feature extractors pretrained on speech data, such as wav2vec2 and Whisper, although we are aware that incorporating these could benefit performance. This is because using speech models would introduce the influence of the pretrained models' data distribution, making it difficult to estimate whether any difference is due to training on data from more languages or to the knowledge acquired during pretraining. For similar reasons, we avoid using language models.

We still report the WER of a multilingual wav2vec2 (W2V2) variant (large-xlsr-53 [2]) as a reference for the WER range that an adapted state-of-the-art English ALT method [13] can achieve. The W2V2 ALT model uses a wav2vec2 feature extractor frontend and a hybrid CTC/attention backend, similar to the proposed multilingual transformer. The CTC branch is a fully connected layer, while the seq2seq branch is a one-layer recurrent neural network with attention. The attention dimension is set to 256 and the hidden size is also 256. The target character set is the multilingual vocabulary C. All other configurations are the same as in [20].

IV. RESULTS

A. Comparison with the State of the Art

Tab. II lists the performance of our multilingual model, the wav2vec2-based model, and Whisper. As expected, the W2V2-based model outperforms our multilingual transformer, and Whisper performs best due to its training on diverse data from different environments and setups. Even when utilizing pretrained models, the WERs for non-English singing remain higher than that reported for the English monolingual model in [13], indicating the greater challenges of multilingual ALT.

B. Monolingual, multilingual, and language self-conditioning

Tab. III lists the performance of the monolingual, multilingual, and language self-conditioned models. Notably, among the monolingual models, the German model yields a worse WER than the Spanish one despite having more training data. This aligns with the nature of the two languages: Spanish is more phonetic and consistent in its spelling and pronunciation than German. Additionally, German uses compounding more frequently than Spanish, leading to more variation in pronunciation and stress patterns. This suggests that the challenge and data requirements for training ALT models vary across languages.

The multilingual model outperforms the monolingual models in every language, indicating that having more training data in various languages benefits low-resource language ALT. Specifically, while the improvement for English is small (∼2%), it exceeds 7% for all other languages. This suggests that leveraging high-resource language data (English) can be beneficial for low-resource ALT when the target languages exhibit similarities in pronunciation and spelling rules.

Compared to the multilingual model, the self-conditioned model is able to perform the additional language classification task without compromising ALT performance. It is not surprising that the auxiliary task does not bring significant improvements to ALT, as the language class can be rather easily inferred from the predicted lyrics.

C. Language-informed models and the language classification accuracy

Tab. IV shows the performance of the multilingual and language-informed models. Providing the language class as input results in an improved WER for all languages. Among the three conditioning methods, Enc-Cond performs better than Dec-Cond except for English. EncDec-Cond gives the best overall WER, but the trend varies for each language. After conditioning on both the encoder and the decoder, there is a clear improvement for non-English languages, while the English WER remains nearly the same as that of the multilingual model. The French monolingual model has a better WER than the Spanish and German ones, but this reverses after language conditioning. This indicates that, with sufficient data, French ALT might be more challenging than the other two due to its complexity and frequent silent letters.

To gain a deeper understanding of the distinctions among languages, we examine the language classification confusion matrix of the self-conditioned model in Fig. 3. As depicted, languages other than English are often misclassified as English. Additionally, Spanish singing is more commonly mistaken for Italian than for English. It is understandable that, for the dominant language in the training set, the classification accuracy for English is close to 100%, which explains why language conditioning has minimal impact on English ALT.

           English   French    Spanish   German    Italian   Russian
English    0.98      0.0046    0.0035    0.011     0.0023    0
French     0.15      0.8       0.0099    0.019     0.021     0
Spanish    0.087     0.0079    0.78      0.015     0.11      0.0011
German     0.15      0.023     0.018     0.8       0.008     0.0011

Fig. 3: Confusion matrix for the self-conditioned model (values are proportions; rows: reference language, columns: predicted language). Languages other than English frequently get confused with English.

V. CONCLUSION AND FUTURE WORK

In summary, our study addresses the challenges of multilingual ALT model development when training only on publicly available data, particularly in enhancing low-resource languages. We illustrate that multilingual ALT surpasses monolingual ALT for all languages, primarily due to the shared phonetic similarities among them. Our language-conditioned experiments indicate that incorporating language information enhances performance.

In this study, we take a straightforward approach by merging grapheme sets for multilingual vocabulary processing. In future research, we intend to explore more advanced strategies, including adapting target vocabularies using phoneme representations, subword units, and transliteration techniques. We believe that reducing target ambiguity is particularly critical for effective training in low-resource settings.

REFERENCES

[1] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[2] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, "Unsupervised cross-lingual representation learning for speech recognition," in Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021. ISCA, 2021, pp. 2426–2430.
[3] E. Demirel, S. Ahlbäck, and S. Dixon, "Computational pronunciation analysis in sung utterances," in 29th European Signal Processing Conference, EUSIPCO 2021, Dublin, Ireland, August 23-27, 2021. IEEE, 2021, pp. 186–190.
[4] E. Demirel, S. Ahlbäck, and S. Dixon, "Low resource audio-to-lyrics alignment from polyphonic music recordings," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021. IEEE, 2021, pp. 586–590.
[5] S. Durand, D. Stoller, and S. Ewert, "Contrastive learning-based audio to lyrics alignment for multiple languages," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[6] H. Fujihara and M. Goto, "Lyrics-to-audio alignment and its application," in Multimodal Music Processing, ser. Dagstuhl Follow-Ups. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2012, vol. 3, pp. 23–36.
[7] X. Gao, C. Gupta, and H. Li, "PolyScriber: Integrated fine-tuning of extractor and lyrics transcriber for polyphonic music," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 1968–1981, 2023.
[8] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML, vol. 148. ACM, 2006, pp. 369–376.
[9] G. Heigold, V. Vanhoucke, A. W. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, "Multilingual acoustic models using distributed deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013. IEEE, 2013, pp. 8619–8623.
[10] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[11] A. M. Kruspe, "Keyword spotting in a-capella singing," in Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR, Taipei, Taiwan, October 27-31, 2014, pp. 271–276.
[12] G. Meseguer-Brocal, A. Cohen-Hadria, and G. Peeters, "Creating DALI, a large dataset of synchronized audio, lyrics, and notes," Transactions of the International Society for Music Information Retrieval, vol. 3, no. 1, 2020.
[13] L. Ou, X. Gu, and Y. Wang, "Transfer learning of wav2vec 2.0 for automatic lyric transcription," in Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022, Bengaluru, India, December 4-8, 2022, 2022, pp. 891–899.
[14] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 28492–28518.
[15] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, "SpeechBrain: A general-purpose speech toolkit," 2021, arXiv:2106.04624.
[16] D. Stoller, S. Durand, and S. Ewert, "End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019. IEEE, 2019, pp. 181–185.
[17] F. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, "Open-Unmix - A reference implementation for music source separation," J. Open Source Softw., vol. 4, no. 41, p. 1667, 2019.
[18] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. J. Moreno, E. Weinstein, and K. Rao, "Multilingual speech recognition with a single end-to-end model," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018. IEEE, 2018, pp. 4904–4908.
[19] A. Vaglio, R. Hennequin, M. Moussallam, and G. Richard, "The words remain the same: Cover detection with lyrics transcription," in Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7-12, 2021, 2021, pp. 714–721.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008.
[21] J.-Y. Wang, C.-I. Leong, Y.-C. Lin, L. Su, and J.-S. R. Jang, "Adapting pretrained speech model for Mandarin lyrics transcription and alignment," in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.
[22] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[23] S. Zhou, S. Xu, and B. Xu, "Multilingual end-to-end speech recognition with a single transformer on low-resource languages," arXiv, vol. abs/1806.05059, 2018.
[24] L. Zhuo, R. Yuan, J. Pan, Y. Ma, Y. Li, G. Zhang, S. Liu, R. B. Dannenberg, J. Fu, C. Lin, E. Benetos, W. Chen, W. Xue, and Y. Guo, "LyricWhiz: Robust multilingual zero-shot lyrics transcription by whispering to ChatGPT," in Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR 2023, Milan, Italy, November 5-9, 2023, 2023, pp. 343–351.
