
Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model


Jiawen Huang and Emmanouil Benetos
Centre for Digital Music, Queen Mary University of London, London, UK
{[Link], [Link]}@[Link]

Abstract—Multilingual automatic lyrics transcription (ALT) is a challenging task due to the limited availability of labelled data and the challenges introduced by singing, compared to multilingual automatic speech recognition. Although some multilingual singing datasets have been released recently, English continues to dominate these collections. Multilingual ALT remains under-explored due to the scale of data and annotation quality. In this paper, we aim to create a multilingual ALT system with available datasets. Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario by expanding the target vocabulary set. We then evaluate the performance of the multilingual model in comparison to its monolingual counterparts. Additionally, we explore various conditioning methods to incorporate language information into the model. We analyse performance by language and relate it to the model's language classification performance. Our findings reveal that the multilingual model performs consistently better than the monolingual models trained on the language subsets. Furthermore, we demonstrate that incorporating language information significantly enhances performance.

Index Terms—automatic lyrics transcription, multilingual, singing voice, music information retrieval

I. INTRODUCTION

Automatic lyrics transcription (ALT) is the task of recognising lyrics from singing voice. Access to lyrics enriches the listening experience by arousing sympathy and building a deeper connection between the music and the listeners. Moreover, lyrics transcription can benefit other music analysis tasks as well, including lyrics alignment [4], singing pronunciation analysis [3], and cover song identification [19]. While ALT shares the same input/output format and similar objectives with automatic speech recognition (ASR), it is more challenging due to larger variations in rhythm, pitch, and pronunciation [6], [11]. In recent years, significant progress has been achieved in ALT for English songs using end-to-end models [7], [13], [16]. Stoller et al. [16] developed the first end-to-end lyrics alignment model using the Wave-U-Net architecture and connectionist temporal classification (CTC) loss [8]. Gao et al. [7] enhanced performance by fine-tuning the ALT model in conjunction with a source separation frontend. Ou et al. [13] leveraged wav2vec2 features and applied transfer learning techniques, resulting in a significant performance boost.

Multilingual lyrics transcription, however, remains under-explored due to the limited publicly available training and evaluation data. Although DALI v2 [12] is a singing dataset of moderate size with lyrics annotations, it is dominated by English songs, which comprise over 80% of the dataset. Whisper [14], a robust ASR model introduced by OpenAI and trained on numerous audio-transcript pairs collected from the Internet (with training data that remains unreleased), has demonstrated its effectiveness in multilingual ALT [24]. Building on this work, Wang et al. [21] investigated the potential of adapting it to Mandarin Chinese ALT. A multilingual ALT dataset called MulJam was created by post-processing Whisper's output [24]. With access to these datasets, we develop multilingual ALT models using publicly available data by combining DALI and MulJam.

Compared to English ALT, multilingual ALT faces several additional challenges. Firstly, unless explicitly specified, models must implicitly identify the underlying language of the singing to ensure that the predicted lyrics match the correct character set. Secondly, there is a language imbalance in the datasets: languages like English typically dominate the majority of songs, while other languages are considered low-resource in the context of ALT development. Thirdly, specific characters appear in different languages' alphabets but adhere to different pronunciation rules, adding complexity to the problem.

The multilingual ASR task shares many of the challenges mentioned above. Previous research has extensively studied and compared mono-, cross-, and multilingual models using a single model [9], [18]. These works demonstrate that multilingual models tend to perform better than their mono-/cross-lingual counterparts, particularly in low-resource settings. Furthermore, some studies observe performance gains by incorporating language information through conditioning [18] or predicting language-specific tokens [23]. More recent research takes advantage of unlabeled data through self-supervised learning, such as wav2vec2 [1], [2].

In this work, we aim to investigate the development of multilingual ALT models using publicly accessible data, building on existing work specifically designed for English ALT. We tackle the low-resource aspects of ALT model development by training jointly on data from across a wide range of languages. Additionally, we study the impact of language information via conditioning and multi-task learning. Our contributions can be summarized as follows. Firstly, as one of the first attempts towards multilingual ALT, we propose a small-scale multilingual ALT model trained on publicly available datasets. Secondly, we compare the performance of multilingual models with their monolingual counterparts, revealing that training data in additional languages benefits ALT in low-resource scenarios. Thirdly, we show that language conditioning has a positive impact on performance, while the amount of improvement varies across different languages.

JH is a research student at the UKRI Centre for Doctoral Training in Artificial Intelligence and Music, supported jointly by UK Research and Innovation [grant number EP/S022694/1] and Queen Mary University of London. EB is supported by RAEng/Leverhulme Trust Research Fellowship LTRF2223-19-106.



(a) The multilingual and monolingual models. (b) The language-informed model. (c) The language self-conditioned model.

Fig. 1: Proposed model architectures at training. Our models consist of a convolutional block CNNBlock, a transformer encoder TfmEnc, a transformer decoder TfmDec, and several fully connected layers FC_ctc, FC_s2s, and FC_lang. The dotted lines indicate feature concatenation (with mapping for 1c). In 1b, emb denotes the language embedding.

We acknowledge that previous research in multilingual ASR has explored similar approaches, as mentioned above. However, ALT diverges significantly due to the different acoustic characteristics of the singing voice, the severe shortage of resources, and language imbalances. Therefore, fundamental assumptions need to be verified carefully. Our results indicate that this work provides a solid baseline for future research, as well as an initial step in addressing these new challenges.
II. METHOD

A. Model

Our models are built upon a similar architecture to the state-of-the-art transformer models [7], utilising the hybrid CTC/Attention architecture [22]. Fig. 1a illustrates this architecture at the training stage. The input is an 80-dim Mel-spectrogram computed at a sampling rate of 16 kHz, with an FFT size of 400 and a hop size of 10 ms.
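As a concrete illustration of this input representation, the sketch below computes an equivalent 80-bin Mel-spectrogram with torchaudio; the use of torchaudio and the file name are assumptions, since the authors' features come from their SpeechBrain recipe.

```python
# Minimal sketch of the input features described above: an 80-bin
# Mel-spectrogram at 16 kHz with an FFT size of 400 and a 10 ms hop
# (160 samples). torchaudio is used for illustration only.
import torchaudio

mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=400,
    hop_length=160,   # 10 ms at 16 kHz
    n_mels=80,
)

# "line_segment.wav" is a hypothetical 16 kHz mono line-level segment.
waveform, sample_rate = torchaudio.load("line_segment.wav")
mel = mel_transform(waveform)   # shape: (channels, 80, num_frames)
```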
The model consists of a convolutional block, a transformer encoder, a transformer decoder, and two fully connected layers. The target dimension of the two fully connected layers is equal to the size of the target character set N. Let y represent the target lyrics token list, y^bos denote y with a <bos> token added at the beginning, and y^eos indicate y with a <eos> token appended at the end. During training, the Mel-spectrogram is processed through the convolutional block before being passed to the transformer encoder and decoder. Teacher forcing is adopted for faster convergence (the model is provided with the ground-truth tokens y^bos and predicts the next tokens y^eos):

    feat = CNNBlock(mel)                                        (1)
    h = TfmEnc(feat)                                            (2)
    o = TfmDec(h, y^bos)                                        (3)

where h and o are the outputs of the encoder and the decoder, respectively. Then h and o are passed to the two fully connected layers FC_ctc and FC_s2s, which generate the posteriorgrams for the CTC branch and the sequence-to-sequence (seq2seq) branch:

    p_ctc = FC_ctc(h)                                           (4)
    p_s2s = FC_s2s(o)                                           (5)

The loss function is a weighted sum of two components: the CTC loss for alignment-free training (computed from the CTC branch) and the Kullback-Leibler (KL) divergence loss for smooth predictions (computed from the seq2seq branch):

    Loss = α L_ctc(p_ctc, y) + (1 − α) L_s2s(p_s2s, y^eos)      (6)
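The following minimal PyTorch sketch shows how Eqs. (1)-(6) fit together. It is not the authors' SpeechBrain implementation: the convolutional frontend shapes, the projection to the model dimension, and the label-smoothed cross-entropy used in place of the KL-divergence seq2seq loss are illustrative assumptions.

```python
# Sketch of the hybrid CTC/attention model and joint loss of Eqs. (1)-(6).
# Shapes and the label-smoothing approximation of the KL loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridCTCAttention(nn.Module):
    def __init__(self, n_chars, d_model=512, n_heads=4, ff_dim=2048,
                 enc_layers=12, dec_layers=6):
        super().__init__()
        # CNNBlock: strided convolutions that downsample the Mel-spectrogram in time.
        # Positional encoding is omitted here for brevity.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=1, stride=1), nn.ReLU(),
        )
        self.proj = nn.Linear(64 * 20, d_model)  # 80 Mel bins -> 20 after striding
        enc = nn.TransformerEncoderLayer(d_model, n_heads, ff_dim, batch_first=True)
        dec = nn.TransformerDecoderLayer(d_model, n_heads, ff_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, enc_layers)      # TfmEnc
        self.decoder = nn.TransformerDecoder(dec, dec_layers)      # TfmDec
        self.char_emb = nn.Embedding(n_chars, d_model)
        self.fc_ctc = nn.Linear(d_model, n_chars)                  # FC_ctc
        self.fc_s2s = nn.Linear(d_model, n_chars)                  # FC_s2s

    def forward(self, mel, y_bos):
        # mel: (batch, frames, 80); y_bos: (batch, length) starting with <bos>
        x = self.cnn(mel.unsqueeze(1))                             # Eq. (1)
        b, c, t, f = x.shape
        feat = self.proj(x.permute(0, 2, 1, 3).reshape(b, t, c * f))
        h = self.encoder(feat)                                     # Eq. (2)
        length = y_bos.size(1)
        causal = torch.triu(torch.full((length, length), float("-inf"),
                                       device=y_bos.device), diagonal=1)
        o = self.decoder(self.char_emb(y_bos), h, tgt_mask=causal) # Eq. (3)
        return self.fc_ctc(h), self.fc_s2s(o)                      # Eqs. (4)-(5)

def joint_loss(p_ctc, p_s2s, y, y_eos, feat_lens, y_lens, alpha=0.3, blank=0):
    # Eq. (6): alpha * CTC loss + (1 - alpha) * seq2seq loss. The KL loss against
    # smoothed targets is approximated here by label-smoothed cross-entropy.
    # feat_lens are the encoder frame lengths after the CNN downsampling.
    ctc = F.ctc_loss(p_ctc.log_softmax(-1).transpose(0, 1), y,
                     feat_lens, y_lens, blank=blank, zero_infinity=True)
    s2s = F.cross_entropy(p_s2s.transpose(1, 2), y_eos, label_smoothing=0.1)
    return alpha * ctc + (1 - alpha) * s2s
```

At inference time, the decoder is instead run autoregressively with beam search (see Section III-C).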
B. Multilingual model and monolingual models

Let there be M languages in the training set {L_1, ..., L_M}, where C_i represents the character set for language L_i. In the multilingual setting, the target character set C is formed by taking the union of all independent character sets, C = C_1 ∪ ... ∪ C_M. In addition, we train individual monolingual models for each language to provide a comparative analysis with the multilingual models. The corresponding training and validation sets are the language-specific subsets derived from the multilingual data.
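The union vocabulary can be sketched as follows; the helper name and the special-token ordering are illustrative, with the special tokens taken from Fig. 2.

```python
# Sketch of forming the multilingual target character set C as the union of
# the per-language character sets C_1, ..., C_M. Special tokens follow Fig. 2.
def build_multilingual_vocab(lyrics_by_language):
    special = ["<eps>", "<unk>", "<bos>", "<eos>"]   # <eps> is the CTC blank
    charset = set()
    for lines in lyrics_by_language.values():        # C = C_1 ∪ ... ∪ C_M
        for line in lines:
            charset.update(line)
    return {token: idx for idx, token in enumerate(special + sorted(charset))}

vocab = build_multilingual_vocab({
    "English": ["never gonna give you up"],
    "German":  ["über den wolken"],
    "Russian": ["миллион алых роз"],
})
print(len(vocab))   # toy example; the full 6-language union in the paper has 91 characters
```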
C. Language-informed models

We condition our multilingual model with language information to study its influence. By providing the model with knowledge of the target language, it has the potential to learn language-specific features through the encoder and to predict characters belonging to the target language's alphabet through the decoder.

To be more specific, an embedding is assigned to each language (Fig. 1b). During training, we explore three approaches: appending the language embedding to the input of the encoder (feat), to the input of the decoder (h), and to both. The three conditioned models are denoted as Enc-Cond, Dec-Cond, and EncDec-Cond, respectively.
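A minimal sketch of this conditioning mechanism is given below. The concatenation of a broadcast language embedding followed by a linear projection back to the model dimension is an assumption; the paper only specifies that a learned embedding of size 5 is appended via feature concatenation.

```python
# Sketch of language-informed conditioning (Section II-C). A learned 5-dim
# language embedding is concatenated to every frame of the conditioned input
# and projected back to the model dimension (the projection is an assumption).
import torch
import torch.nn as nn

class LanguageConditioning(nn.Module):
    def __init__(self, n_languages=6, emb_dim=5, d_model=512):
        super().__init__()
        self.lang_emb = nn.Embedding(n_languages, emb_dim)
        self.proj = nn.Linear(d_model + emb_dim, d_model)

    def forward(self, frames, lang_id):
        # frames: (batch, time, d_model); lang_id: (batch,)
        emb = self.lang_emb(lang_id)                           # (batch, emb_dim)
        emb = emb.unsqueeze(1).expand(-1, frames.size(1), -1)  # repeat over time
        return self.proj(torch.cat([frames, emb], dim=-1))

# Enc-Cond applies this to feat before TfmEnc, Dec-Cond to h before TfmDec,
# and EncDec-Cond to both (with separate conditioning modules).
```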

          Train                Valid      Test
          DALI      MulJam     MulJam     Jamendo
English   295444    85773      542        868
French    11959     33322      760        809
Spanish   10317     15146      566        881
German    26343     3208       710        871
Italian   9164      7807       616        0
Russian   0         1805       317        0

TABLE I: The number of utterances in the training, validation, and test sets in each language.

Fig. 2: The multilingual vocabulary. <bos> and <eos> denote the beginning and the end of a line. <unk> is the unknown token. Epsilon ε is included for the CTC computation.

D. Language self-conditioned model

To gain deeper insights into the model's capability to identify the correct language, we make the language identification ability measurable by taking a multi-task learning approach (Fig. 1c). The output of the encoder is averaged over time and passed to a fully connected layer FC_lang to predict the language ID. The predicted language probability p_l is used as a self-conditioning vector, mapped to the embedding dimension, and appended to the input of the decoder. In this configuration, a cross-entropy loss term for language identification is added to the overall loss function, where l is the language label:

    Loss = α L_ctc(p_ctc, y) + (1 − α) L_s2s(p_s2s, y^eos) + β L_CE(p_l, l)      (7)
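A sketch of this self-conditioning branch and of the loss in Eq. (7) follows. The softmax over the language logits and the concatenation-plus-projection step are assumptions; the paper specifies only the time-averaged encoder output, the FC_lang layer, the mapping to the embedding dimension, and the added cross-entropy term.

```python
# Sketch of the language self-conditioned branch and the loss of Eq. (7).
# Layer shapes and the softmax over language logits are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageSelfConditioning(nn.Module):
    def __init__(self, n_languages=6, emb_dim=5, d_model=512):
        super().__init__()
        self.fc_lang = nn.Linear(d_model, n_languages)   # FC_lang
        self.to_emb = nn.Linear(n_languages, emb_dim)    # posterior -> embedding dim
        self.proj = nn.Linear(d_model + emb_dim, d_model)

    def forward(self, h, dec_in):
        # h: encoder output (batch, time, d_model); dec_in: decoder input (batch, len, d_model)
        p_lang = self.fc_lang(h.mean(dim=1))             # language logits from pooled encoder
        cond = self.to_emb(p_lang.softmax(-1))           # self-conditioning vector
        cond = cond.unsqueeze(1).expand(-1, dec_in.size(1), -1)
        dec_in = self.proj(torch.cat([dec_in, cond], dim=-1))
        return dec_in, p_lang

def total_loss(ctc_loss, s2s_loss, p_lang, lang_label, alpha=0.3, beta=0.1):
    # Eq. (7): joint ALT loss plus a cross-entropy term for language identification.
    return alpha * ctc_loss + (1 - alpha) * s2s_loss + beta * F.cross_entropy(p_lang, lang_label)
```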
III. EXPERIMENTS

A. Datasets

The models are trained on the DALI v2 [12] and MulJam [24] datasets. DALI v2 contains 7756 songs in more than 30 languages, among which we take the 5 languages that have more than 200 songs each: English, French, German, Spanish, and Italian. We segment the songs to line level, with paired lyrics annotations. MulJam contains 6031 songs with line-level lyrics annotations in 6 languages: English, French, German, Spanish, Italian, and Russian. For each language, 20 songs are randomly selected for validation.

The training set is a combination of DALI-train and MulJam-train. It is important to note that the lyrics annotations in DALI do not include accented or special characters; instead, they are converted to the Latin alphabet. Therefore, the validation and test sets of DALI v2 may not represent the real multilingual problem, and we exclusively use the MulJam validation set for validation purposes. For the training and validation sets, we apply additional filtering to exclude utterances with incorrectly annotated durations, excessively long durations (>30 s), and abnormally high character rates (>37.5 Hz). All utterances are source-separated by Open-Unmix [17].

All models are evaluated on the MultiLang Jamendo dataset [5] at line level. It consists of 80 songs in 4 languages: English, French, Spanish, and German. Line-level segments are prepared according to the line-level timestamps provided by the dataset. Tab. I shows the statistics of the data for the multilingual and monolingual experiments.¹

¹ The MulJam test set is not used for evaluation because 1) it is not language-balanced, 2) it is too small for low-resource languages, and 3) lyrics annotation is provided at song level.
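The filtering rules described above amount to a simple per-utterance predicate; the field names and the strictness of the comparisons in this sketch are assumptions.

```python
# Sketch of the training/validation filtering: drop utterances with invalid
# annotated durations, durations over 30 s, or character rates above 37.5 Hz.
def keep_utterance(lyrics: str, start: float, end: float,
                   max_dur: float = 30.0, max_char_rate: float = 37.5) -> bool:
    duration = end - start
    if duration <= 0:                             # incorrectly annotated duration
        return False
    if duration > max_dur:                        # excessively long segment
        return False
    if len(lyrics) / duration > max_char_rate:    # abnormally high character rate
        return False
    return True

print(keep_utterance("never gonna give you up", 12.3, 18.9))   # True
print(keep_utterance("x" * 500, 0.0, 5.0))                     # False (100 chars/s)
```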
B. Model Configuration

The convolutional block contains 3 CNN blocks with 64 channels. The first two layers have a kernel size of 5 and a stride of 2, while the last layer has both the kernel size and stride set to 1. Positional encoding [20] is added to the transformer input before it passes through the encoder. The transformer encoder has 12 layers and the transformer decoder has 6 layers. Each encoder layer consists of a multi-head attention layer and a position-wise feed-forward layer; each decoder layer contains the same, except that the attention is causal. The attention dimension is set to 512, the number of heads is 4, and the position-wise feed-forward layer dimension is 2048.

The loss weighting parameter α is set to 0.3, and β is set to 0.1. The language embedding for the language-informed and language self-conditioned models has a fixed size of 5 for all 6 languages. The union character set, encompassing all 6 languages, has a size of 91. This includes the Latin alphabet, accented and special characters, and the Cyrillic alphabet for Russian.

C. Training and Inference

Our models are built upon the SpeechBrain [15] transformer recipe for ASR.²,³ The number of languages M is 6. The models are trained using the Adam optimizer [10] and the Noam learning rate scheduler [20]. The initial learning rate is 0.001 and the number of warm-up steps is 25000. The number of epochs is 50 for all models, except for the non-English monolingual ones, which are trained for 70 epochs. The checkpoint with the lowest word error rate on the validation set is selected. During validation and testing, beam search is employed on the transformer decoder to select the best prediction autoregressively. The beam size is 10 at validation and 66 at testing. We use Word Error Rate (WER) to assess the performance of ALT models.

² [Link]LibriSpeech/ASR/transformer/hparams/[Link]
³ Our code is available at: [Link]
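For reference, WER is the word-level edit distance between the reference and the predicted lyrics, divided by the number of reference words. A minimal sketch is shown below; it is not the scoring code of the authors' SpeechBrain-based recipe.

```python
# Minimal WER sketch: Levenshtein distance over words, divided by the number
# of reference words.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("hello darkness my old friend",
                      "hello darkness my friend"))   # 0.2 (one deletion)
```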

          Transformer     W2V2       Whisper
          Multilingual    XLSR-53    large-v3
English   51.45           42.67      36.80
French    68.40           54.74      49.33
Spanish   68.02           45.02      41.15
German    70.18           49.29      44.52
All       64.31           47.95      42.95

TABLE II: WER (%) of the multilingual transformer, the wav2vec2-based multilingual model, and Whisper.

          Monolingual    Multilingual    Self-condition
English   53.19          51.45           50.99
French    76.06          68.40           70.26
Spanish   79.35          68.02           68.17
German    82.32          70.18           67.48
All       72.37          64.31           64.07

TABLE III: WER (%) of monolingual, multilingual, and language self-conditioned models.

          Multilingual    Enc-Cond    Dec-Cond    EncDec-Cond
English   51.45           51.19       50.61       50.80
French    68.40           65.01       67.33       65.22
Spanish   68.02           62.27       65.44       61.38
German    70.18           63.23       63.95       62.07
All       64.31           60.32       61.71       59.79

TABLE IV: WER (%) of multilingual and language-informed models.

D. Models for comparison

For our core experiments, we intentionally avoid using ALT models or feature extractors pretrained on speech data, such as wav2vec2 and Whisper, although we are aware that incorporating these could benefit performance. This is because using speech models would introduce the influence of the pretrained models' data distribution, making it difficult to estimate whether any difference is due to training on data from more languages or to the knowledge acquired during pretraining. For similar reasons, we avoid using language models.

We still report the WER of a multilingual wav2vec2 (W2V2) variant (large-xlsr-53 [2]) as a reference for the WER range that an adapted state-of-the-art English ALT method [13] can achieve. The W2V2 ALT model uses a wav2vec2 feature extractor frontend and a hybrid CTC/attention backend, similar to the proposed multilingual transformer. The CTC branch is a fully connected layer, while the seq2seq branch is a one-layer recurrent neural network with attention. The attention dimension is set to 256 and the hidden size is also 256. The target character set is the multilingual vocabulary C. All other configurations are the same as in [20].

IV. RESULTS

A. Comparison with the State of the Art

Tab. II lists the performance of our multilingual model, the wav2vec2-based model, and Whisper. As expected, the W2V2-based model outperforms our multilingual transformer, and Whisper performs best due to its training on diverse data from different environments and setups. Even when utilizing pretrained models, the WERs for non-English singing remain higher than that reported for the English monolingual model in [13], indicating the greater challenges of multilingual ALT.

B. Monolingual, multilingual, and language self-conditioning

Tab. III lists the performance of the monolingual, multilingual, and language self-conditioned models. Notably, among the monolingual models, the German model yields a worse WER than the Spanish one despite having more training data. This aligns with the nature of the two languages: Spanish is more phonetic and consistent in its spelling and pronunciation than German. Additionally, German uses compounding more frequently than Spanish, leading to more variation in pronunciation and stress patterns. This suggests that the challenge and data requirements for training ALT models vary across languages.

The multilingual model outperforms the monolingual models in every language, indicating that having more training data in various languages benefits low-resource language ALT. Specifically, while the improvement for English is small (∼2%), it exceeds 7% for all other languages. This suggests that leveraging high-resource language data (English) can be beneficial for low-resource ALT when the target languages exhibit similarities in pronunciation and spelling rules.

Compared to the multilingual model, the self-conditioned model is able to perform the additional language classification task without compromising ALT performance. It is not surprising that the auxiliary task does not bring significant improvements to ALT, as the language class can be rather easily inferred from the predicted lyrics.

C. Language-informed models and the language classification accuracy

Tab. IV shows the performance of the multilingual and language-informed models. Providing the language class as input results in an improved WER for all languages. Among the three conditioning methods, Enc-Cond performs better than Dec-Cond except for English. EncDec-Cond gives the best overall WER, but the trend varies for each language. After conditioning on both the encoder and the decoder, there is a clear improvement for non-English languages, while the English WER remains nearly the same as that of the multilingual model. The French monolingual model has a better WER than the Spanish and German ones, but this reverses after language conditioning. This indicates that, with sufficient data, French ALT might be more challenging than the other two due to its complexity and frequent silent letters.

To gain a deeper understanding of the distinctions among languages, we examine the language classification confusion matrix of the self-conditioned model in Fig. 3. As depicted, languages other than English are often misclassified as English. Additionally, Spanish singing is more commonly mistaken for Italian than for English. It is understandable that, for the dominant language in the training set, the classification accuracy for English is close to 100%, which explains why language conditioning has minimal impact on English ALT.

           English   French    Spanish   German    Italian   Russian
English    0.98      0.0046    0.0035    0.011     0.0023    0
French     0.15      0.8       0.0099    0.019     0.021     0
Spanish    0.087     0.0079    0.78      0.015     0.11      0.0011
German     0.15      0.023     0.018     0.8       0.008     0.0011

Fig. 3: Confusion matrix for the self-conditioned model (values are proportions; rows: reference language, columns: predicted language). Languages other than English frequently get confused with English.

V. CONCLUSION AND FUTURE WORK

In summary, our study addresses the challenges of multilingual ALT model development when training only on publicly available data, particularly in enhancing low-resource languages. We illustrate that multilingual ALT surpasses monolingual ALT for all languages, primarily due to the shared phonetic similarities among them. Our language-conditioned experiments indicate that incorporating language information enhances performance.

In this study, we take a straightforward approach by merging grapheme sets for multilingual vocabulary processing. In future research, we intend to explore more advanced strategies, including adapting target vocabularies using phoneme representations, subword units, and transliteration techniques. We believe that reducing target ambiguity is particularly critical for effective training in low-resource settings.

REFERENCES

[1] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
[2] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, "Unsupervised cross-lingual representation learning for speech recognition," in Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021. ISCA, 2021, pp. 2426–2430.
[3] E. Demirel, S. Ahlbäck, and S. Dixon, "Computational pronunciation analysis in sung utterances," in 29th European Signal Processing Conference, EUSIPCO 2021, Dublin, Ireland, August 23-27, 2021. IEEE, 2021, pp. 186–190.
[4] E. Demirel, S. Ahlbäck, and S. Dixon, "Low resource audio-to-lyrics alignment from polyphonic music recordings," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2021, Toronto, ON, Canada, June 6-11, 2021. IEEE, 2021, pp. 586–590.
[5] S. Durand, D. Stoller, and S. Ewert, "Contrastive learning-based audio to lyrics alignment for multiple languages," in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[6] H. Fujihara and M. Goto, "Lyrics-to-audio alignment and its application," in Multimodal Music Processing, ser. Dagstuhl Follow-Ups. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, Germany, 2012, vol. 3, pp. 23–36.
[7] X. Gao, C. Gupta, and H. Li, "PolyScriber: Integrated fine-tuning of extractor and lyrics transcriber for polyphonic music," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 31, pp. 1968–1981, 2023.
[8] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in Proc. ICML, vol. 148. ACM, 2006, pp. 369–376.
[9] G. Heigold, V. Vanhoucke, A. W. Senior, P. Nguyen, M. Ranzato, M. Devin, and J. Dean, "Multilingual acoustic models using distributed deep neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2013, Vancouver, BC, Canada, May 26-31, 2013. IEEE, 2013, pp. 8619–8623.
[10] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.
[11] A. M. Kruspe, "Keyword spotting in a-capella singing," in Proceedings of the 15th International Society for Music Information Retrieval Conference, ISMIR, Taipei, Taiwan, October 27-31, 2014, pp. 271–276.
[12] G. Meseguer-Brocal, A. Cohen-Hadria, and G. Peeters, "Creating DALI, a large dataset of synchronized audio, lyrics, and notes," Transactions of the International Society for Music Information Retrieval, vol. 3, no. 1, 2020.
[13] L. Ou, X. Gu, and Y. Wang, "Transfer learning of wav2vec 2.0 for automatic lyric transcription," in Proceedings of the 23rd International Society for Music Information Retrieval Conference, ISMIR 2022, Bengaluru, India, December 4-8, 2022, 2022, pp. 891–899.
[14] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 2023, pp. 28492–28518.
[15] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, J.-C. Chou, S.-L. Yeh, S.-W. Fu, C.-F. Liao, E. Rastorgueva, F. Grondin, W. Aris, H. Na, Y. Gao, R. D. Mori, and Y. Bengio, "SpeechBrain: A general-purpose speech toolkit," 2021, arXiv:2106.04624.
[16] D. Stoller, S. Durand, and S. Ewert, "End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2019, Brighton, United Kingdom, May 12-17, 2019. IEEE, 2019, pp. 181–185.
[17] F. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, "Open-Unmix - A reference implementation for music source separation," J. Open Source Softw., vol. 4, no. 41, p. 1667, 2019.
[18] S. Toshniwal, T. N. Sainath, R. J. Weiss, B. Li, P. J. Moreno, E. Weinstein, and K. Rao, "Multilingual speech recognition with a single end-to-end model," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018. IEEE, 2018, pp. 4904–4908.
[19] A. Vaglio, R. Hennequin, M. Moussallam, and G. Richard, "The words remain the same: Cover detection with lyrics transcription," in Proceedings of the 22nd International Society for Music Information Retrieval Conference, ISMIR 2021, Online, November 7-12, 2021, 2021, pp. 714–721.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 5998–6008.
[21] J.-Y. Wang, C.-I. Leong, Y.-C. Lin, L. Su, and J.-S. R. Jang, "Adapting pretrained speech model for Mandarin lyrics transcription and alignment," in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2023, pp. 1–8.
[22] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[23] S. Zhou, S. Xu, and B. Xu, "Multilingual end-to-end speech recognition with a single transformer on low-resource languages," arXiv, vol. abs/1806.05059, 2018.
[24] L. Zhuo, R. Yuan, J. Pan, Y. Ma, Y. Li, G. Zhang, S. Liu, R. B. Dannenberg, J. Fu, C. Lin, E. Benetos, W. Chen, W. Xue, and Y. Guo, "LyricWhiz: Robust multilingual zero-shot lyrics transcription by whispering to ChatGPT," in Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR 2023, Milan, Italy, November 5-9, 2023, 2023, pp. 343–351.
