2020
Audio textures are a superset of standard musical instrument timbres that include more complex sounds such as rain, wind, rolling, or scraping. With appropriate modeling strategies, textures can be synthesized under parametric control analogous to the way musical instruments are, and can then become powerful creativity tools for music making. However, audio textures, with complex structure spanning multiple time scales, are a challenge to model and generate synthetically. They are even challenging to define. Deep learning approaches offer new ways to develop generative audio texture models, and they create different demands on training data than traditional modeling approaches. In this paper we briefly review previous modeling approaches and attempt to rationalize and converge on a definition of textures using modeling concepts. We introduce a new and growing data set, along with a system for managing metadata specifically designed for audio textures. Finally, we report on some re...
ArXiv, 2019
Capturing high-level structure in audio waveforms is challenging because a single second of audio spans tens of thousands of timesteps. While long-range dependencies are difficult to model directly in the time domain, we show that they can be more tractably modelled in two-dimensional time-frequency representations such as spectrograms. By leveraging this representational advantage, in conjunction with a highly expressive probabilistic model and a multiscale generation procedure, we design a model capable of generating high-fidelity audio samples which capture structure at timescales that time-domain models have yet to achieve. We apply our model to a variety of audio generation tasks, including unconditional speech generation, music generation, and text-to-speech synthesis---showing improvements over previous approaches in both density estimates and human judgments.
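As a rough illustration of the representational point made above, the sketch below (not the authors' code) converts a waveform into a log-scaled mel spectrogram with librosa; the sample rate, mel-band count, and hop length are placeholder assumptions.

```python
# Minimal sketch: mapping a waveform to a two-dimensional time-frequency
# representation, the kind of input spectrogram-domain generative models use.
# Assumes librosa is installed; parameter values are illustrative only.
import numpy as np
import librosa

def to_log_mel(path, sr=16000, n_mels=80, hop_length=256):
    """Load audio and return a log-scaled mel spectrogram (n_mels x frames)."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    return librosa.power_to_db(mel, ref=np.max)

# One second of 16 kHz audio (16,000 samples) becomes only ~63 spectrogram
# frames here, which is why long-range structure is easier to model in this domain.
```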
ArXiv, 2018
We describe a system based on deep learning that generates drum patterns in the electronic dance music domain. Experimental results reveal that generated patterns can be employed to produce musically sound and creative transitions between different genres, and that the process of generation is of interest to practitioners in the field.
arXiv (Cornell University), 2023
Research on Deep Learning applications in sound and music computing has gathered interest in recent years; however, there is still a missing link between these new technologies and how they can be incorporated into real-world artistic practices. In this work, we explore a well-known Deep Learning architecture called Variational Autoencoders (VAEs). These architectures have been used in many areas for generating latent spaces in which data points are organized so that similar data points are located closer to each other. Previously, VAEs have been used for generating latent timbre spaces or latent spaces of symbolic music excerpts. Applying VAEs to audio features of timbre requires a vocoder to transform the timbre generated by the network into an audio signal, which is computationally expensive. In this work, we apply VAEs to raw audio data directly, bypassing audio feature extraction. This approach allows practitioners to use any audio recording while giving flexibility and control over the aesthetics through dataset curation. The lower computation time in audio signal generation allows the raw-audio approach to be incorporated into real-time applications. In this work, we propose three strategies to explore latent spaces of audio and timbre for sound design applications. By doing so, our aim is to initiate a conversation on artistic approaches and strategies to utilize latent audio spaces in sound and music practices.
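As a loose illustration of the raw-audio approach described above, here is a minimal VAE sketch in PyTorch operating on fixed-length audio frames; the frame length, layer sizes, and loss weighting are assumptions, not the paper's architecture.

```python
# Illustrative sketch only: a small VAE over fixed-length raw-audio frames,
# bypassing spectral feature extraction. Sizes are placeholder assumptions.
import torch
import torch.nn as nn

class RawAudioVAE(nn.Module):
    def __init__(self, frame_len=1024, latent_dim=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(frame_len, 256), nn.ReLU())
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                 nn.Linear(256, frame_len), nn.Tanh())

    def forward(self, x):                      # x: (batch, frame_len), values in [-1, 1]
        h = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(z), mu, logvar

def vae_loss(x, x_hat, mu, logvar):
    recon = nn.functional.mse_loss(x_hat, x, reduction="mean")
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```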
Neural Computing and Applications, 2018
In addition to traditional tasks such as prediction, classification and translation, deep learning is receiving growing attention as an approach for music generation, as witnessed by recent research groups such as Magenta at Google and CTRL (Creator Technology Research Lab) at Spotify. The motivation is in using the capacity of deep learning architectures and training techniques to automatically learn musical styles from arbitrary musical corpora and then to generate samples from the estimated distribution. However, a direct application of deep learning to generate content rapidly reaches limits, as the generated content tends to mimic the training set without exhibiting true creativity. Moreover, deep learning architectures do not offer direct ways of controlling generation (e.g., imposing some tonality or other arbitrary constraints). Furthermore, deep learning architectures alone are autistic automata which generate music autonomously without human user interaction, far from the objective of interactively assisting musicians to compose and refine music. Issues such as control, structure, creativity and interactivity are the focus of our analysis. In this paper, we select some limitations of a direct application of deep learning to music generation, analyze why these issues are not addressed, and discuss possible approaches to address them. Various recent systems are cited as examples of promising directions.
2011
The synthesis of sound textures, such as rain, wind, or crowds, is an important application for cinema, multimedia creation, games and installations. However, despite the clearly defined requirements of naturalness and flexibility, no automatic method has yet found widespread use. After clarifying the definition, terminology, and usages of sound texture synthesis, we give an overview of the many existing methods and approaches, and the few available software implementations, and classify them by the synthesis model they are based on, such as subtractive or additive synthesis, granular synthesis, corpus-based concatenative synthesis, wavelets, or physical modeling. Additionally, an overview is given of analysis methods used for sound texture synthesis, such as segmentation, statistical modeling, timbral analysis, and modeling of transitions.
2019
We are interested in using deep learning models to generate new music. Using the Maestro Dataset, we will use an LSTM architecture that inputs tokenized MIDI files and outputs predictions for notes. Our accuracy will be measured by taking predicted notes and comparing them to ground truths. Using A.I. for music is a relatively new area of study, and this project provides an investigation into creating an effective model for the music industry.
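A minimal sketch of the kind of next-note LSTM described above is given below; the vocabulary size, dimensions, and tensor shapes are placeholder assumptions, not values from the project.

```python
# Hedged sketch of the general idea: an LSTM that takes tokenized MIDI events
# and predicts the next token at each step. All sizes are placeholders.
import torch
import torch.nn as nn

class NextNoteLSTM(nn.Module):
    def __init__(self, vocab_size=388, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):                 # tokens: (batch, seq_len) int64
        out, _ = self.lstm(self.embed(tokens))
        return self.head(out)                  # next-token logits at every position

# Accuracy as described: compare the argmax of the final-step logits against
# the ground-truth next note.
model = NextNoteLSTM()
tokens = torch.randint(0, 388, (8, 64))        # a batch of dummy token sequences
pred = model(tokens)[:, -1].argmax(dim=-1)
```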
International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2021
Advancements in deep neural networks have made it possible to compose music that mimics music composition by humans. The capacity of deep learning architectures to learn musical style from arbitrary musical corpora is explored in this paper. The paper proposes a method for generating music from the estimated distribution. Musical chords have been extracted for various instruments to train a sequential model to generate polyphonic music on selected instruments. We demonstrate a simple method comprising sequential LSTM models to generate polyphonic music. The results of the model evaluation show that the generated music is pleasant to hear and is similar to music played by humans. This has great application in the entertainment industry, enabling music composers to generate a variety of creative music.
arXiv (Cornell University), 2020
We present the Latent Timbre Synthesis (LTS), a new audio synthesis method using Deep Learning. The synthesis method allows composers and sound designers to interpolate and extrapolate between the timbre of multiple sounds using the latent space of audio frames. We provide the details of two Variational Autoencoder architectures for LTS, and compare their advantages and drawbacks. The implementation includes a fully working application with graphical user interface, called interpolate two, which enables practitioners to explore the timbre between two audio excerpts of their selection using interpolation and extrapolation in the latent space of audio frames. Our implementation is open-source, and we aim to improve the accessibility of this technology by providing a guide for users with any technical background.
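The core latent-space operation behind interpolation and extrapolation can be sketched in a few lines; the function below is an illustrative blend, not the released interpolate two implementation.

```python
# Minimal sketch of the core latent operation: blend two latent vectors.
# alpha in [0, 1] interpolates between the two timbres; alpha < 0 or > 1
# extrapolates beyond them. The vectors here are dummy examples.
import numpy as np

def blend_latents(z_a, z_b, alpha):
    """Linear interpolation/extrapolation between two latent frames."""
    z_a, z_b = np.asarray(z_a), np.asarray(z_b)
    return (1.0 - alpha) * z_a + alpha * z_b

z_mid = blend_latents(z_a=[0.1, -0.3], z_b=[0.5, 0.2], alpha=0.5)  # halfway
z_ext = blend_latents(z_a=[0.1, -0.3], z_b=[0.5, 0.2], alpha=1.2)  # 20% beyond
```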
Journal of The Audio Engineering Society, 2018
In this work, we propose a deep learning based method, namely variational convolutional recurrent autoencoders (VCRAE), for musical instrument synthesis. This method utilizes the higher-level time-frequency representations extracted by the convolutional and recurrent layers to learn a Gaussian distribution in the training stage, which is later used to infer unique samples through interpolation of multiple instruments in the usage stage. The reconstruction performance of VCRAE is evaluated by proxy through an instrument classifier, and provides significantly better accuracy than two other baseline autoencoder methods. The synthesized samples for the combinations of 15 different instruments are available on the companion website.
Zenodo (CERN European Organization for Nuclear Research), 2023
This paper introduces a new course on deep learning with audio, designed specifically for graduate students in arts studies. The course introduces students to the principles of deep learning models in the audio and symbolic domains, as well as their possible applications in music composition and production. The course covers topics such as data preparation and processing, neural network architectures, and the training and application of deep learning models in music-related tasks. The course also incorporates hands-on exercises and projects, allowing students to apply the concepts learned in class to real-world audio data. In addition, the course introduces a novel approach to integrating audio generation using deep learning models into the Pure Data real-time audio synthesis environment, which enables students to create original and expressive audio content in a programming environment with which they are more familiar. The variety of the audio content produced by the students demonstrates the effectiveness of the course in fostering a creative approach to their own music productions. Overall, this new course on deep learning with audio represents a significant contribution to the field of artificial intelligence (AI) music and creativity, providing arts graduate students with the necessary skills and knowledge to tackle the challenges of rapidly evolving AI music technologies.
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Chinese. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
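The causal, dilated convolution at the heart of such autoregressive waveform models can be sketched as follows; this is a simplified illustration, not DeepMind's implementation.

```python
# Sketch of the key building block: a causal, dilated 1-D convolution, so each
# output sample depends only on current and past inputs. Stacking layers with
# growing dilation widens the receptive field exponentially. Sizes are illustrative.
import torch
import torch.nn as nn

class CausalDilatedConv1d(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left padding only
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (batch, channels, time)
        x = nn.functional.pad(x, (self.pad, 0))          # no future context leaks in
        return self.conv(x)

stack = nn.Sequential(*[CausalDilatedConv1d(32, dilation=2 ** i) for i in range(8)])
y = stack(torch.randn(1, 32, 16000))                     # one second at 16 kHz
```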
Cornell University - arXiv, 2017
We propose a recurrent variational auto-encoder for texture synthesis. A novel loss function, FLTBNK, is used for training the texture synthesizer. It is a rotation- and partially color-invariant loss function. Unlike L2 loss, FLTBNK explicitly models the correlation of color intensity between pixels. Our texture synthesizer generates neighboring tiles to expand a sample texture and is evaluated using various texture patterns from the Describable Textures Dataset (DTD). We perform both quantitative and qualitative experiments with various loss functions to evaluate the performance of our proposed loss function (FLTBNK); a mini human-subject study is used for the qualitative evaluation.
International Journal of Computer Applications, 2017
Convolutional neural networks have recently become extremely popular in various deep learning applications. One such application is style transfer for images. Following this trend, this paper explores how this technique can be applied to audio data. The technique discussed involves combining the content features of one audio sample with the style features of another audio sample. The results produced show how a Convolutional Neural Network can be used to extract features from audio signals. The paper also discusses the various modifications made in the algorithm used for image style transfer in order to apply it to audio signals.
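A hedged sketch of the content and style (Gram-matrix) losses adapted from image style transfer is shown below; the feature extractor is a placeholder convolution over spectrogram frames, not a specific pre-trained network from the paper.

```python
# Hedged sketch of the loss formulation adapted from image style transfer:
# content loss matches feature maps directly, style loss matches their Gram
# matrices. The feature extractor and the spectrogram size are placeholders.
import torch
import torch.nn as nn

features = nn.Conv1d(in_channels=257, out_channels=64, kernel_size=5)  # placeholder extractor

def gram(f):                             # f: (batch, channels, frames)
    return f @ f.transpose(1, 2) / f.shape[-1]

def style_transfer_loss(x, content, style, style_weight=1e3):
    fx, fc, fs = features(x), features(content), features(style)
    content_loss = torch.mean((fx - fc) ** 2)
    style_loss = torch.mean((gram(fx) - gram(fs)) ** 2)
    return content_loss + style_weight * style_loss

# x is the spectrogram being optimized (like the image in the original algorithm);
# gradients flow back into x rather than into the network weights.
```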
2021 24th International Conference on Digital Audio Effects (DAFx)
This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures. We propose a deep neural network for generating waveforms, similar to wavenet [1]. This is fully probabilistic, auto-regressive, and causal, i.e. each sample generated depends on only the previously observed samples. Our approach outperforms a widely used wavenet architecture by up to 9% on a similar dataset for predicting the next step. Using the attention mechanism, we enable the architecture to learn which audio samples are important for the prediction of the future sample. We show how causal transformer generative models can be used for raw waveform synthesis. We also show that this performance can be improved by another 2% by conditioning samples over a wider context. The flexibility of the current model to synthesize audio from latent representations suggests a large number of potential applications. The novel approach of using generative transformer architectures for raw audio synthesis is, however, still far away from generating any meaningful music similar to wavenet, without using latent codes/meta-data to aid the generation process.
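The causality constraint described above can be illustrated with a standard attention mask; the sketch below is a generic example using PyTorch's built-in attention layer, not the paper's architecture.

```python
# Simplified illustration: a causal attention mask so that position t can only
# attend to positions <= t, matching the auto-regressive constraint described
# above. Sequence length and model width are placeholder assumptions.
import torch
import torch.nn as nn

seq_len, d_model = 256, 64
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
x = torch.randn(1, seq_len, d_model)            # embedded waveform frames (assumed)

# True entries mark positions that must NOT be attended to (the future).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
out, _ = attn(x, x, x, attn_mask=causal_mask)
```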
2019
Image style transfer networks are used to blend images, producing images that are a mix of source images. The process is based on controlled extraction of style and content aspects of images, using pre-trained Convolutional Neural Networks (CNNs). Our interest lies in adopting these image style transfer networks for the purpose of transforming sounds. Audio signals can be presented as grey-scale images of audio spectrograms. The purpose of our work is to investigate whether audio spectrogram inputs can be used with image neural transfer networks to produce new sounds. Using musical instrument sounds as source sounds, we apply and compare three existing image neural style transfer networks for the task of sound mixing. Our evaluation shows that all three networks are successful in producing consistent, new sounds based on the two source sounds. We use classification models to demonstrate that the new audio signals are consistent and distinguishable from the source instrument sounds. ...
2020
Music albums nowadays cannot be conceived without their artwork. Since it was first used, the importance of album artwork has changed. In our digital era, audiovisual content is everywhere, and album covers play an important role for music albums. Computer Vision has unleashed powerful technologies for image generation in the last decade, which have been used for many different applications. In particular, the main developments are Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). The latest research on these technologies has contributed to understanding and improving them, achieving high-quality and complex image generation. In this thesis, we experiment with the latest image generation tools to achieve album artwork generation based on audio samples. We first analyse image generation without audio conditioning for VAEs and three GAN approaches: vanilla GAN, Least Squares GAN (LSGAN) and Wasserstein GAN with gradient penalty (WGAN-GP). Finally, we try th...
Cornell University - arXiv, 2018
Sequential models achieve state-of-the-art results in audio, visual and textual domains with respect to both estimating the data distribution and generating high-quality samples. Efficient sampling for this class of models has however remained an elusive problem. With a focus on text-to-speech synthesis, we describe a set of general techniques for reducing sampling time while maintaining high output quality. We first describe a single-layer recurrent neural network, the WaveRNN, with a dual softmax layer that matches the quality of the state-of-the-art WaveNet model. The compact form of the network makes it possible to generate 24 kHz 16-bit audio 4× faster than real time on a GPU. Second, we apply a weight pruning technique to reduce the number of weights in the WaveRNN. We find that, for a constant number of parameters, large sparse networks perform better than small dense networks and this relationship holds for sparsity levels beyond 96%. The small number of weights in a Sparse WaveRNN makes it possible to sample high-fidelity audio on a mobile CPU in real time. Finally, we propose a new generation scheme based on subscaling that folds a long sequence into a batch of shorter sequences and allows one to generate multiple samples at once. The Subscale WaveRNN produces 16 samples per step without loss of quality and offers an orthogonal method for increasing sampling efficiency.
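Magnitude-based weight pruning, one of the techniques mentioned above, can be sketched generically as follows; the target sparsity and the one-shot (schedule-free) approach here are illustrative assumptions, not the paper's pruning schedule.

```python
# Rough sketch of magnitude-based pruning: zero out the smallest-magnitude
# weights until a target sparsity is reached. A generic illustration only.
import torch

def prune_to_sparsity(weight, sparsity=0.96):
    """Return a copy of `weight` with the smallest-magnitude entries set to zero."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))

w = torch.randn(1024, 1024)
w_sparse = prune_to_sparsity(w, 0.96)
print((w_sparse == 0).float().mean())   # approximately 0.96
```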
arXiv (Cornell University), 2017
We introduce a two-stream model for dynamic texture synthesis. Our model is based on pre-trained convolutional networks (ConvNets) that target two independent tasks: (i) object recognition, and (ii) optical flow prediction. Given an input dynamic texture, statistics of filter responses from the object recognition ConvNet encapsulate the per-frame appearance of the input texture, while statistics of filter responses from the optical flow ConvNet model its dynamics. To generate a novel texture, a randomly initialized input sequence is optimized to match the feature statistics from each stream of an example texture. Inspired by recent work on image style transfer and enabled by the two-stream model, we also apply the synthesis approach to combine the texture appearance from one texture with the dynamics of another to generate entirely novel dynamic textures. We show that our approach generates novel, high-quality samples that match both the framewise appearance and temporal evolution of the input texture. Finally, we quantitatively evaluate our texture synthesis approach with a thorough user study.
IRJET, 2022
This paper describes automatic music generation using deep learning. The music is generated as a sequence of notes in ABC notation. Music technology can now realistically make use of large-scale data. For music generation with deep learning, LSTMs or GRUs are most often used for modelling. Music generation is essentially a sequence generation task, and since LSTMs generate sequences efficiently, they are a suitable choice.