We present BigVGAN, a universal neural vocoder. It is trained only on speech data but shows extraordinary zero-shot generalization to non-speech vocalizations (laughter, applause), singing voices, music, and instrumental audio, even when recorded in varied noisy environments!
We present CleanUNet, a speech denoising model that operates on the raw waveform. It is based on an encoder-decoder architecture combined with several self-attention blocks that refine its bottleneck representations, which is crucial for good results. It outperforms state-of-the-art models in denoised speech quality across various objective and subjective evaluation metrics.
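For intuition, here is a minimal PyTorch sketch of that shape of model: a 1-D convolutional encoder-decoder on the raw waveform with self-attention refining the bottleneck. All layer sizes, depths, and the class name `TinyCleanUNet` are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch only: 1-D conv encoder-decoder with a self-attention
# bottleneck, in the spirit of CleanUNet. Sizes are placeholders.
import torch
import torch.nn as nn

class TinyCleanUNet(nn.Module):
    def __init__(self, channels=32, depth=3, attn_layers=2):
        super().__init__()
        self.encoders, self.decoders = nn.ModuleList(), nn.ModuleList()
        c_in = 1
        for i in range(depth):
            c_out = channels * (2 ** i)
            self.encoders.append(nn.Sequential(
                nn.Conv1d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.ReLU()))
            c_in = c_out
        # Self-attention refines the downsampled bottleneck representation.
        layer = nn.TransformerEncoderLayer(d_model=c_in, nhead=4,
                                           batch_first=True)
        self.bottleneck = nn.TransformerEncoder(layer, num_layers=attn_layers)
        for i in reversed(range(depth)):
            c_out = 1 if i == 0 else channels * (2 ** (i - 1))
            self.decoders.append(nn.Sequential(
                nn.ConvTranspose1d(c_in, c_out, kernel_size=4, stride=2,
                                   padding=1),
                nn.Identity() if i == 0 else nn.ReLU()))
            c_in = c_out

    def forward(self, x):                # x: (batch, 1, time)
        skips = []
        for enc in self.encoders:
            x = enc(x)
            skips.append(x)
        x = self.bottleneck(x.transpose(1, 2)).transpose(1, 2)
        for dec in self.decoders:
            x = dec(x + skips.pop())     # U-Net style skip connections
        return x

noisy = torch.randn(2, 1, 16384)         # two fake 1-second clips at 16 kHz
denoised = TinyCleanUNet()(noisy)
print(denoised.shape)                    # torch.Size([2, 1, 16384])
```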
We present an unsupervised alignment learning framework that learns speech-text alignments online in text-to-speech (TTS) models. We show that this framework can be applied to any TTS model, removing the dependency of TTS systems on external aligners. It also improves speech quality as judged by human evaluators.
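A minimal sketch of one way such an objective can be implemented, assuming a forward-sum style loss computed with PyTorch's CTC loss over a soft text-to-mel alignment. The embedding inputs, distance-based energies, and blank handling here are illustrative placeholders, not the paper's exact code.

```python
# Illustrative sketch: CTC marginalizes over all monotonic text-to-mel
# alignments, giving an unsupervised alignment objective.
import torch
import torch.nn.functional as F

def alignment_loss(text_emb, mel_emb):
    """text_emb: (N_text, d); mel_emb: (T_mel, d) for one utterance."""
    dist = torch.cdist(mel_emb, text_emb)                    # (T_mel, N_text)
    # Prepend a "blank" class (index 0) made effectively impossible.
    energies = torch.cat([torch.full((dist.size(0), 1), -1e4), -dist], dim=1)
    log_prob = F.log_softmax(energies, dim=1)                # (T_mel, 1+N_text)
    T_mel, N_text = dist.shape
    targets = torch.arange(1, N_text + 1).unsqueeze(0)       # (1, N_text)
    return F.ctc_loss(log_prob.unsqueeze(1),                 # (T_mel, 1, C)
                      targets,
                      input_lengths=torch.tensor([T_mel]),
                      target_lengths=torch.tensor([N_text]))

loss = alignment_loss(torch.randn(12, 64), torch.randn(80, 64))
print(loss.item())
```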
RAD-TTS is a parallel, flow-based generative network for text-to-speech synthesis. It learns speech-text alignments without relying on external aligners and supports diversity in generated speech by modeling speech rhythm as a separate generative distribution.
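To make "flow-based" concrete, here is a minimal sketch of an affine coupling layer, the invertible building block such flows are commonly built from. The dimensions and the small coupling network are illustrative assumptions, not RAD-TTS's actual layers.

```python
# Illustrative sketch: a flow maps data to a simple base distribution
# through invertible steps, so sampling (e.g. of rhythm) is exact and parallel.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, 64), nn.Tanh(),
                                 nn.Linear(64, dim))  # predicts scale & shift

    def forward(self, x):                 # data -> latent, plus log-det
        xa, xb = x.chunk(2, dim=-1)
        log_s, t = self.net(xa).chunk(2, dim=-1)
        return torch.cat([xa, xb * log_s.exp() + t], dim=-1), log_s.sum(-1)

    def inverse(self, z):                 # latent -> data (for sampling)
        za, zb = z.chunk(2, dim=-1)
        log_s, t = self.net(za).chunk(2, dim=-1)
        return torch.cat([za, (zb - t) * (-log_s).exp()], dim=-1)

flow = AffineCoupling()
x = torch.randn(4, 16)
z, log_det = flow(x)
assert torch.allclose(flow.inverse(z), x, atol=1e-5)  # exactly invertible
```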
Recommended citation: Anand Bhattad, Aysegul Dundar, Guilin Liu, Andrew Tao, Bryan Catanzaro, "View Generalization for Single Image Textured 3D Models," Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
We release version 1.0 of Megatron, which makes training large NLP models even faster, sustaining 62.4 teraFLOPs end-to-end, 48% of the theoretical peak FLOPs for a single GPU in a DGX-2H server.
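For reference, those two figures are mutually consistent: sustaining 62.4 teraFLOPs at 48% utilization implies a theoretical per-GPU peak of about 62.4 / 0.48 ≈ 130 teraFLOPs.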
We propose unsupervised techniques to synthesize high frame rate videos directly from low frame rate videos using cycle consistency. We also introduce a pseudo-supervised loss term that encourages the interpolated frames to be consistent with predictions of a pre-trained interpolation model. The pseudo-supervised loss term, used together with cycle consistency, can effectively adapt a pre-trained model to a new target domain. We show that this significantly reduces the domain gap problem in video frame interpolation.
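A minimal sketch of the two loss terms, assuming an interpolator `F_net(a, b)` that predicts the frame midway between `a` and `b`; the stand-in `AvgInterp` model and the `0.1` weighting are illustrative, not the paper's setup.

```python
# Illustrative sketch of cycle consistency plus a pseudo-supervised loss.
import torch
import torch.nn as nn

def cycle_consistency_loss(F_net, x0, x1, x2):
    m01 = F_net(x0, x1)           # predicted frame between x0 and x1
    m12 = F_net(x1, x2)           # predicted frame between x1 and x2
    x1_hat = F_net(m01, m12)      # re-interpolating midpoints should give x1
    return nn.functional.l1_loss(x1_hat, x1)

def pseudo_supervised_loss(F_net, teacher, x0, x1):
    # A frozen pre-trained interpolator provides a target for the student.
    with torch.no_grad():
        target = teacher(x0, x1)
    return nn.functional.l1_loss(F_net(x0, x1), target)

class AvgInterp(nn.Module):       # trivial stand-in interpolator for the demo
    def forward(self, a, b):
        return (a + b) / 2

x0, x1, x2 = (torch.randn(1, 3, 64, 64) for _ in range(3))
net, teacher = AvgInterp(), AvgInterp()
loss = cycle_consistency_loss(net, x0, x1, x2) + \
       0.1 * pseudo_supervised_loss(net, teacher, x0, x1)
print(loss.item())
```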
We train an 8.3 billion parameter transformer language model with 8-way model parallelism and 64-way data parallelism on 512 GPUs, making it the largest transformer-based language model ever trained, at 24x the size of BERT and 5.6x the size of GPT-2.
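As a toy illustration of the tensor model parallelism involved, here is a single-process sketch of a column-parallel matrix multiply, where weight shards stand in for GPUs; real Megatron implements this across devices with torch.distributed collectives.

```python
# Illustrative sketch: a linear layer's weight split column-wise across
# "workers"; each computes its shard and the shards are concatenated.
import torch

torch.manual_seed(0)
hidden, ffn, world_size = 8, 16, 4
x = torch.randn(2, hidden)                      # (batch, hidden)
W = torch.randn(hidden, ffn)                    # full weight matrix

# Column-parallel: each worker owns ffn // world_size output columns.
shards = W.chunk(world_size, dim=1)
partial = [x @ w for w in shards]               # each: (batch, ffn/world_size)
y_parallel = torch.cat(partial, dim=1)          # an all-gather in real code

assert torch.allclose(y_parallel, x @ W, atol=1e-6)
print("column-parallel matmul matches the full matmul")
```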
This paper shows how to scale up training sets for semantic segmentation using a video prediction-based data synthesis method. Our proposed joint propagation strategy and boundary relaxation technique alleviate label noise in the synthesized samples and lead to state-of-the-art performance on three benchmark datasets: Cityscapes, CamVid, and KITTI.
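A minimal sketch of one way the boundary relaxation idea can be written down, assuming the relaxed loss maximizes the total probability of all classes present in a small window around each pixel; the window size and shapes are illustrative, not the paper's exact formulation.

```python
# Illustrative sketch: near class boundaries, tolerate label noise by
# rewarding any class that occurs in the pixel's local neighborhood.
import torch
import torch.nn.functional as F

def relaxed_loss(logits, labels, K=3):
    """logits: (B, C, H, W); labels: (B, H, W) int64 class ids."""
    B, C, H, W = logits.shape
    one_hot = F.one_hot(labels, C).permute(0, 3, 1, 2).float()
    # A class is "present" at a pixel if it occurs in its KxK window.
    present = F.max_pool2d(one_hot, K, stride=1, padding=K // 2)
    prob = F.softmax(logits, dim=1)
    # Negative log of the union probability of the present classes.
    union = (prob * present).sum(dim=1).clamp_min(1e-8)
    return -union.log().mean()

logits = torch.randn(2, 5, 16, 16)
labels = torch.randint(0, 5, (2, 16, 16))
print(relaxed_loss(logits, labels).item())
```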
SDCNet is a 3D convolutional neural network proposed for frame prediction. The model takes as input a sequence of past frames and their inter-frame optical flows and generates a per-pixel kernel and motion vector. A future frame is then synthesized by sampling past frames, guided by the motion vectors and weighted by the learned kernels.
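A minimal sketch of that synthesis step, assuming per-pixel motion vectors (a flow field) and per-pixel KxK kernel weights predicted elsewhere; the bilinear warp plus kernel-weighted neighborhood below is an illustrative reading of the idea, not the paper's implementation.

```python
# Illustrative sketch: sample the past frame at motion-displaced locations,
# then weight each pixel's KxK neighborhood by its learned kernel.
import torch
import torch.nn.functional as F

def sdc_sample(frame, flow, kernels, K=3):
    """frame: (B,C,H,W); flow: (B,2,H,W) in pixels; kernels: (B,K*K,H,W)."""
    B, C, H, W = frame.shape
    # Build a sampling grid displaced by the motion vectors.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack([xs, ys], dim=0).float().unsqueeze(0) + flow
    # Normalize coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([grid[:, 0] / (W - 1),
                        grid[:, 1] / (H - 1)], dim=-1) * 2 - 1
    warped = F.grid_sample(frame, grid, align_corners=True)
    # Apply the learned per-pixel kernels over KxK neighborhoods.
    patches = F.unfold(warped, K, padding=K // 2)      # (B, C*K*K, H*W)
    patches = patches.view(B, C, K * K, H, W)
    w = torch.softmax(kernels, dim=1).unsqueeze(1)     # normalized weights
    return (patches * w).sum(dim=2)                    # (B, C, H, W)

frame = torch.randn(1, 3, 32, 32)
flow = torch.zeros(1, 2, 32, 32)                       # zero motion for demo
kernels = torch.randn(1, 9, 32, 32)
print(sdc_sample(frame, flow, kernels).shape)          # torch.Size([1, 3, 32, 32])
```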
This paper shows how to do large-scale distributed, large-batch, mixed-precision training of language models, with investigations into the successes and limitations of large-batch training on publicly available language datasets.
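A minimal sketch of mixed-precision training with dynamic loss scaling, using PyTorch's modern AMP API as a stand-in for the tooling available at the time; the model, data, and hyperparameters are placeholders.

```python
# Illustrative sketch: forward in float16 where safe, with loss scaling
# so small half-precision gradients do not underflow.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(512, 512).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for step in range(3):
    x = torch.randn(64, 512, device=device)
    opt.zero_grad()
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()   # scale loss before backward
    scaler.step(opt)                # unscales grads, skips step on inf/nan
    scaler.update()                 # adjusts the scale factor dynamically
print("done")
```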
This paper shows how to classify whole binaries for malware detection with a convolutional neural network. This work was done in collaboration with researchers at the University of Maryland.
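A minimal sketch of a CNN over raw bytes, assuming an embed-then-gated-convolution design with global max pooling; layer sizes and hyperparameters are illustrative, not the paper's exact model.

```python
# Illustrative sketch: embed raw bytes, run a gated 1-D convolution over
# the whole binary, max-pool globally, and emit a malware logit.
import torch
import torch.nn as nn

class ByteConvNet(nn.Module):
    def __init__(self, emb=8, channels=64, kernel=32, stride=32):
        super().__init__()
        self.embed = nn.Embedding(257, emb, padding_idx=256)  # 256 bytes + pad
        self.conv = nn.Conv1d(emb, channels, kernel, stride)
        self.gate = nn.Conv1d(emb, channels, kernel, stride)
        self.fc = nn.Linear(channels, 1)

    def forward(self, bytes_in):                    # (batch, length) int64
        x = self.embed(bytes_in).transpose(1, 2)    # (batch, emb, length)
        h = self.conv(x) * torch.sigmoid(self.gate(x))  # gated convolution
        h = torch.max(h, dim=2).values              # global max pool
        return self.fc(h).squeeze(1)                # malware logit per binary

binaries = torch.randint(0, 256, (4, 4096))         # four fake 4 KB binaries
print(ByteConvNet()(binaries).shape)                # torch.Size([4])
```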