Meta AI Introduces MAGNET: The First Pure Non-Autoregressive Method for Text-Conditioned Audio Generation

Recent advancements in self-supervised representation learning, sequence modeling, and audio synthesis have significantly enhanced the performance of conditional audio generation. The prevailing approach involves representing audio signals as compressed representations, either discrete or continuous, upon which generative models are applied. Various works have explored methods, such as applying a Vector Quantized Variational Autoencoder (VQ-VAE) directly on raw waveforms or training conditional diffusion-based generative models on learned continuous representations.

To address limitations in existing approaches, researchers at FAIR Team META have introduced MAGNET, an acronym for masked audio generation using non-autoregressive transformers. MAGNET is a novel masked generative sequence modeling technique operating on a multi-stream representation of audio signals.

Unlike autoregressive models, MAGNET works non-autoregressively, significantly reducing inference time and latency. During training, MAGNET samples a masking rate from a masking scheduler and masks and predicts spans of input tokens conditioned on unmasked ones. It gradually constructs the output audio sequence during inference using several decoding steps. Additionally, they introduce a novel rescoring method leveraging an external pre-trained model to improve generation quality.

They also explore a Hybrid version of MAGNET, combining autoregressive and non-autoregressive models. In the hybrid approach, the beginning of the token sequence is generated autoregressively, while the rest of the sequence is decoded in parallel. Previous works have proposed similar non-autoregressive modeling techniques for machine translation and image generation tasks. However, MAGNET is distinct in its application to audio generation, leveraging the full frequency spectrum of the signal.

They evaluate MAGNET for text-to-music and text-to-audio generation tasks, reporting objective metrics and conducting a human study. The results demonstrate that MAGNET achieves comparable results to autoregressive baselines while significantly reducing latency. Additionally, they analyze the trade-offs between autoregressive and non-autoregressive models, providing insights into their performance characteristics. Their contributions include the introduction of MAGNET as a novel non-autoregressive model for audio generation, using external pre-trained models for rescoring, and exploring a hybrid approach combining autoregressive and non-autoregressive modeling.

Furthermore, their work contributes to exploring non-autoregressive modeling techniques in audio generation, offering insights into their effectiveness and applicability in real-world scenarios. By significantly reducing latency without sacrificing generation quality, MAGNET opens up possibilities for interactive applications such as music generation and editing under Digital Audio Workstations (DAW).

Additionally, the proposed rescoring method enhances the overall quality of generated audio, further solidifying the practical utility of the approach. Through rigorous evaluation and analysis, they comprehensively understand the trade-offs between autoregressive and non-autoregressive models, paving the way for future advancements in efficient and high-quality audio generation systems.

Check out the Paper and Github. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and Google News. Join our 38k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.