How Do Schrodinger Bridges Beat Diffusion Models On Text-To-Speech (TTS) Synthesis?

Recent advances in Artificial Intelligence, driven in part by the introduction of Large Language Models (LLMs), have brought wide attention to Natural Language Processing, Natural Language Generation, and Computer Vision. Diffusion models have proven successful at text-to-speech (TTS) synthesis and deliver high generation quality. However, their prior distribution is Gaussian noise, which carries little information about the desired generation target.

A team of researchers from Tsinghua University and Microsoft Research Asia has introduced a new text-to-speech system called Bridge-TTS. It is the first attempt to replace the noisy Gaussian prior used in established diffusion-based TTS approaches with a clean and predictable alternative. This replacement prior is taken from the latent representation extracted from the text input and provides strong structural information about the target.

The team states that the main contribution is a fully tractable Schrodinger bridge that connects the ground-truth mel-spectrogram and the clean prior. In contrast to diffusion models, which operate through a data-to-noise process, Bridge-TTS uses a data-to-data process, which makes the prior distribution far more informative.
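The difference between the two processes can be illustrated with a minimal NumPy sketch. It is not the Bridge-TTS implementation: the bridge below is a plain Brownian bridge (zero drift, constant noise level `g`), the simplest special case of a bridge between paired samples, and the function names, the `beta` value, and the random arrays standing in for a mel-spectrogram and a text latent are all assumptions made for illustration.

```python
import numpy as np

def diffusion_forward(x0, t, rng, beta=20.0):
    """Data-to-noise: variance-preserving forward sample at time t in [0, 1].

    Assumed constant-beta schedule; at t=1 almost no signal from x0 remains.
    """
    a = np.exp(-0.5 * beta * t)  # signal coefficient, decays toward 0
    return a * x0 + np.sqrt(1.0 - a**2) * rng.standard_normal(x0.shape)

def bridge_marginal(x0, x1, t, rng, g=1.0):
    """Data-to-data: Brownian bridge pinned at x0 (t=0) and x1 (t=1)."""
    mean = (1.0 - t) * x0 + t * x1          # interpolates between both endpoints
    std = g * np.sqrt(t * (1.0 - t))        # noise vanishes at both endpoints
    return mean + std * rng.standard_normal(x0.shape)

rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 100))                 # stand-in for a mel-spectrogram
prior = mel + 0.3 * rng.standard_normal((80, 100))   # stand-in for the text latent

noisy_end = diffusion_forward(mel, 1.0, rng)         # essentially pure Gaussian noise
bridge_end = bridge_marginal(mel, prior, 1.0, rng)   # exactly the informative prior
bridge_mid = bridge_marginal(mel, prior, 0.5, rng)   # noisy blend of mel and prior
```

The diffusion endpoint forgets the data, while the bridge endpoint is the text-derived prior itself, so sampling starts from something already close to the target.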

Experiments on the LJ-Speech dataset support the method's efficacy. In both 50-step and 1000-step synthesis settings, Bridge-TTS outperformed its diffusion counterpart, Grad-TTS. In few-step scenarios, it also outperformed strong and fast TTS models. The team highlights synthesis quality and sampling efficiency as the primary strengths of the Bridge-TTS approach.

The team has summarized the primary contributions as follows.

Mel-spectrograms are generated from a noise-free text latent representation. In diffusion models this representation serves only as condition information while generation starts from noise; here it acts as the prior itself. A Schrodinger bridge is used to explore this data-to-data process in place of the traditional data-to-noise procedure.

For paired data, a fully tractable Schrodinger bridge is proposed, built on a reference stochastic differential equation (SDE) in a flexible form. This provides a theoretical explanation and also permits empirical investigation of the design space.
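To make "fully tractable" concrete, here is a hedged worked form. It assumes the reference SDE has a linear drift and that both endpoints of a pair are fixed; the symbols f, g, α, and σ are notation chosen for this sketch and may differ from the paper's. Conditioning such an SDE on its two endpoints gives a Gaussian marginal in closed form:

```latex
% Assumed reference SDE with linear drift f(t) and noise level g(t)
\begin{align}
  \mathrm{d}x_t &= f(t)\,x_t\,\mathrm{d}t + g(t)\,\mathrm{d}w_t, \qquad t \in [0, 1] \\
  \alpha_t &= \exp\!\Big(\int_0^t f(\tau)\,\mathrm{d}\tau\Big), \qquad
  \bar\alpha_t = \exp\!\Big(-\int_t^1 f(\tau)\,\mathrm{d}\tau\Big) \\
  \sigma_t^2 &= \int_0^t \frac{g^2(\tau)}{\alpha_\tau^2}\,\mathrm{d}\tau, \qquad
  \bar\sigma_t^2 = \int_t^1 \frac{g^2(\tau)}{\alpha_\tau^2}\,\mathrm{d}\tau \\
  p_t(x_t \mid x_0, x_1) &= \mathcal{N}\!\Big(
    \frac{\alpha_t\,\bar\sigma_t^2\,x_0 + \bar\alpha_t\,\sigma_t^2\,x_1}{\sigma_1^2},\;
    \frac{\alpha_t^2\,\bar\sigma_t^2\,\sigma_t^2}{\sigma_1^2}\,I \Big)
\end{align}
```

Here x_0 is the mel-spectrogram and x_1 the text-derived prior. The mean equals x_0 at t = 0 and x_1 at t = 1, and the variance is zero at both ends. With f = 0 and constant g, this reduces to the familiar Brownian bridge with mean (1 - t) x_0 + t x_1 and variance g² t (1 - t).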

The team studied how noise scheduling, model parameterization, and the sampling technique contribute to improved TTS quality. An asymmetric noise schedule, data prediction, and first-order bridge samplers were also implemented.

The fully tractable Schrodinger bridge makes a complete theoretical description of the underlying processes possible. The empirical investigations examine how asymmetric noise schedules, model parameterization choices, and the efficiency of the sampling process affect TTS quality.
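How data prediction and a first-order stochastic sampling step fit together can be sketched as follows. This is an illustration under stated assumptions, not the paper's sampler: it uses a Brownian bridge with zero drift and constant `g` instead of the asymmetric noise schedule, and the hypothetical `oracle` function stands in for a trained network that predicts the clean mel-spectrogram from the current state.

```python
import numpy as np

def bridge_step(x_t, x0_hat, t, s, rng, g=1.0):
    """One first-order stochastic step from time t down to s (0 <= s < t <= 1).

    Samples x_s from the Brownian-bridge posterior given the current state x_t
    and the predicted clean data x0_hat (assumed constant noise level g).
    """
    w = s / t
    mean = w * x_t + (1.0 - w) * x0_hat
    std = g * np.sqrt(s * (t - s) / t)
    return mean + std * rng.standard_normal(x_t.shape)

def sample(prior, predict_x0, n_steps, rng, g=1.0):
    """Run n_steps bridge steps from the prior (t=1) to the data end (t=0)."""
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    x = prior
    for t, s in zip(ts[:-1], ts[1:]):
        x0_hat = predict_x0(x, t)   # data prediction: the model outputs clean data
        x = bridge_step(x, x0_hat, t, s, rng, g)
    return x

rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 100))                 # stand-in target
prior = mel + 0.3 * rng.standard_normal((80, 100))   # stand-in text latent

def oracle(x, t):
    """Hypothetical perfect predictor, used only to exercise the sampler."""
    return mel

out_4 = sample(prior, oracle, n_steps=4, rng=rng)
out_50 = sample(prior, oracle, n_steps=50, rng=rng)
```

Because the last step lands at s = 0, the output equals the final data prediction, so with a real network the sample quality depends on how good that prediction is at each step count.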

The method delivers strong results in both inference speed and generation quality. It greatly outperformed its diffusion-based counterpart Grad-TTS in both 1000-step and 50-step generation. It also outperformed FastGrad-TTS in 4-step generation, and both the transformer-based model FastSpeech 2 and the state-of-the-art distillation approach CoMoSpeech in 2-step generation.

The method achieves these results after just one training session. The gains hold across the different sampling-step settings, which demonstrates the reliability and effectiveness of the approach.

All credit for this research goes to the researchers of this project.

The post How Do Schrodinger Bridges Beat Diffusion Models On Text-To-Speech (TTS) Synthesis? appeared first on MarkTechPost.

[Source: AI Techpark]
