
Pushing the frontiers of audio generation




Authors

Zalán Borsos, Matt Sharifi and Marco Tagliasacchi

Our pioneering speech generation technologies are helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Speech is central to human connection. It helps people around the world exchange information and ideas, express emotions and create mutual understanding. As our technology built for generating natural, dynamic voices continues to improve, we’re unlocking richer, more engaging digital experiences.

Over the past few years, we’ve been pushing the frontiers of audio generation, developing models that can create high-quality, natural speech from a range of inputs, like text, tempo controls and particular voices. This technology powers single-speaker audio in many Google products and experiments — including Gemini Live, Project Astra, Journey Voices and YouTube’s auto dubbing — and is helping people around the world interact with more natural, conversational and intuitive digital assistants and AI tools.

Working together with partners across Google, we recently helped develop two new features that can generate long-form, multi-speaker dialogue for making complex content more accessible:

  • NotebookLM Audio Overviews turns uploaded documents into engaging and lively dialogue. With one click, two AI hosts summarize user material, make connections between topics and banter back and forth.
  • Illuminate creates formal AI-generated discussions about research papers to help make knowledge more accessible and digestible.

Here, we provide an overview of our latest speech generation research underpinning all of these products and experimental tools.

Pioneering techniques for audio generation

For years, we’ve been investing in audio generation research and exploring new ways for generating more natural dialogue in our products and experimental tools. In our previous research on SoundStorm, we first demonstrated the ability to generate 30-second segments of natural dialogue between multiple speakers.

This extended our earlier work, SoundStream and AudioLM, which allowed us to apply many text-based language modeling techniques to the problem of audio generation.

SoundStream is a neural audio codec that efficiently compresses and decompresses an audio input, without compromising its quality. As part of the training process, SoundStream learns how to map audio to a range of acoustic tokens. These tokens capture all of the information needed to reconstruct the audio with high fidelity, including properties such as prosody and timbre.
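
The post doesn’t detail SoundStream’s internals, but neural codecs of this kind typically rely on residual vector quantization to map each audio frame to a handful of discrete tokens. Below is a minimal numpy sketch of that general idea; the number of quantizer levels, codebook size and embedding dimension are invented for illustration and are not SoundStream’s actual parameters.

```python
# Hypothetical sketch of residual vector quantization (RVQ), the kind of
# scheme neural codecs use to map each audio frame to a few discrete tokens.
# All sizes are illustrative assumptions, not SoundStream's configuration.
import numpy as np

rng = np.random.default_rng(0)
NUM_LEVELS, CODEBOOK_SIZE, DIM = 4, 256, 64
codebooks = rng.normal(size=(NUM_LEVELS, CODEBOOK_SIZE, DIM))

def quantize_frame(embedding: np.ndarray) -> list[int]:
    """Greedily quantize one frame embedding into NUM_LEVELS token indices."""
    residual = embedding.copy()
    tokens = []
    for level in range(NUM_LEVELS):
        # Pick the codebook entry closest to the current residual.
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(np.argmin(dists))
        tokens.append(idx)
        # Later levels only encode what earlier levels missed.
        residual -= codebooks[level][idx]
    return tokens

frame = rng.normal(size=DIM)  # stand-in for one encoder output frame
print(quantize_frame(frame))  # e.g. four token indices for this frame
```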

AudioLM treats audio generation as a language modeling task to produce the acoustic tokens of codecs like SoundStream. As a result, the AudioLM framework makes no assumptions about the type or makeup of the audio being generated, and can flexibly handle a variety of sounds without needing architectural adjustments — making it a great candidate for modeling multi-speaker dialogues.
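
Treating audio as a language modeling task means the training objective is ordinary next-token prediction over acoustic tokens, exactly as a text LM predicts the next word. The toy sketch below makes that concrete; the vocabulary size and the uniform stand-in model are assumptions for illustration, not AudioLM’s actual components.

```python
# Minimal sketch of acoustic tokens as a language-modeling problem: the
# objective is standard next-token cross-entropy, identical to text LMs.
import numpy as np

VOCAB = 1024  # assumed acoustic-token vocabulary size, for illustration

def model_logits(prefix: list[int]) -> np.ndarray:
    """Stand-in for a Transformer: logits over the next acoustic token."""
    return np.zeros(VOCAB)  # uniform; a real model conditions on `prefix`

def next_token_loss(tokens: list[int]) -> float:
    """Average cross-entropy of predicting each token from its prefix."""
    losses = []
    for t in range(1, len(tokens)):
        logits = model_logits(tokens[:t])
        log_probs = logits - np.log(np.sum(np.exp(logits)))
        losses.append(-log_probs[tokens[t]])
    return float(np.mean(losses))

sequence = [17, 902, 44, 44, 508]  # toy acoustic-token sequence
print(next_token_loss(sequence))   # == log(1024) for the uniform stand-in
```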

Example of a multi-speaker dialogue generated by NotebookLM Audio Overviews, based on a few potato-related documents.

Building upon this research, our latest speech generation technology can produce 2 minutes of dialogue, with improved naturalness, speaker consistency and acoustic quality, when given a script of dialogue and speaker turn markers. The model also performs this task in under 3 seconds on a single Tensor Processing Unit (TPU) v5e chip, in one inference pass. This means it generates audio over 40-times faster than real time.
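
A quick back-of-the-envelope check of that speedup, using only the figures stated above:

```python
# Sanity check of the stated speedup: 2 minutes of dialogue generated in
# under 3 seconds on one TPU v5e chip.
audio_seconds = 2 * 60      # length of the generated dialogue
wall_clock_seconds = 3      # stated upper bound on generation time
speedup = audio_seconds / wall_clock_seconds
print(f"{speedup:.0f}x faster than real time")  # 40x
```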

Scaling our audio generation models

Scaling our single-speaker generation models to multi-speaker models then became a matter of data and model capacity. To help our latest speech generation model produce longer speech segments, we created an even more efficient speech codec for compressing audio into a sequence of tokens, at as low as 600 bits per second, without compromising the quality of its output.
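
For a sense of scale, here is rough arithmetic comparing that 600 bits per second against uncompressed PCM audio; the reference format (24 kHz, 16-bit mono) is an assumption, since the post doesn’t state one:

```python
# Rough compression-ratio arithmetic under an assumed reference format
# (24 kHz, 16-bit mono PCM; the post does not state the source format).
raw_bps = 24_000 * 16   # 384,000 bits per second of raw audio
codec_bps = 600         # the codec's lowest stated bitrate
print(f"~{raw_bps / codec_bps:.0f}x compression")  # ~640x
```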

The tokens produced by our codec have a hierarchical structure and are grouped by timeframes. The first tokens within a group capture phonetic and prosodic information, while the last tokens encode fine acoustic details.
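
One way to picture that layout is as groups of coarse-to-fine tokens per timeframe, serialized into a single sequence. The sketch below is purely illustrative; the group sizes and token values are invented, not the codec’s real configuration.

```python
# Hypothetical layout of hierarchical codec tokens grouped by timeframe:
# within each frame's group, earlier tokens carry coarse phonetic/prosodic
# information and later tokens add fine acoustic detail. Sizes are invented.
from dataclasses import dataclass

@dataclass
class FrameTokens:
    coarse: list[int]  # phonetic and prosodic information
    fine: list[int]    # fine-grained acoustic detail

def flatten(frames: list[FrameTokens]) -> list[int]:
    """Serialize frames into one sequence, coarse-to-fine within each frame."""
    seq: list[int] = []
    for f in frames:
        seq.extend(f.coarse)
        seq.extend(f.fine)
    return seq

frames = [FrameTokens(coarse=[12, 7], fine=[301, 88, 540]),
          FrameTokens(coarse=[12, 9], fine=[11, 204, 97])]
print(flatten(frames))  # [12, 7, 301, 88, 540, 12, 9, 11, 204, 97]
```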

Even with our new speech codec, producing a 2-minute dialogue requires generating over 5,000 tokens. To model these long sequences, we developed a specialized Transformer architecture that can efficiently handle hierarchies of information, matching the structure of our acoustic tokens.
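
Combining the figures above gives a rough sense of the token budget; this is back-of-the-envelope arithmetic, not a published specification:

```python
# Implied token rate and bits per token from the stated numbers:
# >5,000 tokens for a 2-minute dialogue at roughly 600 bits per second.
tokens, seconds, bps = 5_000, 120, 600
tokens_per_second = tokens / seconds        # ~42 tokens/s
bits_per_token = bps / tokens_per_second    # ~14.4 bits/token
print(f"{tokens_per_second:.0f} tokens/s, ~{bits_per_token:.1f} bits/token")
```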

With this technique, we can efficiently generate acoustic tokens that correspond to the dialogue, within a single autoregressive inference pass. Once generated, these tokens can be decoded back into an audio waveform using our speech codec.
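
Putting the two stages together, the pipeline can be pictured as: sample acoustic tokens autoregressively conditioned on the script, then decode them to a waveform with the codec. In this minimal sketch, `transformer` and `codec_decode` are hypothetical stand-ins for the real model and codec:

```python
# Minimal sketch of the two-stage pipeline: autoregressive token sampling
# followed by codec decoding. Both components here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def transformer(script: str, prefix: list[int]) -> np.ndarray:
    """Stand-in: probabilities over the next acoustic token (vocab of 1024)."""
    probs = rng.random(1024)
    return probs / probs.sum()

def codec_decode(tokens: list[int]) -> np.ndarray:
    """Stand-in for the codec decoder: tokens -> waveform samples."""
    return np.zeros(len(tokens) * 480)  # assume 480 samples per token frame

def generate(script: str, num_tokens: int) -> np.ndarray:
    tokens: list[int] = []
    for _ in range(num_tokens):  # one autoregressive pass over the sequence
        probs = transformer(script, tokens)
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return codec_decode(tokens)

wave = generate("Speaker 1: Hi! Speaker 2: Hey there.", num_tokens=64)
print(wave.shape)  # (30720,)
```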

Animation showing how our speech generation model produces a stream of audio tokens autoregressively, which are decoded back to a waveform consisting of a two-speaker dialogue.

To teach our model how to generate realistic exchanges between multiple speakers, we pretrained it on hundreds of thousands of hours of speech data. Then we finetuned it on a much smaller dataset of dialogue with high acoustic quality and accurate speaker annotations, consisting of unscripted conversations from a number of voice actors and realistic disfluencies — the “umm”s and “aah”s of real conversation. This step taught the model how to reliably switch between speakers during a generated dialogue and to output only studio-quality audio with realistic pauses, tone and timing.
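
As a rough picture of what such conditioning data might look like, here is a hypothetical script format with explicit speaker-turn markers and preserved disfluencies; the actual marker syntax the model uses is not public:

```python
# Hypothetical dialogue-script format with speaker-turn markers and the
# disfluencies ("umm", "aah") the finetuning data preserves. The real
# marker syntax is not public; this is purely illustrative.
turns = [
    ("S1", "So, umm, what did you think of the paper?"),
    ("S2", "Honestly? Aah, the codec results surprised me."),
]

def to_script(turns: list[tuple[str, str]]) -> str:
    """Join turns into one conditioning string with turn markers."""
    return "\n".join(f"<{speaker}> {text}" for speaker, text in turns)

print(to_script(turns))
```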

In line with our AI Principles and our commitment to developing and deploying AI technologies responsibly, we’re incorporating our SynthID technology to watermark non-transient AI-generated audio content from these models, to help safeguard against the potential misuse of this technology.

New speech experiences ahead

We’re now focused on improving our model’s expressivity, acoustic quality and adding more fine-grained controls for features, like prosody, while exploring how best to combine these advances with other modalities, such as video.

The potential applications for advanced speech generation are immense, especially when combined with our Gemini family of models. From enhancing learning experiences to making content more universally accessible, we’re excited to continue pushing the boundaries of what’s possible with voice-based technologies.
