Nvidia has shown off a new generative AI model, Fugatto 1, that can synthesize unique sounds from simple text instructions and contextual audio inputs. Nvidia envisions Fugatto 1 "as a tool for creatives, empowering them to quickly bring their sonic fantasies and unheard sounds to life—an instrument for imagination, not a replacement for creativity."
In its research paper, the team notes that Large Language Models (LLMs) trained on text can learn to infer instructions from their inputs, but models trained purely on audio cannot, because audio data carries no record of how it was created.
Nvidia's Fugatto 1 pairs a specialized dataset drawn from a wide gamut of sounds with a method for understanding and combining instructions called ComposableART. This gives the model emergent abilities to blend different sounds and instructions, even combinations it wasn't trained to handle.
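Nvidia hasn't detailed the implementation publicly, but a common way to picture instruction composition is blending conditioning embeddings at inference time. The short Python sketch below illustrates only that general idea; the embeddings, the compose helper, and the weights are hypothetical placeholders, not Nvidia's actual API or method.

    import numpy as np

    # Hypothetical instruction embeddings; a real system would produce these
    # with a trained text encoder rather than random vectors.
    rng = np.random.default_rng(0)
    embed = {
        "dog barking": rng.normal(size=16),
        "electronic dance music": rng.normal(size=16),
    }

    def compose(instructions, weights):
        # Blend instruction embeddings into a single conditioning vector,
        # normalizing the weights so they sum to 1.
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
        stacked = np.stack([embed[name] for name in instructions])
        return w @ stacked

    # Emphasize the barking slightly more than the music.
    conditioning = compose(["dog barking", "electronic dance music"], [0.6, 0.4])
    print(conditioning.shape)  # (16,) -- would be fed to an audio generator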
Nvidia has posted a few examples of the model in action on Fugatto's GitHub page, including a dog barking in time with electronic dance music, a typewriter that whispers every letter as it's typed, and even a saxophone that meows or barks.
So far, Nvidia has no plans to release the model publicly.