The report mentions several audio generation models, including UniAudio, MusicGen, and MusicLM. Let us start with the first of these.
UniAudio: Using large language models to achieve universal audio generation
UniAudio is an LLM-based audio generation model developed by Microsoft Research Asia in collaboration with several universities. It supports a wide range of audio generation tasks: given input conditions such as phonemes, text descriptions, or reference audio, it can generate speech, sounds, music, and singing voices. The model was trained on 100,000 hours of multi-source open audio data and scaled up to one billion parameters, with the audio tokenization method and language-model architecture designed specifically for performance and efficiency.
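To make the "LLM over audio tokens" idea concrete, here is a minimal sketch of the general pipeline such models use: a neural codec turns audio into discrete tokens, and a language model predicts the next token autoregressively. All names, the codebook size, and the frame rate below are illustrative assumptions, not UniAudio's actual specification; the random stand-ins mark where the real codec and model would go.

```python
# Sketch of LLM-style audio generation over discrete codec tokens.
# VOCAB_SIZE and FRAME_RATE are assumed values for illustration only.
import random

VOCAB_SIZE = 1024   # assumed codec codebook size
FRAME_RATE = 50     # assumed tokens per second of audio

def tokenize_audio(seconds: float) -> list[int]:
    """Stand-in for a neural codec: maps audio to discrete token ids."""
    return [random.randrange(VOCAB_SIZE) for _ in range(int(seconds * FRAME_RATE))]

def generate(prompt_tokens: list[int], n_new: int) -> list[int]:
    """Stand-in for autoregressive decoding: append one token at a time."""
    out = list(prompt_tokens)
    for _ in range(n_new):
        # a real model would sample from its predicted distribution here
        out.append(random.randrange(VOCAB_SIZE))
    return out

prompt = tokenize_audio(1.0)        # 1 s of conditioning audio -> 50 tokens
full = generate(prompt, n_new=100)  # continue for 2 more seconds
print(len(full))                    # 150 tokens total
```

The key design point this illustrates is that once audio is discretized, generation reduces to ordinary next-token prediction, so the same model can serve many audio tasks.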
Supported tasks include:
- Zero-shot text-to-speech (Zero-shot TTS)
- Cloning a famous person's voice
- Cloning a voice from everyday life
- Long-sentence TTS
- Zero-shot voice conversion (Zero-shot VC)
- Zero-shot singing voice synthesis
- Zero-shot speech enhancement
- Zero-shot target speaker extraction
- Zero-shot text-to-sound generation
- 20-second audio generation
- Instructed TTS
- Audio editing
- Speech dereverberation
- Speech editing
- Chinese TTS
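Supporting this many tasks in one model typically relies on casting each task as a single token sequence that concatenates a task identifier, the condition tokens, and the target tokens. The tag values, separator, and layout below are assumptions for illustration, not UniAudio's exact format.

```python
# Hypothetical "everything is one token sequence" formatting:
# [task tag, condition tokens, separator, target tokens].
TASK_TAGS = {"tts": 0, "vc": 1, "se": 2}  # assumed task-id tokens
SEP = -1                                  # assumed separator token

def build_sequence(task: str, condition: list[int], target: list[int]) -> list[int]:
    """Concatenate a task tag, condition tokens, and target tokens."""
    return [TASK_TAGS[task], *condition, SEP, *target]

seq = build_sequence("tts", condition=[11, 12, 13], target=[201, 202])
print(seq)  # [0, 11, 12, 13, -1, 201, 202]
```

At inference time the model would be given everything up to the separator and asked to continue, which is what lets the same network handle TTS, voice conversion, enhancement, and the other tasks listed above.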
Audio demos can be heard here: https://dongchaoyang.top/UniAudio_demo/