

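| Task Name | Prompt | Inputs | Outputs |
| --- | --- | --- | --- |
| Text-To-Speech | Generate a speech with text "here we go". | / | (not shown) |
| Style Transfer | Speak using the voice of this audio. The text is "Here we go". | (not shown) | (not shown) |
| Speech Recognition | Transcribe this speech. | (not shown) | Here we go. |
| Speech Enhancement | Enhance the quality of the speech signal. | (not shown) | (not shown) |
| Speech Separation | Separate each speech from the speech mixture. | (not shown) | (not shown) |
| Mono-to-Binaural | Transfer this mono audio into binaural audio. | (not shown) | (not shown) |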


| Task Name | Prompt | Inputs | Outputs |
| --- | --- | --- | --- |
| Text-To-Sing | Please generate a piece of singing voice. Text sequence is 小酒窝长睫毛AP是你最美的记号. Note sequence is C#4/Db4 \| F#4/Gb4 \| G#4/Ab4 \| A#4/Bb4 F#4/Gb4 \| F#4/Gb4 C#4/Db4 \| C#4/Db4 \| rest \| C#4/Db4 \| A#4/Bb4 \| G#4/Ab4 \| A#4/Bb4 \| G#4/Ab4 \| F4 \| C#4/Db4. Note duration sequence is 0.407140 \| 0.376190 \| 0.242180 \| 0.509550 0.183420 \| 0.315400 0.235020 \| 0.361660 \| 0.223070 \| 0.377270 \| 0.340550 \| 0.299620 \| 0.344510 \| 0.283770 \| 0.323390 \| 0.360340. | / | (not shown) |
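
The Text-To-Sing prompt above packs three parallel sequences into one string: the lyrics, the notes, and the note durations, with `|` separating the groups that belong to one syllable. The following is a minimal sketch of how such a prompt could be unpacked and sanity-checked. The helper names `parse_groups`, `tokenize_lyrics`, and `align` are hypothetical and are not taken from any system described here. The sketch also assumes that each Chinese character is one syllable and that `AP` is a single non-lyric marker aligned with the `rest` note.

```python
import re


def parse_groups(seq: str) -> list[list[str]]:
    """Split a '|'-separated sequence into per-syllable groups of tokens."""
    return [group.split() for group in seq.split("|")]


def tokenize_lyrics(text: str) -> list[str]:
    """Assumption: a run of Latin letters (e.g. 'AP') is one marker token,
    and every other non-space character is one syllable."""
    return re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)


def align(text: str, notes: str, durations: str):
    """Pair each syllable with its notes and durations, checking that counts agree."""
    syllables = tokenize_lyrics(text)
    note_groups = parse_groups(notes)
    dur_groups = [[float(d) for d in g] for g in parse_groups(durations)]

    if not (len(syllables) == len(note_groups) == len(dur_groups)):
        raise ValueError("lyrics, notes, and durations have different group counts")
    for n, d in zip(note_groups, dur_groups):
        if len(n) != len(d):
            raise ValueError("a syllable has a different number of notes and durations")
    return list(zip(syllables, note_groups, dur_groups))


# Values copied from the Text-To-Sing prompt in the table.
text = "小酒窝长睫毛AP是你最美的记号"
notes = ("C#4/Db4 | F#4/Gb4 | G#4/Ab4 | A#4/Bb4 F#4/Gb4 | F#4/Gb4 C#4/Db4 | "
         "C#4/Db4 | rest | C#4/Db4 | A#4/Bb4 | G#4/Ab4 | A#4/Bb4 | G#4/Ab4 | "
         "F4 | C#4/Db4")
durations = ("0.407140 | 0.376190 | 0.242180 | 0.509550 0.183420 | "
             "0.315400 0.235020 | 0.361660 | 0.223070 | 0.377270 | 0.340550 | "
             "0.299620 | 0.344510 | 0.283770 | 0.323390 | 0.360340")

aligned = align(text, notes, durations)
for syllable, syllable_notes, syllable_durations in aligned:
    print(syllable, syllable_notes, syllable_durations)
```

Under these assumptions the example yields 14 aligned groups, two of which carry two notes each; a mismatch in group counts raises an error instead of silently misaligning lyrics and melody.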


| Task Name | Prompt | Inputs | Outputs |
| --- | --- | --- | --- |
| Text-To-Audio | Generate an audio of a piano playing. | / | (not shown) |
| Audio Inpainting | I want to inpaint this audio. | (not shown) | (not shown) |
| Image-To-Audio | Generate an audio of this image. | (not shown) | (not shown) |
| Sound Detection | What events does this audio include? | (not shown) | (not shown) |
| Target Sound Detection | Please help me detect the target sound in the audio based on description: "I want to detect thunder event". | (not shown) | The thunder happened in this audio from 0.0 to 9.984 seconds. |
| Sound Extraction | Extract the thunder event from this audio. | (not shown) | (not shown) |
| Audio-To-Text | Give me the description of this audio. | (not shown) | The audio is recording of a goat bleating nearby several times. |


| Task Name | Prompt | Inputs | Outputs |
| --- | --- | --- | --- |
| Talking Head Synthesis | Generate a talking human portrait video. | (not shown) | (not shown) |