If you prefer to manage voices on your own, you can use your own audio file as a reference for the voice clone.
The text to be converted to speech.
The audio of the voice prompt to clone. This can be the URL of a publicly accessible audio file or a Base64-encoded byte string.
The audio file should be between 3 and 30 seconds long (quality does not improve with more than 30 seconds of reference audio). It can be in any audio format, as long as the file is smaller than 50 MB.
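The constraints above can be checked locally before a request is sent. The following is a minimal sketch, not part of any official client: the function names are hypothetical, "50 MB" is read conservatively as 50,000,000 bytes, and the duration helper only handles WAV input because it relies on Python's standard `wave` module.

```python
import base64
import io
import wave

# Assumption: "less than 50 MB" is read conservatively as 50,000,000 bytes.
MAX_BYTES = 50_000_000
MIN_SECONDS = 3.0
MAX_SECONDS = 30.0


def wav_duration_seconds(data: bytes) -> float:
    """Return the duration of WAV audio held in memory (WAV input only)."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def encode_voice_prompt(data: bytes, duration_seconds: float) -> str:
    """Validate reference audio and return it as a Base64 string.

    Hypothetical helper: the size and duration limits come from the
    parameter description; the function itself is illustrative.
    """
    if len(data) >= MAX_BYTES:
        raise ValueError("reference audio must be smaller than 50 MB")
    if not MIN_SECONDS <= duration_seconds <= MAX_SECONDS:
        raise ValueError("reference audio should be 3 to 30 seconds long")
    return base64.b64encode(data).decode("ascii")
```

For other audio formats, the duration would have to come from a separate audio library or from metadata you already have.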
Language code used to specify the language/accent for the model; see supported languages. If not specified, the language is auto-detected.
Assuming all other properties are unchanged, a fixed seed should always generate the exact same audio file.
Output audio format. Must be one of the following:
mp3_44100_192 - MP3 with 44.1kHz sample rate at 192kbps
mp3_44100_128 - MP3 with 44.1kHz sample rate at 128kbps
mp3_44100_96 - MP3 with 44.1kHz sample rate at 96kbps
mp3_44100_64 - MP3 with 44.1kHz sample rate at 64kbps
mp3_44100_32 - MP3 with 44.1kHz sample rate at 32kbps
mp3_22050_32 - MP3 with 22.05kHz sample rate at 32kbps
wav_44100 - WAV with 44.1kHz sample rate
wav_24000 - WAV with 24kHz sample rate
wav_22050 - WAV with 22.05kHz sample rate
wav_16000 - WAV with 16kHz sample rate
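The format identifiers follow a `codec_samplerate[_bitrate]` pattern, so a client can validate one and split it into its parts before sending a request. The sketch below assumes the ten values listed above are the complete set; `OutputFormat` and `parse_output_format` are hypothetical names used for illustration only.

```python
from typing import NamedTuple, Optional

# Assumption: the identifiers listed in this section are the complete set.
SUPPORTED_FORMATS = frozenset({
    "mp3_44100_192", "mp3_44100_128", "mp3_44100_96", "mp3_44100_64",
    "mp3_44100_32", "mp3_22050_32",
    "wav_44100", "wav_24000", "wav_22050", "wav_16000",
})


class OutputFormat(NamedTuple):
    codec: str                    # "mp3" or "wav"
    sample_rate_hz: int           # e.g. 44100
    bitrate_kbps: Optional[int]   # None for WAV, which lists no bitrate


def parse_output_format(name: str) -> OutputFormat:
    """Validate a format identifier and split it into its components."""
    if name not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported output format: {name!r}")
    codec, *numbers = name.split("_")
    sample_rate = int(numbers[0])
    bitrate = int(numbers[1]) if len(numbers) > 1 else None
    return OutputFormat(codec, sample_rate, bitrate)
```

Rejecting unknown identifiers on the client side surfaces a typo such as `mp3_48000_192` before a request is made.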