Text to Speech (via prompt audio)

POST

If you prefer to manage voices yourself, this endpoint lets you supply your own audio file as the reference for the voice clone.

Request

This endpoint expects an object.
text (string, required)

The text to be converted to speech.

prompt_audio (string, required)

The audio of the voice prompt to clone. This can be either the URL of a publicly accessible audio file or the audio bytes as a base64-encoded string.

The audio should be between 3 and 30 seconds long (reference audio longer than 30 seconds does not improve quality). Any audio format is accepted, provided the file is smaller than 50 MB.
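To send a local file instead of a URL, it has to be base64-encoded first. The Python sketch below shows one way to do that while checking the documented limits. The helper names are hypothetical, the 50 MB limit is treated conservatively as 50 × 10^6 bytes (the unit is an assumption), and the clip duration is passed in by the caller because measuring it depends on the audio format.

```python
import base64
from pathlib import Path

# Limits taken from the prompt_audio description; MB assumed to mean 10**6 bytes (conservative).
MAX_PROMPT_BYTES = 50 * 1000 * 1000
MIN_SECONDS, MAX_SECONDS = 3.0, 30.0


def encode_prompt_audio(path: str) -> str:
    """Read a local audio file and return its bytes as a base64-encoded string."""
    data = Path(path).read_bytes()
    if len(data) >= MAX_PROMPT_BYTES:
        raise ValueError(f"prompt audio is {len(data)} bytes; it must be under 50 MB")
    return base64.b64encode(data).decode("ascii")


def check_duration(seconds: float) -> None:
    """Reject reference clips outside the documented 3 to 30 second range."""
    if not (MIN_SECONDS <= seconds <= MAX_SECONDS):
        raise ValueError(f"prompt audio lasts {seconds:.1f}s; use 3 to 30 seconds")
```

The resulting string can be passed as prompt_audio in place of a URL.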

language_code (string, optional)

Language code that specifies the language and accent for the model (see supported languages). If omitted, the language is auto-detected.

seed (integer, optional)

Seed for generation. With all other parameters unchanged, a fixed seed should produce the exact same audio file every time.
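The fields so far can be assembled into a request body. This is a minimal Python sketch, assuming the object is sent as JSON; `build_payload` is a hypothetical helper, not part of any official SDK. To be safe, optional fields are left out entirely when unset rather than sent as null, so an unspecified language_code falls back to auto-detection.

```python
from typing import Optional


def build_payload(
    text: str,
    prompt_audio: str,
    language_code: Optional[str] = None,
    seed: Optional[int] = None,
) -> dict:
    """Assemble the request object; optional fields are omitted when not given."""
    payload = {"text": text, "prompt_audio": prompt_audio}
    if language_code is not None:  # omitted -> language is auto-detected
        payload["language_code"] = language_code
    if seed is not None:  # fixed seed -> repeatable output for identical inputs
        payload["seed"] = seed
    return payload
```

Reusing the same payload, including seed, for two calls should yield identical audio according to the seed description above.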

output_format (string, optional, defaults to mp3_44100_192)

Output audio format. Must be one of the following:

  • mp3_44100_192 - MP3 with 44.1kHz sample rate at 192kbps
  • mp3_44100_128 - MP3 with 44.1kHz sample rate at 128kbps
  • mp3_44100_96 - MP3 with 44.1kHz sample rate at 96kbps
  • mp3_44100_64 - MP3 with 44.1kHz sample rate at 64kbps
  • mp3_44100_32 - MP3 with 44.1kHz sample rate at 32kbps
  • mp3_22050_32 - MP3 with 22.05kHz sample rate at 32kbps
  • wav_44100 - WAV with 44.1kHz sample rate
  • wav_24000 - WAV with 24kHz sample rate
  • wav_22050 - WAV with 22.05kHz sample rate
  • wav_16000 - WAV with 16kHz sample rate
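Each output_format value encodes the container, the sample rate in Hz, and (for MP3) the bitrate in kbps. The Python sketch below validates a value against the list above and splits it into those parts; `parse_output_format` and `AudioFormat` are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional

# The ten values listed above.
OUTPUT_FORMATS = {
    "mp3_44100_192", "mp3_44100_128", "mp3_44100_96", "mp3_44100_64",
    "mp3_44100_32", "mp3_22050_32",
    "wav_44100", "wav_24000", "wav_22050", "wav_16000",
}


class AudioFormat(NamedTuple):
    container: str                # "mp3" or "wav"
    sample_rate_hz: int           # e.g. 44100
    bitrate_kbps: Optional[int]   # set for MP3 only; None for WAV


def parse_output_format(name: str) -> AudioFormat:
    """Validate an output_format value and split it into its components."""
    if name not in OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {name!r}")
    parts = name.split("_")
    bitrate = int(parts[2]) if len(parts) == 3 else None
    return AudioFormat(parts[0], int(parts[1]), bitrate)
```

For example, `parse_output_format("mp3_22050_32")` returns `AudioFormat("mp3", 22050, 32)`, and the container can be used to pick a file extension for the saved response.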

Response

This endpoint returns the generated audio as a file, encoded in the requested output_format.
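Putting it together, a client sends a POST request with the object as its body and writes the returned bytes to disk. This Python sketch uses only the standard library. The endpoint URL and any authentication headers are not given on this page, so both are caller-supplied placeholders here, and a JSON body is an assumption.

```python
import json
import urllib.request
from typing import Dict, Optional


def build_request(
    url: str, payload: dict, headers: Optional[Dict[str, str]] = None
) -> urllib.request.Request:
    """Create a POST request with a JSON body; url and auth headers are placeholders."""
    all_headers = {"Content-Type": "application/json"}
    all_headers.update(headers or {})
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers=all_headers,
        method="POST",
    )


def synthesize_to_file(
    url: str,
    payload: dict,
    out_path: str,
    headers: Optional[Dict[str, str]] = None,
    timeout: float = 120.0,
) -> None:
    """Send the request and save the returned audio bytes to out_path.

    urlopen raises urllib.error.HTTPError for non-2xx responses.
    """
    req = build_request(url, payload, headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Choose the extension of out_path to match output_format (for example `.mp3` for the default mp3_44100_192).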