Text to Speech (via prompt audio)


If you prefer to manage voices on your own, you can use your own audio file as a reference for the voice clone.


This endpoint expects an object.

The text to be converted to speech.


The audio of the voice prompt to clone. This can be the URL of a publicly accessible audio file or a base64-encoded byte string.

The audio file should have a duration ranging from 3 to 30 seconds (quality does not improve with more than 30 seconds of reference audio). It can be in any audio format, as long as it is less than 50 MB.
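As a rough illustration of the two accepted forms, the sketch below passes a URL through unchanged and base64-encodes a local file after checking the 50 MB limit. The helper name `prepare_prompt_audio` is hypothetical, and reading "50 MB" as 50 × 1024 × 1024 bytes is an assumption; the 3 to 30 second duration is not checked here because that would require decoding the audio.

```python
import base64
from pathlib import Path

# Assumption: "less than 50 MB" is read as 50 * 1024 * 1024 bytes.
MAX_PROMPT_BYTES = 50 * 1024 * 1024


def prepare_prompt_audio(source: str) -> str:
    """Return a URL unchanged, or the base64 text of a local audio file.

    Hypothetical helper; it is not part of any official client.
    """
    # A publicly accessible URL can be sent as-is.
    if source.startswith(("http://", "https://")):
        return source

    # Otherwise treat the value as a local path and read the raw bytes.
    data = Path(source).read_bytes()
    if len(data) >= MAX_PROMPT_BYTES:
        raise ValueError("prompt audio must be less than 50 MB")

    # base64 output is plain ASCII, so it can be placed in a JSON string.
    return base64.b64encode(data).decode("ascii")
```

Checking the size locally avoids uploading a file that would be rejected anyway.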


Language code used to specify the language/accent for the model (see supported languages). If not specified, the language is auto-detected.


Provided all other properties are unchanged, a fixed seed should always generate exactly the same audio file.

output_format (string, optional). Defaults to mp3_44100_192.

Output audio format. Must be one of the following:

  • mp3_44100_192 - MP3 with 44.1kHz sample rate at 192kbps
  • mp3_44100_128 - MP3 with 44.1kHz sample rate at 128kbps
  • mp3_44100_96 - MP3 with 44.1kHz sample rate at 96kbps
  • mp3_44100_64 - MP3 with 44.1kHz sample rate at 64kbps
  • mp3_44100_32 - MP3 with 44.1kHz sample rate at 32kbps
  • mp3_22050_32 - MP3 with 22.05kHz sample rate at 32kbps
  • wav_44100 - WAV with 44.1kHz sample rate
  • wav_24000 - WAV with 24kHz sample rate
  • wav_22050 - WAV with 22.05kHz sample rate
  • wav_16000 - WAV with 16kHz sample rate
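The format names above follow a `codec_samplerate[_bitrate]` pattern, so a client can validate a value and split it into its parts before sending a request. The following is a minimal sketch; `parse_output_format` and the keys of the returned dictionary are hypothetical names chosen for illustration.

```python
# Values copied from the list above.
SUPPORTED_OUTPUT_FORMATS = (
    "mp3_44100_192", "mp3_44100_128", "mp3_44100_96", "mp3_44100_64",
    "mp3_44100_32", "mp3_22050_32",
    "wav_44100", "wav_24000", "wav_22050", "wav_16000",
)


def parse_output_format(fmt: str = "mp3_44100_192") -> dict:
    """Validate an output_format value and split it into its components."""
    if fmt not in SUPPORTED_OUTPUT_FORMATS:
        raise ValueError(f"unsupported output_format: {fmt!r}")

    parts = fmt.split("_")
    return {
        "codec": parts[0],                # "mp3" or "wav"
        "sample_rate_hz": int(parts[1]),  # e.g. 44100
        # Only the MP3 formats carry a bitrate component.
        "bitrate_kbps": int(parts[2]) if len(parts) == 3 else None,
    }
```

The codec part can also serve as the file extension when saving the returned audio.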


This endpoint returns a file.
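To show how the properties described above might fit together, here is a sketch that assembles a request object and picks a file name for the returned audio. Only `output_format` is a field name taken from this page; `text`, `prompt_audio`, `language`, and `seed` are assumed names, as are the helper functions, and no HTTP call is made because the endpoint URL and authentication are not given here.

```python
from typing import Optional


def build_request_body(
    text: str,
    prompt_audio: str,
    language: Optional[str] = None,
    seed: Optional[int] = None,
    output_format: str = "mp3_44100_192",
) -> dict:
    """Assemble the request object; all keys except output_format are assumed names."""
    body = {
        "text": text,                  # assumed field name
        "prompt_audio": prompt_audio,  # assumed field name; URL or base64 string
        "output_format": output_format,
    }
    # Leave language out so the service can auto-detect it.
    if language is not None:
        body["language"] = language    # assumed field name
    # Send a seed only when reproducible output is wanted.
    if seed is not None:
        body["seed"] = seed            # assumed field name
    return body


def output_filename(stem: str, output_format: str) -> str:
    """Derive a file name whose extension matches the requested codec."""
    return f"{stem}.{output_format.split('_')[0]}"
```

For example, `output_filename("greeting", "wav_24000")` gives `greeting.wav`, which can be used when writing the returned file to disk.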