title: Streaming
subtitle: >-
  Learn how to stream real-time audio from the ElevenLabs API using chunked
  transfer encoding
The ElevenLabs API supports real-time audio streaming for select endpoints, returning raw audio bytes (e.g., MP3 data) directly over HTTP using chunked transfer encoding. This allows clients to process or play audio incrementally as it is generated.
Our official Node and Python libraries include utilities to simplify handling this continuous audio stream.
Streaming is supported for the Text to Speech, Voice Changer, and Audio Isolation APIs. This section focuses on how streaming works for requests made to the Text to Speech API.
In Python, a streaming request looks like:
from elevenlabs import stream
from elevenlabs.client import ElevenLabs
elevenlabs = ElevenLabs()
audio_stream = elevenlabs.text_to_speech.stream(
text="This is a test",
voice_id="JBFqnCBsd6RMkjVDRZzb",
model_id="eleven_multilingual_v2"
)
# option 1: play the streamed audio locally
stream(audio_stream)
# option 2: process the audio bytes manually
for chunk in audio_stream:
if isinstance(chunk, bytes):
print(chunk)
In Node / Typescript, a streaming request looks like:
import { ElevenLabsClient, stream } from '@elevenlabs/elevenlabs-js';
import { Readable } from 'stream';
const elevenlabs = new ElevenLabsClient();
async function main() {
const audioStream = await elevenlabs.textToSpeech.stream('JBFqnCBsd6RMkjVDRZzb', {
text: 'This is a test',
modelId: 'eleven_multilingual_v2',
});
// option 1: play the streamed audio locally
await stream(Readable.from(audioStream));
// option 2: process the audio manually
for await (const chunk of audioStream) {
console.log(chunk);
}
}
main();
Stream speech
POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream
Content-Type: application/json
Converts text into speech using a voice of your choice and returns the audio as a stream of raw bytes.
Reference: https://elevenlabs.io/docs/api-reference/text-to-speech/stream
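For illustration, here is a minimal sketch of calling this endpoint directly over HTTP with Python's requests library; the request parameters mirror the specification below, while the YOUR_API_KEY placeholder and the output.mp3 path are assumptions of this sketch:

import requests

VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"
url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream"

response = requests.post(
    url,
    params={"output_format": "mp3_44100_128"},
    headers={"xi-api-key": "YOUR_API_KEY", "Content-Type": "application/json"},
    json={
        "text": "The first move is what sets everything in motion.",
        "model_id": "eleven_multilingual_v2",
    },
    stream=True,  # do not buffer the whole response in memory
)
response.raise_for_status()

# The body arrives via chunked transfer encoding; consume it incrementally.
with open("output.mp3", "wb") as f:
    for chunk in response.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)

Because the audio arrives incrementally, playback or further processing can begin before the full generation has finished.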
OpenAPI Specification
openapi: 3.1.0
info:
title: api
version: 1.0.0
paths:
/v1/text-to-speech/{voice_id}/stream:
post:
operationId: stream
summary: Stream speech
description: >-
Converts text into speech using a voice of your choice and returns audio
as an audio stream.
tags:
- subpackage_textToSpeech
parameters:
- name: voice_id
in: path
description: >-
ID of the voice to be used. Use the [Get
            voices](/docs/api-reference/voices/search) endpoint to list all the
available voices.
required: true
schema:
type: string
- name: enable_logging
in: query
description: >-
When enable_logging is set to false zero retention mode will be used
for the request. This will mean history features are unavailable for
this request, including request stitching. Zero retention mode may
only be used by enterprise customers.
required: false
schema:
type: boolean
default: true
- name: optimize_streaming_latency
in: query
description: >-
You can turn on latency optimizations at some cost of quality. The
best possible final latency varies by model. Possible values:
0 - default mode (no latency optimizations)
1 - normal latency optimizations (about 50% of possible latency
improvement of option 3)
2 - strong latency optimizations (about 75% of possible latency
improvement of option 3)
3 - max latency optimizations
4 - max latency optimizations, but also with text normalizer turned
off for even more latency savings (best latency, but can
            mispronounce e.g. numbers and dates).
Defaults to None.
required: false
schema:
type: integer
- name: output_format
in: query
description: >-
Output format of the generated audio. Formatted as
codec_sample_rate_bitrate. So an mp3 with 22.05kHz sample rate at
            32kbps is represented as mp3_22050_32. MP3 with 192kbps bitrate
requires you to be subscribed to Creator tier or above. PCM with
44.1kHz sample rate requires you to be subscribed to Pro tier or
above. Note that the μ-law format (sometimes written mu-law, often
approximated as u-law) is commonly used for Twilio audio inputs.
required: false
schema:
$ref: >-
#/components/schemas/type_textToSpeech:TextToSpeechStreamRequestOutputFormat
- name: xi-api-key
in: header
required: false
schema:
type: string
responses:
'200':
description: Streaming audio data
content:
application/octet-stream:
schema:
type: string
format: binary
'422':
description: Validation Error
content:
application/json:
schema:
$ref: '#/components/schemas/type_:HTTPValidationError'
requestBody:
content:
application/json:
schema:
type: object
properties:
text:
type: string
description: The text that will get converted into speech.
model_id:
type: string
default: eleven_multilingual_v2
description: >-
                    Identifier of the model that will be used. You can query
                    available models using GET /v1/models. The model needs to
                    support text to speech, which you can check using the
                    can_do_text_to_speech property.
language_code:
type: string
description: >-
Language code (ISO 639-1) used to enforce a language for the
model and text normalization. If the model does not support
                    the provided language code, an error will be returned.
voice_settings:
$ref: '#/components/schemas/type_:VoiceSettings'
description: >-
Voice settings overriding stored settings for the given
voice. They are applied only on the given request.
pronunciation_dictionary_locators:
type: array
items:
$ref: >-
#/components/schemas/type_:PronunciationDictionaryVersionLocator
description: >-
A list of pronunciation dictionary locators (id, version_id)
to be applied to the text. They will be applied in order.
                    You may have up to 3 locators per request.
seed:
type: integer
description: >-
If specified, our system will make a best effort to sample
deterministically, such that repeated requests with the same
seed and parameters should return the same result.
                    Determinism is not guaranteed. Must be an integer between 0 and
4294967295.
previous_text:
type: string
description: >-
The text that came before the text of the current request.
Can be used to improve the speech's continuity when
concatenating together multiple generations or to influence
the speech's continuity in the current generation.
next_text:
type: string
description: >-
The text that comes after the text of the current request.
Can be used to improve the speech's continuity when
concatenating together multiple generations or to influence
the speech's continuity in the current generation.
previous_request_ids:
type: array
items:
type: string
description: >-
A list of request_id of the samples that were generated
before this generation. Can be used to improve the speech's
continuity when splitting up a large task into multiple
requests. The results will be best when the same model is
used across the generations. In case both previous_text and
                    previous_request_ids are sent, previous_text will be
                    ignored. A maximum of 3 request_ids can be sent.
next_request_ids:
type: array
items:
type: string
description: >-
A list of request_id of the samples that come after this
generation. next_request_ids is especially useful for
maintaining the speech's continuity when regenerating a
sample that has had some audio quality issues. For example,
if you have generated 3 speech clips, and you want to
improve clip 2, passing the request id of clip 3 as a
next_request_id (and that of clip 1 as a
previous_request_id) will help maintain natural flow in the
combined speech. The results will be best when the same
model is used across the generations. In case both next_text
                    and next_request_ids are sent, next_text will be ignored. A
                    maximum of 3 request_ids can be sent.
use_pvc_as_ivc:
type: boolean
default: false
description: >-
                    If true, the IVC version of the voice will be used instead
                    of the PVC version for this generation. This is a temporary
                    workaround for higher latency in PVC versions.
apply_text_normalization:
$ref: >-
#/components/schemas/type_textToSpeech:BodyTextToSpeechStreamApplyTextNormalization
description: >-
This parameter controls text normalization with three modes:
'auto', 'on', and 'off'. When set to 'auto', the system will
automatically decide whether to apply text normalization
(e.g., spelling out numbers). With 'on', text normalization
will always be applied, while with 'off', it will be
skipped.
apply_language_text_normalization:
type: boolean
default: false
description: >-
This parameter controls language text normalization. This
helps with proper pronunciation of text in some supported
languages. WARNING: This parameter can heavily increase the
latency of the request. Currently only supported for
Japanese.
required:
- text
servers:
- url: https://api.elevenlabs.io
- url: https://api.us.elevenlabs.io
- url: https://api.eu.residency.elevenlabs.io
- url: https://api.in.residency.elevenlabs.io
components:
schemas:
type_textToSpeech:TextToSpeechStreamRequestOutputFormat:
type: string
enum:
- mp3_22050_32
- mp3_24000_48
- mp3_44100_32
- mp3_44100_64
- mp3_44100_96
- mp3_44100_128
- mp3_44100_192
- pcm_8000
- pcm_16000
- pcm_22050
- pcm_24000
- pcm_32000
- pcm_44100
- pcm_48000
- ulaw_8000
- alaw_8000
- opus_48000_32
- opus_48000_64
- opus_48000_96
- opus_48000_128
- opus_48000_192
default: mp3_44100_128
description: >-
Output format of the generated audio. Formatted as
        codec_sample_rate_bitrate. So an mp3 with 22.05kHz sample rate at 32kbps
is represented as mp3_22050_32. MP3 with 192kbps bitrate requires you to
be subscribed to Creator tier or above. PCM with 44.1kHz sample rate
requires you to be subscribed to Pro tier or above. Note that the μ-law
format (sometimes written mu-law, often approximated as u-law) is
commonly used for Twilio audio inputs.
title: TextToSpeechStreamRequestOutputFormat
type_:VoiceSettings:
type: object
properties:
stability:
type: number
format: double
description: >-
Determines how stable the voice is and the randomness between each
generation. Lower values introduce broader emotional range for the
voice. Higher values can result in a monotonous voice with limited
emotion.
use_speaker_boost:
type: boolean
description: >-
This setting boosts the similarity to the original speaker. Using
this setting requires a slightly higher computational load, which in
turn increases latency.
similarity_boost:
type: number
format: double
description: >-
Determines how closely the AI should adhere to the original voice
when attempting to replicate it.
style:
type: number
format: double
description: >-
Determines the style exaggeration of the voice. This setting
attempts to amplify the style of the original speaker. It does
consume additional computational resources and might increase
latency if set to anything other than 0.
speed:
type: number
format: double
description: >-
Adjusts the speed of the voice. A value of 1.0 is the default speed,
while values less than 1.0 slow down the speech, and values greater
than 1.0 speed it up.
title: VoiceSettings
type_:PronunciationDictionaryVersionLocator:
type: object
properties:
pronunciation_dictionary_id:
type: string
description: The ID of the pronunciation dictionary.
version_id:
type: string
description: >-
The ID of the version of the pronunciation dictionary. If not
provided, the latest version will be used.
required:
- pronunciation_dictionary_id
title: PronunciationDictionaryVersionLocator
type_textToSpeech:BodyTextToSpeechStreamApplyTextNormalization:
type: string
enum:
- auto
- 'on'
- 'off'
default: auto
description: >-
This parameter controls text normalization with three modes: 'auto',
'on', and 'off'. When set to 'auto', the system will automatically
decide whether to apply text normalization (e.g., spelling out numbers).
With 'on', text normalization will always be applied, while with 'off',
it will be skipped.
title: BodyTextToSpeechStreamApplyTextNormalization
type_:ValidationErrorLocItem:
oneOf:
- type: string
- type: integer
title: ValidationErrorLocItem
type_:ValidationError:
type: object
properties:
loc:
type: array
items:
$ref: '#/components/schemas/type_:ValidationErrorLocItem'
msg:
type: string
type:
type: string
required:
- loc
- msg
- type
title: ValidationError
type_:HTTPValidationError:
type: object
properties:
detail:
type: array
items:
$ref: '#/components/schemas/type_:ValidationError'
title: HTTPValidationError
SDK Code Examples
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
async function main() {
const client = new ElevenLabsClient();
await client.textToSpeech.stream("JBFqnCBsd6RMkjVDRZzb", {
outputFormat: "mp3_44100_128",
text: "The first move is what sets everything in motion.",
modelId: "eleven_multilingual_v2",
});
}
main();
from elevenlabs import ElevenLabs
client = ElevenLabs()
client.text_to_speech.stream(
voice_id="JBFqnCBsd6RMkjVDRZzb",
output_format="mp3_44100_128",
text="The first move is what sets everything in motion.",
model_id="eleven_multilingual_v2",
)
package main
import (
"fmt"
"strings"
"net/http"
"io"
)
func main() {
url := "https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream?output_format=mp3_44100_128"
payload := strings.NewReader("{
\"text\": \"The first move is what sets everything in motion.\",
\"model_id\": \"eleven_multilingual_v2\"
}")
req, _ := http.NewRequest("POST", url, payload)
req.Header.Add("Content-Type", "application/json")
res, _ := http.DefaultClient.Do(req)
defer res.Body.Close()
body, _ := io.ReadAll(res.Body)
fmt.Println(res)
fmt.Println(string(body))
}
require 'uri'
require 'net/http'
url = URI("https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream?output_format=mp3_44100_128")
http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true
request = Net::HTTP::Post.new(url)
request["Content-Type"] = 'application/json'
request.body = "{
\"text\": \"The first move is what sets everything in motion.\",
\"model_id\": \"eleven_multilingual_v2\"
}"
response = http.request(request)
puts response.read_body
import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.Unirest;
HttpResponse<String> response = Unirest.post("https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream?output_format=mp3_44100_128")
.header("Content-Type", "application/json")
.body("{
\"text\": \"The first move is what sets everything in motion.\",
\"model_id\": \"eleven_multilingual_v2\"
}")
.asString();
<?php
require_once('vendor/autoload.php');
$client = new \GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream?output_format=mp3_44100_128', [
'body' => '{
"text": "The first move is what sets everything in motion.",
"model_id": "eleven_multilingual_v2"
}',
  'headers' => [
    'Content-Type' => 'application/json',
    'xi-api-key' => 'YOUR_API_KEY',
  ],
]);
echo $response->getBody();
using RestSharp;
var client = new RestClient("https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream?output_format=mp3_44100_128");
var request = new RestRequest(Method.POST);
request.AddHeader("Content-Type", "application/json");
request.AddParameter("application/json", "{
\"text\": \"The first move is what sets everything in motion.\",
\"model_id\": \"eleven_multilingual_v2\"
}", ParameterType.RequestBody);
IRestResponse response = client.Execute(request);
import Foundation
let headers = ["Content-Type": "application/json"]
let parameters = [
"text": "The first move is what sets everything in motion.",
"model_id": "eleven_multilingual_v2"
] as [String : Any]
let postData = try! JSONSerialization.data(withJSONObject: parameters, options: [])
let request = NSMutableURLRequest(url: NSURL(string: "https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream?output_format=mp3_44100_128")! as URL,
cachePolicy: .useProtocolCachePolicy,
timeoutInterval: 10.0)
request.httpMethod = "POST"
request.allHTTPHeaderFields = headers
request.httpBody = postData as Data
let session = URLSession.shared
let dataTask = session.dataTask(with: request as URLRequest, completionHandler: { (data, response, error) -> Void in
if (error != nil) {
print(error as Any)
} else {
let httpResponse = response as? HTTPURLResponse
print(httpResponse)
}
})
dataTask.resume()
Stream speech with timing
POST https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream/with-timestamps
Content-Type: application/json
Converts text into speech using a voice of your choice and returns a stream of JSON objects, each containing audio as a base64-encoded string together with timing information on when each character was spoken.
Reference: https://elevenlabs.io/docs/api-reference/text-to-speech/stream-with-timestamps
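As a sketch of consuming this endpoint with the Python SDK, assuming the yielded chunks match the StreamingAudioChunkWithTimestampsResponse schema shown below (audio_base64 plus optional alignment):

import base64
from elevenlabs.client import ElevenLabs

client = ElevenLabs()  # reads the API key from the environment

audio = bytearray()
for chunk in client.text_to_speech.stream_with_timestamps(
    voice_id="JBFqnCBsd6RMkjVDRZzb",
    text="The first move is what sets everything in motion.",
    model_id="eleven_multilingual_v2",
):
    # Each chunk carries base64-encoded audio plus per-character timing.
    audio.extend(base64.b64decode(chunk.audio_base64))
    if chunk.alignment is not None:
        for char, start in zip(
            chunk.alignment.characters,
            chunk.alignment.character_start_times_seconds,
        ):
            print(f"{char!r} starts at {start:.3f}s")

with open("output.mp3", "wb") as f:
    f.write(bytes(audio))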
OpenAPI Specification
openapi: 3.1.0
info:
title: api
version: 1.0.0
paths:
/v1/text-to-speech/{voice_id}/stream/with-timestamps:
post:
operationId: stream-with-timestamps
summary: Stream speech with timing
description: >-
Converts text into speech using a voice of your choice and returns a
stream of JSONs containing audio as a base64 encoded string together
        with information on when each character was spoken.
tags:
- subpackage_textToSpeech
parameters:
- name: voice_id
in: path
description: >-
ID of the voice to be used. Use the [Get
            voices](/docs/api-reference/voices/search) endpoint to list all the
available voices.
required: true
schema:
type: string
- name: enable_logging
in: query
description: >-
When enable_logging is set to false zero retention mode will be used
for the request. This will mean history features are unavailable for
this request, including request stitching. Zero retention mode may
only be used by enterprise customers.
required: false
schema:
type: boolean
default: true
- name: optimize_streaming_latency
in: query
description: >-
You can turn on latency optimizations at some cost of quality. The
best possible final latency varies by model. Possible values:
0 - default mode (no latency optimizations)
1 - normal latency optimizations (about 50% of possible latency
improvement of option 3)
2 - strong latency optimizations (about 75% of possible latency
improvement of option 3)
3 - max latency optimizations
4 - max latency optimizations, but also with text normalizer turned
off for even more latency savings (best latency, but can
            mispronounce e.g. numbers and dates).
Defaults to None.
required: false
schema:
type: integer
- name: output_format
in: query
description: >-
Output format of the generated audio. Formatted as
codec_sample_rate_bitrate. So an mp3 with 22.05kHz sample rate at
            32kbps is represented as mp3_22050_32. MP3 with 192kbps bitrate
requires you to be subscribed to Creator tier or above. PCM with
44.1kHz sample rate requires you to be subscribed to Pro tier or
above. Note that the μ-law format (sometimes written mu-law, often
approximated as u-law) is commonly used for Twilio audio inputs.
required: false
schema:
$ref: >-
#/components/schemas/type_textToSpeech:TextToSpeechStreamWithTimestampsRequestOutputFormat
- name: xi-api-key
in: header
required: false
schema:
type: string
responses:
'200':
          description: Stream of audio chunks with timing information
content:
text/event-stream:
schema:
$ref: >-
#/components/schemas/type_:StreamingAudioChunkWithTimestampsResponse
'422':
description: Validation Error
content:
application/json:
schema:
$ref: '#/components/schemas/type_:HTTPValidationError'
requestBody:
content:
application/json:
schema:
type: object
properties:
text:
type: string
description: The text that will get converted into speech.
model_id:
type: string
default: eleven_multilingual_v2
description: >-
                    Identifier of the model that will be used. You can query
                    available models using GET /v1/models. The model needs to
                    support text to speech, which you can check using the
                    can_do_text_to_speech property.
language_code:
type: string
description: >-
Language code (ISO 639-1) used to enforce a language for the
model and text normalization. If the model does not support
                    the provided language code, an error will be returned.
voice_settings:
$ref: '#/components/schemas/type_:VoiceSettings'
description: >-
Voice settings overriding stored settings for the given
voice. They are applied only on the given request.
pronunciation_dictionary_locators:
type: array
items:
$ref: >-
#/components/schemas/type_:PronunciationDictionaryVersionLocator
description: >-
A list of pronunciation dictionary locators (id, version_id)
to be applied to the text. They will be applied in order.
                    You may have up to 3 locators per request.
seed:
type: integer
description: >-
If specified, our system will make a best effort to sample
deterministically, such that repeated requests with the same
seed and parameters should return the same result.
                    Determinism is not guaranteed. Must be an integer between 0 and
4294967295.
previous_text:
type: string
description: >-
The text that came before the text of the current request.
Can be used to improve the speech's continuity when
concatenating together multiple generations or to influence
the speech's continuity in the current generation.
next_text:
type: string
description: >-
The text that comes after the text of the current request.
Can be used to improve the speech's continuity when
concatenating together multiple generations or to influence
the speech's continuity in the current generation.
previous_request_ids:
type: array
items:
type: string
description: >-
A list of request_id of the samples that were generated
before this generation. Can be used to improve the speech's
continuity when splitting up a large task into multiple
requests. The results will be best when the same model is
used across the generations. In case both previous_text and
                    previous_request_ids are sent, previous_text will be
                    ignored. A maximum of 3 request_ids can be sent.
next_request_ids:
type: array
items:
type: string
description: >-
A list of request_id of the samples that come after this
generation. next_request_ids is especially useful for
maintaining the speech's continuity when regenerating a
sample that has had some audio quality issues. For example,
if you have generated 3 speech clips, and you want to
improve clip 2, passing the request id of clip 3 as a
next_request_id (and that of clip 1 as a
previous_request_id) will help maintain natural flow in the
combined speech. The results will be best when the same
model is used across the generations. In case both next_text
                    and next_request_ids are sent, next_text will be ignored. A
                    maximum of 3 request_ids can be sent.
use_pvc_as_ivc:
type: boolean
default: false
description: >-
                    If true, the IVC version of the voice will be used instead
                    of the PVC version for this generation. This is a temporary
                    workaround for higher latency in PVC versions.
apply_text_normalization:
$ref: >-
#/components/schemas/type_textToSpeech:BodyTextToSpeechStreamWithTimestampsApplyTextNormalization
description: >-
This parameter controls text normalization with three modes:
'auto', 'on', and 'off'. When set to 'auto', the system will
automatically decide whether to apply text normalization
(e.g., spelling out numbers). With 'on', text normalization
will always be applied, while with 'off', it will be
skipped.
apply_language_text_normalization:
type: boolean
default: false
description: >-
This parameter controls language text normalization. This
helps with proper pronunciation of text in some supported
languages. WARNING: This parameter can heavily increase the
latency of the request. Currently only supported for
Japanese.
required:
- text
servers:
- url: https://api.elevenlabs.io
- url: https://api.us.elevenlabs.io
- url: https://api.eu.residency.elevenlabs.io
- url: https://api.in.residency.elevenlabs.io
components:
schemas:
type_textToSpeech:TextToSpeechStreamWithTimestampsRequestOutputFormat:
type: string
enum:
- mp3_22050_32
- mp3_24000_48
- mp3_44100_32
- mp3_44100_64
- mp3_44100_96
- mp3_44100_128
- mp3_44100_192
- pcm_8000
- pcm_16000
- pcm_22050
- pcm_24000
- pcm_32000
- pcm_44100
- pcm_48000
- ulaw_8000
- alaw_8000
- opus_48000_32
- opus_48000_64
- opus_48000_96
- opus_48000_128
- opus_48000_192
default: mp3_44100_128
description: >-
Output format of the generated audio. Formatted as
        codec_sample_rate_bitrate. So an mp3 with 22.05kHz sample rate at 32kbps
is represented as mp3_22050_32. MP3 with 192kbps bitrate requires you to
be subscribed to Creator tier or above. PCM with 44.1kHz sample rate
requires you to be subscribed to Pro tier or above. Note that the μ-law
format (sometimes written mu-law, often approximated as u-law) is
commonly used for Twilio audio inputs.
title: TextToSpeechStreamWithTimestampsRequestOutputFormat
type_:VoiceSettings:
type: object
properties:
stability:
type: number
format: double
description: >-
Determines how stable the voice is and the randomness between each
generation. Lower values introduce broader emotional range for the
voice. Higher values can result in a monotonous voice with limited
emotion.
use_speaker_boost:
type: boolean
description: >-
This setting boosts the similarity to the original speaker. Using
this setting requires a slightly higher computational load, which in
turn increases latency.
similarity_boost:
type: number
format: double
description: >-
Determines how closely the AI should adhere to the original voice
when attempting to replicate it.
style:
type: number
format: double
description: >-
Determines the style exaggeration of the voice. This setting
attempts to amplify the style of the original speaker. It does
consume additional computational resources and might increase
latency if set to anything other than 0.
speed:
type: number
format: double
description: >-
Adjusts the speed of the voice. A value of 1.0 is the default speed,
while values less than 1.0 slow down the speech, and values greater
than 1.0 speed it up.
title: VoiceSettings
type_:PronunciationDictionaryVersionLocator:
type: object
properties:
pronunciation_dictionary_id:
type: string
description: The ID of the pronunciation dictionary.
version_id:
type: string
description: >-
The ID of the version of the pronunciation dictionary. If not
provided, the latest version will be used.
required:
- pronunciation_dictionary_id
title: PronunciationDictionaryVersionLocator
type_textToSpeech:BodyTextToSpeechStreamWithTimestampsApplyTextNormalization:
type: string
enum:
- auto
- 'on'
- 'off'
default: auto
description: >-
This parameter controls text normalization with three modes: 'auto',
'on', and 'off'. When set to 'auto', the system will automatically
decide whether to apply text normalization (e.g., spelling out numbers).
With 'on', text normalization will always be applied, while with 'off',
it will be skipped.
title: BodyTextToSpeechStreamWithTimestampsApplyTextNormalization
type_:CharacterAlignmentResponseModel:
type: object
properties:
characters:
type: array
items:
type: string
character_start_times_seconds:
type: array
items:
type: number
format: double
character_end_times_seconds:
type: array
items:
type: number
format: double
required:
- characters
- character_start_times_seconds
- character_end_times_seconds
title: CharacterAlignmentResponseModel
type_:StreamingAudioChunkWithTimestampsResponse:
type: object
properties:
audio_base64:
type: string
description: Base64 encoded audio data
alignment:
$ref: '#/components/schemas/type_:CharacterAlignmentResponseModel'
description: Timestamp information for each character in the original text
normalized_alignment:
$ref: '#/components/schemas/type_:CharacterAlignmentResponseModel'
description: Timestamp information for each character in the normalized text
required:
- audio_base64
title: StreamingAudioChunkWithTimestampsResponse
type_:ValidationErrorLocItem:
oneOf:
- type: string
- type: integer
title: ValidationErrorLocItem
type_:ValidationError:
type: object
properties:
loc:
type: array
items:
$ref: '#/components/schemas/type_:ValidationErrorLocItem'
msg:
type: string
type:
type: string
required:
- loc
- msg
- type
title: ValidationError
type_:HTTPValidationError:
type: object
properties:
detail:
type: array
items:
$ref: '#/components/schemas/type_:ValidationError'
title: HTTPValidationError
SDK Code Examples
import { ElevenLabsClient } from "@elevenlabs/elevenlabs-js";
async function main() {
const client = new ElevenLabsClient();
await client.textToSpeech.streamWithTimestamps("JBFqnCBsd6RMkjVDRZzb", {
outputFormat: "mp3_44100_128",
text: "The first move is what sets everything in motion.",
modelId: "eleven_multilingual_v2",
});
}
main();
from elevenlabs import ElevenLabs
client = ElevenLabs()
client.text_to_speech.stream_with_timestamps(
voice_id="JBFqnCBsd6RMkjVDRZzb",
output_format="mp3_44100_128",
text="The first move is what sets everything in motion.",
model_id="eleven_multilingual_v2",
)
package main
import (
"fmt"
"strings"
"net/http"
"io"
)
func main() {
url := "https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream/with-timestamps?output_format=mp3_44100_128"
payload := strings.NewReader("{
\"text\": \"The first move is what sets everything in motion.\",
\"model_id\": \"eleven_multilingual_v2\"
}")
req, _ := http.NewRequest("POST", url, payload)
req.Header.Add("Content-Type", "application/json")
res, _ := http.DefaultClient.Do(req)
defer res.Body.Close()
body, _ := io.ReadAll(res.Body)
fmt.Println(res)
fmt.Println(string(body))
}
require 'uri'
require 'net/http'
url = URI("https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream/with-timestamps?output_format=mp3_44100_128")
http = Net::HTTP.new(url.host, url.port)
http.use_ssl = true
request = Net::HTTP::Post.new(url)
request["Content-Type"] = 'application/json'
request.body = "{
\"text\": \"The first move is what sets everything in motion.\",
\"model_id\": \"eleven_multilingual_v2\"
}"
response = http.request(request)
puts response.read_body
import com.mashape.unirest.http.HttpResponse;
import com.mashape.unirest.http.Unirest;
HttpResponse<String> response = Unirest.post("https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream/with-timestamps?output_format=mp3_44100_128")
.header("Content-Type", "application/json")
.body("{
\"text\": \"The first move is what sets everything in motion.\",
\"model_id\": \"eleven_multilingual_v2\"
}")
.asString();
<?php
require_once('vendor/autoload.php');
$client = new \GuzzleHttp\Client();
$response = $client->request('POST', 'https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream/with-timestamps?output_format=mp3_44100_128', [
'body' => '{
"text": "The first move is what sets everything in motion.",
"model_id": "eleven_multilingual_v2"
}',
  'headers' => [
    'Content-Type' => 'application/json',
    'xi-api-key' => 'YOUR_API_KEY',
  ],
]);
echo $response->getBody();
using RestSharp;
var client = new RestClient("https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream/with-timestamps?output_format=mp3_44100_128");
var request = new RestRequest(Method.POST);
request.AddHeader("Content-Type", "application/json");
request.AddParameter("application/json", "{
\"text\": \"The first move is what sets everything in motion.\",
\"model_id\": \"eleven_multilingual_v2\"
}", ParameterType.RequestBody);
IRestResponse response = client.Execute(request);
import Foundation
let headers = ["Content-Type": "application/json"]
let parameters = [
"text": "The first move is what sets everything in motion.",
"model_id": "eleven_multilingual_v2"
] as [String : Any]
let postData = try! JSONSerialization.data(withJSONObject: parameters, options: [])
let request = NSMutableURLRequest(url: NSURL(string: "https://api.elevenlabs.io/v1/text-to-speech/JBFqnCBsd6RMkjVDRZzb/stream/with-timestamps?output_format=mp3_44100_128")! as URL,
cachePolicy: .useProtocolCachePolicy,
timeoutInterval: 10.0)
request.httpMethod = "POST"
request.allHTTPHeaderFields = headers
request.httpBody = postData as Data
let session = URLSession.shared
let dataTask = session.dataTask(with: request as URLRequest, completionHandler: { (data, response, error) -> Void in
if (error != nil) {
print(error as Any)
} else {
let httpResponse = response as? HTTPURLResponse
print(httpResponse)
}
})
dataTask.resume()
WebSocket
GET /v1/text-to-speech/{voice_id}/stream-input
The Text-to-Speech WebSockets API is designed to generate audio from partial text input while ensuring consistency throughout the generated audio. Although highly flexible, the WebSockets API isn't a one-size-fits-all solution. It's well-suited for scenarios where:
- The input text is being streamed or generated in chunks.
- Word-to-audio alignment information is required.
However, it may not be the best choice when:
- The entire input text is available upfront. Given that the generations are partial, some buffering is involved, which could potentially result in slightly higher latency compared to a standard HTTP request.
- You want to quickly experiment or prototype. Working with WebSockets can be harder and more complex than using a standard HTTP API, which might slow down rapid development and testing.
Reference: https://elevenlabs.io/docs/api-reference/text-to-speech/v-1-text-to-speech-voice-id-stream-input
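As an illustrative sketch of the message flow, using the third-party websockets Python package (an assumption of this sketch); the auth field name follows the InitializeConnection schema below, and YOUR_API_KEY is a placeholder:

import asyncio
import base64
import json

import websockets  # third-party package; an assumption of this sketch

VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"
URI = (
    f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input"
    "?model_id=eleven_multilingual_v2"
)

async def main() -> None:
    async with websockets.connect(URI) as ws:
        # InitializeConnection: the first message must be a single space.
        await ws.send(json.dumps({
            "text": " ",
            "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
            "xi-api-key": "YOUR_API_KEY",  # field name per the schema below
        }))
        # SendText: stream text in as it becomes available (note the trailing space).
        await ws.send(json.dumps(
            {"text": "The first move is what sets everything in motion. "}
        ))
        # CloseConnection: an empty string ends the stream.
        await ws.send(json.dumps({"text": ""}))

        # A real application would read concurrently with sending.
        with open("output.mp3", "wb") as f:
            async for message in ws:
                data = json.loads(message)
                if data.get("audio"):
                    f.write(base64.b64decode(data["audio"]))
                if data.get("isFinal"):
                    break

asyncio.run(main())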
AsyncAPI Specification
asyncapi: 2.6.0
info:
title: V 1 Text To Speech Voice Id Stream Input
version: subpackage_v1TextToSpeechVoiceIdStreamInput.v1TextToSpeechVoiceIdStreamInput
description: >-
The Text-to-Speech WebSockets API is designed to generate audio from partial
text input
while ensuring consistency throughout the generated audio. Although highly
flexible,
the WebSockets API isn't a one-size-fits-all solution. It's well-suited for
scenarios where:
* The input text is being streamed or generated in chunks.
* Word-to-audio alignment information is required.
However, it may not be the best choice when:
* The entire input text is available upfront. Given that the generations are partial,
some buffering is involved, which could potentially result in slightly higher latency compared
to a standard HTTP request.
* You want to quickly experiment or prototype. Working with WebSockets can be harder and more
complex than using a standard HTTP API, which might slow down rapid development and testing.
channels:
/v1/text-to-speech/{voice_id}/stream-input:
description: >-
The Text-to-Speech WebSockets API is designed to generate audio from
partial text input
while ensuring consistency throughout the generated audio. Although highly
flexible,
the WebSockets API isn't a one-size-fits-all solution. It's well-suited
for scenarios where:
* The input text is being streamed or generated in chunks.
* Word-to-audio alignment information is required.
However, it may not be the best choice when:
* The entire input text is available upfront. Given that the generations are partial,
some buffering is involved, which could potentially result in slightly higher latency compared
to a standard HTTP request.
* You want to quickly experiment or prototype. Working with WebSockets can be harder and more
complex than using a standard HTTP API, which might slow down rapid development and testing.
parameters:
voice_id:
description: The unique identifier for the voice to use in the TTS process.
schema:
type: string
bindings:
ws:
query:
type: object
properties:
authorization:
type: string
single_use_token:
type: string
model_id:
type: string
language_code:
type: string
enable_logging:
type: boolean
default: true
enable_ssml_parsing:
type: boolean
default: false
output_format:
$ref: '#/components/schemas/type_:TextToSpeechOutputFormatEnum'
inactivity_timeout:
type: integer
default: 20
sync_alignment:
type: boolean
default: false
auto_mode:
type: boolean
default: false
apply_text_normalization:
$ref: >-
#/components/schemas/type_:TextToSpeechApplyTextNormalizationEnum
seed:
type: integer
headers:
type: object
properties:
xi-api-key:
type: string
publish:
operationId: v-1-text-to-speech-voice-id-stream-input-publish
summary: Server message
message:
name: subscribe
payload:
$ref: >-
#/components/schemas/type_v1TextToSpeechVoiceIdStreamInput:receiveMessage
subscribe:
operationId: v-1-text-to-speech-voice-id-stream-input-subscribe
summary: Client message
message:
name: publish
payload:
$ref: >-
#/components/schemas/type_v1TextToSpeechVoiceIdStreamInput:sendMessage
servers:
Production:
url: wss://api.elevenlabs.io/
protocol: wss
x-default: true
Production US:
url: wss://api.us.elevenlabs.io/
protocol: wss
Production EU:
url: wss://api.eu.residency.elevenlabs.io/
protocol: wss
Production India:
url: wss://api.in.residency.elevenlabs.io/
protocol: wss
components:
schemas:
type_:TextToSpeechOutputFormatEnum:
type: string
enum:
- mp3_22050_32
- mp3_44100_32
- mp3_44100_64
- mp3_44100_96
- mp3_44100_128
- mp3_44100_192
- pcm_8000
- pcm_16000
- pcm_22050
- pcm_24000
- pcm_44100
- ulaw_8000
- alaw_8000
- opus_48000_32
- opus_48000_64
- opus_48000_96
- opus_48000_128
- opus_48000_192
description: The output audio format
title: TextToSpeechOutputFormatEnum
type_:TextToSpeechApplyTextNormalizationEnum:
type: string
enum:
- auto
- 'on'
- 'off'
default: auto
description: >-
This parameter controls text normalization with three modes - 'auto',
'on', and 'off'. When set to 'auto', the system will automatically
decide whether to apply text normalization (e.g., spelling out numbers).
With 'on', text normalization will always be applied, while with 'off',
it will be skipped. For the 'eleven_flash_v2_5' model, text
normalization can only be enabled with Enterprise plans. Defaults to
'auto'.
title: TextToSpeechApplyTextNormalizationEnum
type_:NormalizedAlignment:
type: object
properties:
charStartTimesMs:
type: array
items:
type: integer
description: >-
A list of starting times (in milliseconds) for each character in the
normalized text as it
corresponds to the audio. For instance, the character 'H' starts at
time 0 ms in the audio.
Note these times are relative to the returned chunk from the model,
and not the
full audio response.
charDurationsMs:
type: array
items:
type: integer
description: >-
A list of durations (in milliseconds) for each character in the
normalized text as it
corresponds to the audio. For instance, the character 'H' lasts for
3 ms in the audio.
Note these times are relative to the returned chunk from the model,
and not the
full audio response.
chars:
type: array
items:
type: string
description: >-
A list of characters in the normalized text sequence. For instance,
the first character is 'H'.
Note that this list may contain spaces, punctuation, and other
special characters.
The length of this list should be the same as the lengths of
`charStartTimesMs` and `charDurationsMs`.
description: >-
Alignment information for the generated audio given the input normalized
text sequence.
title: NormalizedAlignment
type_:Alignment:
type: object
properties:
charStartTimesMs:
type: array
items:
type: integer
description: >-
A list of starting times (in milliseconds) for each character in the
text as it
corresponds to the audio. For instance, the character 'H' starts at
time 0 ms in the audio.
Note these times are relative to the returned chunk from the model,
and not the
full audio response.
charDurationsMs:
type: array
items:
type: integer
description: >-
A list of durations (in milliseconds) for each character in the text
as it
corresponds to the audio. For instance, the character 'H' lasts for
3 ms in the audio.
Note these times are relative to the returned chunk from the model,
and not the
full audio response.
chars:
type: array
items:
type: string
description: >-
A list of characters in the text sequence. For instance, the first
character is 'H'.
Note that this list may contain spaces, punctuation, and other
special characters.
The length of this list should be the same as the lengths of
`charStartTimesMs` and `charDurationsMs`.
description: >-
Alignment information for the generated audio given the input text
sequence.
title: Alignment
type_:AudioOutput:
type: object
properties:
audio:
type: string
description: >-
A generated partial audio chunk, encoded using the selected
output_format, by default this
is MP3 encoded as a base64 string.
normalizedAlignment:
$ref: '#/components/schemas/type_:NormalizedAlignment'
alignment:
$ref: '#/components/schemas/type_:Alignment'
required:
- audio
title: AudioOutput
type_:FinalOutput:
type: object
properties:
isFinal:
type: boolean
enum:
- true
description: >-
Indicates if the generation is complete. If set to `True`, `audio`
will be null.
title: FinalOutput
type_v1TextToSpeechVoiceIdStreamInput:receiveMessage:
oneOf:
- $ref: '#/components/schemas/type_:AudioOutput'
- $ref: '#/components/schemas/type_:FinalOutput'
description: Receive messages from the WebSocket
title: receiveMessage
type_:RealtimeVoiceSettings:
type: object
properties:
stability:
type: number
format: double
default: 0.5
description: Defines the stability for voice settings.
similarity_boost:
type: number
format: double
default: 0.75
description: Defines the similarity boost for voice settings.
style:
type: number
format: double
default: 0
description: >-
Defines the style for voice settings. This parameter is available on
V2+ models.
use_speaker_boost:
type: boolean
default: true
description: >-
Defines the use speaker boost for voice settings. This parameter is
available on V2+ models.
speed:
type: number
format: double
default: 1
description: >-
Controls the speed of the generated speech. Values range from 0.7 to
1.2, with 1.0 being the default speed.
title: RealtimeVoiceSettings
type_:GenerationConfig:
type: object
properties:
chunk_length_schedule:
type: array
items:
type: number
format: double
description: >-
This is an advanced setting that most users shouldn't need to use.
It relates to our
generation schedule.
Our WebSocket service incorporates a buffer system designed to
optimize the Time To First Byte (TTFB) while maintaining
high-quality streaming.
All text sent to the WebSocket endpoint is added to this buffer and
only when that buffer reaches a certain size is an audio generation
attempted. This is because our model provides higher quality audio
when the model has longer inputs, and can deduce more context about
how the text should be delivered.
The buffer ensures smooth audio data delivery and is automatically
emptied with a final audio generation either when the stream is
closed, or upon sending a `flush` command. We have advanced settings
for changing the chunk schedule, which can improve latency at the
cost of quality by generating audio more frequently with smaller
text inputs.
The `chunk_length_schedule` determines the minimum amount of text
that needs to be sent and present in our
buffer before audio starts being generated. This is to maximise the
amount of context available to
the model to improve audio quality, whilst balancing latency of the
returned audio chunks.
The default value for `chunk_length_schedule` is: [120, 160, 250,
290].
This means that the first chunk of audio will not be generated until
you send text that
totals at least 120 characters long. The next chunk of audio will
only be generated once a
further 160 characters have been sent. The third audio chunk will be
generated after the
next 250 characters. Then the fourth, and beyond, will be generated
in sets of at least 290 characters.
Customize this array to suit your needs. If you want to generate
audio more frequently
to optimise latency, you can reduce the values in the array. Note
that setting the values
too low may result in lower quality audio. Please test and adjust as
needed.
Each item should be in the range 50-500.
title: GenerationConfig
type_:PronunciationDictionaryLocator:
type: object
properties:
pronunciation_dictionary_id:
type: string
description: The unique identifier of the pronunciation dictionary
version_id:
type: string
description: The version identifier of the pronunciation dictionary
required:
- pronunciation_dictionary_id
- version_id
description: Identifies a specific pronunciation dictionary to use
title: PronunciationDictionaryLocator
type_:InitializeConnection:
type: object
properties:
text:
type: string
enum:
- ' '
description: The initial text that must be sent is a blank space.
voice_settings:
$ref: '#/components/schemas/type_:RealtimeVoiceSettings'
generation_config:
$ref: '#/components/schemas/type_:GenerationConfig'
pronunciation_dictionary_locators:
type: array
items:
$ref: '#/components/schemas/type_:PronunciationDictionaryLocator'
description: >-
Optional list of pronunciation dictionary locators. If provided,
these dictionaries will be used to
modify pronunciation of matching text. Must only be provided in the
first message.
Note: Pronunciation dictionary matches will only be respected within
a provided chunk.
xi-api-key:
type: string
description: >-
Your ElevenLabs API key. This can only be included in the first
message and is not needed if present in the header.
authorization:
type: string
description: >-
Your authorization bearer token. This can only be included in the
first message and is not needed if present in the header.
required:
- text
title: InitializeConnection
type_:SendText:
type: object
properties:
text:
type: string
description: >-
The text to be sent to the API for audio generation. Should always
            end with a single space.
try_trigger_generation:
type: boolean
default: false
description: >-
This is an advanced setting that most users shouldn't need to use.
It relates to our generation schedule.
Use this to attempt to immediately trigger the generation of audio,
overriding the `chunk_length_schedule`.
Unlike flush, `try_trigger_generation` will only generate audio if
our
buffer contains more than a minimum
threshold of characters, this is to ensure a higher quality response
from our model.
Note that overriding the chunk schedule to generate small amounts of
text may result in lower quality audio, therefore, only use this
parameter if you
really need text to be processed immediately. We generally recommend
keeping the default value of
`false` and adjusting the `chunk_length_schedule` in the
`generation_config` instead.
voice_settings:
$ref: '#/components/schemas/type_:RealtimeVoiceSettings'
description: >-
The voice settings field can be provided in the first
`InitializeConnection` message and then must either be not provided
or not changed.
generator_config:
$ref: '#/components/schemas/type_:GenerationConfig'
description: >-
The generator config field can be provided in the first
`InitializeConnection` message and then must either be not provided
or not changed.
flush:
type: boolean
default: false
description: >-
Flush forces the generation of audio. Set this value to true when
you have finished sending text, but want to keep the websocket
connection open.
This is useful when you want to ensure that the last chunk of audio
is generated even when the length of text sent is smaller than the
value set in chunk_length_schedule (e.g. 120 or 50).
required:
- text
title: SendText
type_:CloseConnection:
type: object
properties:
text:
type: string
enum:
- ''
description: End the stream with an empty string
required:
- text
title: CloseConnection
type_v1TextToSpeechVoiceIdStreamInput:sendMessage:
oneOf:
- $ref: '#/components/schemas/type_:InitializeConnection'
- $ref: '#/components/schemas/type_:SendText'
- $ref: '#/components/schemas/type_:CloseConnection'
description: Send messages to the WebSocket
title: sendMessage
Multi-Context WebSocket
GET /v1/text-to-speech/{voice_id}/multi-stream-input
The Multi-Context Text-to-Speech WebSockets API allows for generating audio from text input while managing multiple independent audio generation streams (contexts) over a single WebSocket connection. This is useful for scenarios requiring concurrent or interleaved audio generations, such as dynamic conversational AI applications.
Each context, identified by a context id, maintains its own state. You can send text to specific
contexts, flush them, or close them independently. A close_socket message can be used to terminate
the entire connection gracefully.
For more information on best practices for how to use this API, please see the [multi-context WebSocket guide](/docs/developers/guides/cookbooks/multi-context-web-socket).
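As a rough sketch of the message flow, again using the third-party websockets Python package; the context_id, flush, xi_api_key, and contextId fields follow the schemas below, while the close_socket message shape is an assumption based on the description above:

import asyncio
import base64
import json

import websockets  # third-party package; an assumption of this sketch

VOICE_ID = "JBFqnCBsd6RMkjVDRZzb"
URI = (
    f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/multi-stream-input"
    "?model_id=eleven_multilingual_v2"
)

async def main() -> None:
    async with websockets.connect(URI) as ws:
        # Initialize two independent contexts over one connection.
        for ctx in ("reply_1", "reply_2"):
            await ws.send(json.dumps(
                {"text": " ", "context_id": ctx, "xi_api_key": "YOUR_API_KEY"}
            ))
        # Interleave text across contexts; each keeps its own state.
        await ws.send(json.dumps({"text": "First answer. ", "context_id": "reply_1"}))
        await ws.send(json.dumps({"text": "Second answer. ", "context_id": "reply_2"}))
        # Flush both contexts, then close the whole connection.
        for ctx in ("reply_1", "reply_2"):
            await ws.send(json.dumps({"context_id": ctx, "flush": True}))
        await ws.send(json.dumps({"close_socket": True}))  # assumed shape, per the description above

        # Server messages tag audio with contextId; route accordingly.
        audio_by_context: dict[str, bytearray] = {}
        async for message in ws:
            data = json.loads(message)
            if data.get("audio"):
                audio_by_context.setdefault(
                    data.get("contextId", "default"), bytearray()
                ).extend(base64.b64decode(data["audio"]))

asyncio.run(main())

In a production conversational application, sends and receives would run concurrently so that audio for one context can play while text for another is still being submitted.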
AsyncAPI Specification
asyncapi: 2.6.0
info:
title: V 1 Text To Speech Voice Id Multi Stream Input
version: >-
subpackage_v1TextToSpeechVoiceIdMultiStreamInput.v1TextToSpeechVoiceIdMultiStreamInput
description: >-
The Multi-Context Text-to-Speech WebSockets API allows for generating audio
from text input
while managing multiple independent audio generation streams (contexts) over
a single WebSocket connection.
This is useful for scenarios requiring concurrent or interleaved audio
generations, such as dynamic
conversational AI applications.
Each context, identified by a context id, maintains its own state. You can
send text to specific
contexts, flush them, or close them independently. A `close_socket` message
can be used to terminate
the entire connection gracefully.
For more information on best practices for how to use this API, please see
the [multi context websocket
guide](/docs/developers/guides/cookbooks/multi-context-web-socket).
channels:
/v1/text-to-speech/{voice_id}/multi-stream-input:
description: >-
The Multi-Context Text-to-Speech WebSockets API allows for generating
audio from text input
while managing multiple independent audio generation streams (contexts)
over a single WebSocket connection.
This is useful for scenarios requiring concurrent or interleaved audio
generations, such as dynamic
conversational AI applications.
Each context, identified by a context id, maintains its own state. You can
send text to specific
contexts, flush them, or close them independently. A `close_socket`
message can be used to terminate
the entire connection gracefully.
For more information on best practices for how to use this API, please see
the [multi context websocket
guide](/docs/developers/guides/cookbooks/multi-context-web-socket).
parameters:
voice_id:
description: The unique identifier for the voice to use in the TTS process.
schema:
type: string
bindings:
ws:
query:
type: object
properties:
authorization:
type: string
single_use_token:
type: string
model_id:
type: string
language_code:
type: string
enable_logging:
type: boolean
default: true
enable_ssml_parsing:
type: boolean
default: false
output_format:
$ref: '#/components/schemas/type_:TextToSpeechOutputFormatEnum'
inactivity_timeout:
type: integer
default: 20
sync_alignment:
type: boolean
default: false
auto_mode:
type: boolean
default: false
apply_text_normalization:
$ref: >-
#/components/schemas/type_:TextToSpeechApplyTextNormalizationEnum
seed:
type: integer
headers:
type: object
properties:
xi-api-key:
type: string
publish:
operationId: v-1-text-to-speech-voice-id-multi-stream-input-publish
summary: Server message
message:
name: subscribe
payload:
$ref: >-
#/components/schemas/type_v1TextToSpeechVoiceIdMultiStreamInput:receiveMessageMulti
subscribe:
operationId: v-1-text-to-speech-voice-id-multi-stream-input-subscribe
summary: Client message
message:
name: publish
payload:
$ref: >-
#/components/schemas/type_v1TextToSpeechVoiceIdMultiStreamInput:sendMessageMulti
servers:
Production:
url: wss://api.elevenlabs.io/
protocol: wss
x-default: true
Production US:
url: wss://api.us.elevenlabs.io/
protocol: wss
Production EU:
url: wss://api.eu.residency.elevenlabs.io/
protocol: wss
Production India:
url: wss://api.in.residency.elevenlabs.io/
protocol: wss
components:
schemas:
type_:TextToSpeechOutputFormatEnum:
type: string
enum:
- mp3_22050_32
- mp3_44100_32
- mp3_44100_64
- mp3_44100_96
- mp3_44100_128
- mp3_44100_192
- pcm_8000
- pcm_16000
- pcm_22050
- pcm_24000
- pcm_44100
- ulaw_8000
- alaw_8000
- opus_48000_32
- opus_48000_64
- opus_48000_96
- opus_48000_128
- opus_48000_192
description: The output audio format
title: TextToSpeechOutputFormatEnum
type_:TextToSpeechApplyTextNormalizationEnum:
type: string
enum:
- auto
- 'on'
- 'off'
default: auto
description: >-
This parameter controls text normalization with three modes - 'auto',
'on', and 'off'. When set to 'auto', the system will automatically
decide whether to apply text normalization (e.g., spelling out numbers).
With 'on', text normalization will always be applied, while with 'off',
it will be skipped. For the 'eleven_flash_v2_5' model, text
normalization can only be enabled with Enterprise plans. Defaults to
'auto'.
title: TextToSpeechApplyTextNormalizationEnum
type_:NormalizedAlignment:
type: object
properties:
charStartTimesMs:
type: array
items:
type: integer
description: >-
A list of starting times (in milliseconds) for each character in the
normalized text as it
corresponds to the audio. For instance, the character 'H' starts at
time 0 ms in the audio.
Note these times are relative to the returned chunk from the model,
and not the
full audio response.
charDurationsMs:
type: array
items:
type: integer
description: >-
A list of durations (in milliseconds) for each character in the
normalized text as it
corresponds to the audio. For instance, the character 'H' lasts for
3 ms in the audio.
Note these times are relative to the returned chunk from the model,
and not the
full audio response.
chars:
type: array
items:
type: string
description: >-
A list of characters in the normalized text sequence. For instance,
the first character is 'H'.
Note that this list may contain spaces, punctuation, and other
special characters.
The length of this list should be the same as the lengths of
`charStartTimesMs` and `charDurationsMs`.
description: >-
Alignment information for the generated audio given the input normalized
text sequence.
title: NormalizedAlignment
type_:Alignment:
type: object
properties:
charStartTimesMs:
type: array
items:
type: integer
description: >-
A list of starting times (in milliseconds) for each character in the
text as it
corresponds to the audio. For instance, the character 'H' starts at
time 0 ms in the audio.
Note these times are relative to the returned chunk from the model,
and not the
full audio response.
charDurationsMs:
type: array
items:
type: integer
description: >-
A list of durations (in milliseconds) for each character in the text
as it
corresponds to the audio. For instance, the character 'H' lasts for
3 ms in the audio.
Note these times are relative to the returned chunk from the model,
and not the
full audio response.
chars:
type: array
items:
type: string
description: >-
A list of characters in the text sequence. For instance, the first
character is 'H'.
Note that this list may contain spaces, punctuation, and other
special characters.
The length of this list should be the same as the lengths of
`charStartTimesMs` and `charDurationsMs`.
description: >-
Alignment information for the generated audio given the input text
sequence.
title: Alignment
type_:AudioOutputMulti:
type: object
properties:
audio:
type: string
description: Base64 encoded audio chunk.
normalizedAlignment:
$ref: '#/components/schemas/type_:NormalizedAlignment'
alignment:
$ref: '#/components/schemas/type_:Alignment'
contextId:
type: string
          description: The context id this audio chunk belongs to.
required:
- audio
description: Server payload containing an audio chunk for a specific context.
title: AudioOutputMulti
type_:FinalOutputMulti:
type: object
properties:
isFinal:
type: boolean
enum:
- true
description: Indicates this is the final message for the context.
contextId:
type: string
description: The context_id for which this is the final message.
required:
- isFinal
description: Server payload indicating the final output for a specific context.
title: FinalOutputMulti
type_v1TextToSpeechVoiceIdMultiStreamInput:receiveMessageMulti:
oneOf:
- $ref: '#/components/schemas/type_:AudioOutputMulti'
- $ref: '#/components/schemas/type_:FinalOutputMulti'
description: Receive messages from the multi-context WebSocket.
title: receiveMessageMulti
type_:RealtimeVoiceSettings:
type: object
properties:
stability:
type: number
format: double
default: 0.5
description: Defines the stability for voice settings.
similarity_boost:
type: number
format: double
default: 0.75
description: Defines the similarity boost for voice settings.
style:
type: number
format: double
default: 0
description: >-
Defines the style for voice settings. This parameter is available on
V2+ models.
use_speaker_boost:
type: boolean
default: true
description: >-
Defines the use speaker boost for voice settings. This parameter is
available on V2+ models.
speed:
type: number
format: double
default: 1
description: >-
Controls the speed of the generated speech. Values range from 0.7 to
1.2, with 1.0 being the default speed.
title: RealtimeVoiceSettings
type_:GenerationConfig:
type: object
properties:
chunk_length_schedule:
type: array
items:
type: number
format: double
description: >-
This is an advanced setting that most users shouldn't need to use.
It relates to our
generation schedule.
Our WebSocket service incorporates a buffer system designed to
optimize the Time To First Byte (TTFB) while maintaining
high-quality streaming.
All text sent to the WebSocket endpoint is added to this buffer and
only when that buffer reaches a certain size is an audio generation
attempted. This is because our model provides higher quality audio
when the model has longer inputs, and can deduce more context about
how the text should be delivered.
The buffer ensures smooth audio data delivery and is automatically
emptied with a final audio generation either when the stream is
closed, or upon sending a `flush` command. We have advanced settings
for changing the chunk schedule, which can improve latency at the
cost of quality by generating audio more frequently with smaller
text inputs.
The `chunk_length_schedule` determines the minimum amount of text
that needs to be sent and present in our
buffer before audio starts being generated. This is to maximise the
amount of context available to
the model to improve audio quality, whilst balancing latency of the
returned audio chunks.
The default value for `chunk_length_schedule` is: [120, 160, 250,
290].
This means that the first chunk of audio will not be generated until
you send text that
totals at least 120 characters long. The next chunk of audio will
only be generated once a
further 160 characters have been sent. The third audio chunk will be
generated after the
next 250 characters. Then the fourth, and beyond, will be generated
in sets of at least 290 characters.
Customize this array to suit your needs. If you want to generate
audio more frequently
to optimise latency, you can reduce the values in the array. Note
that setting the values
too low may result in lower quality audio. Please test and adjust as
needed.
Each item should be in the range 50-500.
title: GenerationConfig
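# Example generation_config that trades some quality for lower latency by
# allowing audio to be generated from smaller text buffers (illustrative values
# within the documented 50-500 range):
# {"chunk_length_schedule": [50, 90, 120, 150]}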
type_:PronunciationDictionaryLocator:
type: object
properties:
pronunciation_dictionary_id:
type: string
description: The unique identifier of the pronunciation dictionary
version_id:
type: string
description: The version identifier of the pronunciation dictionary
required:
- pronunciation_dictionary_id
- version_id
description: Identifies a specific pronunciation dictionary to use
title: PronunciationDictionaryLocator
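# Example locator (both identifiers are hypothetical placeholders):
# {"pronunciation_dictionary_id": "pd_abc123", "version_id": "v_001"}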
type_:InitializeConnectionMulti:
type: object
properties:
text:
type: string
enum:
- ' '
description: Must be a single space character to initiate the context.
voice_settings:
$ref: '#/components/schemas/type_:RealtimeVoiceSettings'
generation_config:
$ref: '#/components/schemas/type_:GenerationConfig'
pronunciation_dictionary_locators:
type: array
items:
$ref: '#/components/schemas/type_:PronunciationDictionaryLocator'
description: Optional pronunciation dictionaries for this context.
xi_api_key:
type: string
description: >-
Your ElevenLabs API key (if not in header). For this context's first
message only.
authorization:
type: string
description: >-
Your authorization bearer token (if not in header). For this
context's first message only.
context_id:
type: string
description: >-
A unique identifier for the first context created in the WebSocket.
If not provided, a default context will be used.
required:
- text
description: >-
Payload to initialize a new context in a multi-stream WebSocket
connection.
title: InitializeConnectionMulti
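# Example first client message on a multi-context connection (context_id and
# setting values are hypothetical; the API key may instead be supplied in a
# header or query parameter):
# {
#   "text": " ",
#   "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
#   "generation_config": {"chunk_length_schedule": [120, 160, 250, 290]},
#   "context_id": "conv_1",
#   "xi_api_key": "YOUR_API_KEY"
# }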
type_:InitialiseContext:
type: object
properties:
text:
type: string
description: The initial text to synthesize. Should end with a single space.
voice_settings:
$ref: '#/components/schemas/type_:RealtimeVoiceSettings'
generation_config:
$ref: '#/components/schemas/type_:GenerationConfig'
pronunciation_dictionary_locators:
type: array
items:
$ref: '#/components/schemas/type_:PronunciationDictionaryLocator'
description: >-
Optional list of pronunciation dictionary locators to be used for
this context.
xi_api_key:
type: string
description: >-
Your ElevenLabs API key. Required if not provided in the WebSocket
connection's header or query parameters. This applies to the
(re)initialization of this specific context.
authorization:
type: string
description: >-
Your authorization bearer token. Required if not provided in the
WebSocket connection's header or query parameters. This applies to
the (re)initialization of this specific context.
context_id:
type: string
description: >-
An identifier for the text-to-speech context. If omitted, a default
context ID may be assigned by the server. If provided, this message
will create a new context with this ID or re-initialize an existing
one with the new settings and text.
required:
- text
description: >-
Payload to initialize or re-initialize a TTS context with specific
settings and initial text for multi-stream connections.
title: InitialiseContext
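# Example message that creates (or re-initializes) a second context on the same
# connection (text and context_id are hypothetical):
# {"text": "Hello from a second context ", "voice_settings": {"stability": 0.5}, "context_id": "conv_2"}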
type_:SendTextMulti:
type: object
properties:
text:
type: string
description: Text to synthesize. Should end with a single space.
context_id:
type: string
description: The target context_id for this text.
flush:
type: boolean
default: false
description: >-
If true, flushes the audio buffer for the specified context. If
false, the text will be appended to the buffer to be generated.
required:
- text
description: Payload to send text for synthesis to an existing context.
title: SendTextMulti
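# Example message appending more text to an existing context (values are hypothetical):
# {"text": "Here is some more text to synthesize. ", "context_id": "conv_1", "flush": false}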
type_:FlushContext:
type: object
properties:
context_id:
type: string
description: The context_id to flush.
text:
type: string
description: The text to append to the buffer to be flushed.
flush:
type: boolean
default: false
description: >-
If true, flushes the audio buffer for the specified context. If
false, the context will remain open and the text will be appended to
the buffer to be generated.
required:
- context_id
- flush
description: Payload to flush the audio buffer for a specific context.
title: FlushContext
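# Example message that forces generation of whatever remains in the buffer for
# one context (the context_id is hypothetical):
# {"context_id": "conv_1", "flush": true}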
type_:CloseContext:
type: object
properties:
context_id:
type: string
description: The context_id to close.
close_context:
type: boolean
default: false
description: >-
Must be set to true to close the specified context. If false, the
context will remain open and any text will be ignored. If the context
has already been set to flush, it will finish flushing before
closing. The same context_id can be used again afterwards, but it
will not be linked to the previous context with the same name.
required:
- context_id
- close_context
description: Payload to close a specific TTS context.
title: CloseContext
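# Example message closing a single context while leaving the socket open (the
# context_id is hypothetical):
# {"context_id": "conv_1", "close_context": true}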
type_:CloseSocket:
type: object
properties:
close_socket:
type: boolean
default: false
description: >-
If true, closes all contexts and closes the entire WebSocket
connection. Any context that was previously set to flush will wait
to flush before closing.
description: Payload to signal closing the entire WebSocket connection.
title: CloseSocket
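# Example message closing every context and the WebSocket connection itself:
# {"close_socket": true}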
type_:KeepContextAlive:
type: object
properties:
text:
type: string
enum:
- ''
description: >-
An empty string. This text is ignored by the server but its presence
resets the inactivity timeout for the specified context.
context_id:
type: string
description: The identifier of the context to keep alive.
required:
- text
- context_id
description: >-
Payload to keep a specific context alive by resetting its inactivity
timeout. The empty text is ignored by the server but resets the timeout.
title: KeepContextAlive
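# Example keep-alive message for an otherwise idle context (the context_id is
# hypothetical):
# {"text": "", "context_id": "conv_2"}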
type_v1TextToSpeechVoiceIdMultiStreamInput:sendMessageMulti:
oneOf:
- $ref: '#/components/schemas/type_:InitializeConnectionMulti'
- $ref: '#/components/schemas/type_:InitialiseContext'
- $ref: '#/components/schemas/type_:SendTextMulti'
- $ref: '#/components/schemas/type_:FlushContext'
- $ref: '#/components/schemas/type_:CloseContext'
- $ref: '#/components/schemas/type_:CloseSocket'
- $ref: '#/components/schemas/type_:KeepContextAlive'
description: Messages the client can send to the multi-context WebSocket.
title: sendMessageMulti
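# Illustrative ordering of client messages over the lifetime of one context
# (a sketch, not an exhaustive protocol description): InitializeConnectionMulti,
# then SendTextMulti repeated as text becomes available, FlushContext when the
# remaining buffered text should be generated, CloseContext when the context is
# done, and finally CloseSocket; KeepContextAlive may be sent whenever a context
# would otherwise hit its inactivity timeout.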
