Discord Voice Live Runbook
Use this after the non-live checks pass. The harness joins a real Discord voice channel, connects the selected realtime voice agent, and validates the counters that prove the voice bridge is working. OpenAI Realtime is the default realtime reasoning/tool agent, and xAI Grok Voice can be selected as an alternate realtime agent. OpenAI is still used for Discord speaker transcription. Speech output can come from the selected realtime agent audio directly or from ElevenLabs TTS.
The xAI realtime path supports the live voice loop and tool calls, but it does not receive Discord screen-share image frames. Use OpenAI Realtime for the screen-watch snapshot workflow.
Preflight
Run these locally before joining Discord:
pnpm check
pnpm smoke
pnpm voice:native:check
pnpm voice:build
Enable voice from the Clanky TUI with:
/discord-voice
On a fresh profile that opens setup. Once voice settings exist, it shows status.
Setup can enable voice access, add allowed server IDs, optionally restrict voice
channel IDs, pin a current voice channel target, and adjust Realtime settings.
Use /discord-voice enable to turn dynamic voice access on without pinning a
channel, /discord-voice allow-server <guild-id> [...] to add allowed servers,
/discord-voice allow-channel <voice-channel-id> [...] to add optional channel
restrictions, /discord-voice join <guild-id> <voice-channel-id> to pin one
active channel, or /discord-voice status for a quick text snapshot. These
settings are stored in the active Clanky profile and hot-restart the Discord
bridge when changed.
Pinned targets do not auto-join on normal Clanky startup. Use
/discord-voice join or let the model call discord_voice_join when someone
clearly asks for voice. For dev/live-test startup auto-join, set
CLANKY_DISCORD_VOICE_AUTO_JOIN=1 or /discord-voice set auto-join on.
When voice access is enabled without a pinned target, Clanky can use the
discord_voice_status, discord_voice_join, and discord_voice_leave tools to
choose or leave a voice channel at runtime. If an allowlist is configured,
discord_voice_join rejects channels outside it.
By default OpenAI Realtime provides both the spoken response and the
tool-calling brain. The tts-provider setting is only the speech output
provider; it does not change the realtime reasoning/tool agent. To use xAI Grok
Voice as the realtime reasoning/tool agent, configure xAI separately:
/xai-login
/discord-voice set realtime-provider xai
/discord-voice set xai-model grok-voice-latest
/discord-voice set xai-voice eve
To use ElevenLabs voices for speech while keeping the selected realtime agent
for reasoning, use the /discord-voice advanced settings or the shortcut
commands:
/elevenlabs-login
/discord-voice set tts-provider elevenlabs
/discord-voice set elevenlabs-voice <voice-id>
/discord-voice set elevenlabs-model eleven_flash_v2_5
/discord-voice set elevenlabs-output-format pcm_24000
/discord-voice set eagerness 50
Env config still works and overrides the TUI profile setting when present:
CLANKY_DISCORD_VOICE_ENABLED=1
# Optional fixed target:
CLANKY_DISCORD_VOICE_GUILD_ID=...
CLANKY_DISCORD_VOICE_CHANNEL_ID=...
# Optional comma/space-separated server allowlist:
CLANKY_DISCORD_VOICE_ALLOWED_GUILD_IDS=...
# Optional comma/space-separated channel allowlist:
CLANKY_DISCORD_VOICE_ALLOWED_CHANNEL_IDS=...
# Optional xAI Grok Voice realtime agent:
CLANKY_DISCORD_VOICE_REALTIME_AGENT_PROVIDER=xai
XAI_API_KEY=...
CLANKY_XAI_REALTIME_MODEL=grok-voice-latest
CLANKY_XAI_REALTIME_VOICE=eve
# Optional ElevenLabs speech synthesis:
CLANKY_DISCORD_VOICE_TTS_PROVIDER=elevenlabs
ELEVENLABS_API_KEY=...
# or CLANKY_ELEVENLABS_API_KEY=...
CLANKY_ELEVENLABS_VOICE_ID=...
CLANKY_ELEVENLABS_MODEL=eleven_flash_v2_5
CLANKY_ELEVENLABS_OUTPUT_FORMAT=pcm_24000
# Optional voice participation eagerness from 0 to 100:
CLANKY_DISCORD_VOICE_EAGERNESS=50
Supported ElevenLabs PCM output formats are pcm_16000, pcm_22050,
pcm_24000, and pcm_44100. CLANKY_ELEVENLABS_BASE_URL or
ELEVENLABS_BASE_URL can override the API base URL.
CLANKY_DISCORD_VOICE_PARTICIPATION_EAGERNESS is also accepted for voice
eagerness. The /discord-voice advanced settings can store the ElevenLabs
voice id, model, output format, base URL, and participation eagerness in the
active profile; /elevenlabs-login stores the API key.
Credentials can come from CLANKY_DISCORD_TOKEN or from a stored
/discord-login credential in the active Clanky profile. Bot tokens are enough
for normal voice audio. Native Discord Go Live watching requires a user-token
credential.
OpenAI credentials are still required for speaker transcription and can come
from OPENAI_API_KEY, CLANKY_OPENAI_API_KEY, or a stored /openai-login API
key. The examples below show OPENAI_API_KEY; omit that line if the active
profile already has a stored OpenAI key. xAI Grok Voice additionally requires
XAI_API_KEY or a stored /xai-login API key when
CLANKY_DISCORD_VOICE_REALTIME_AGENT_PROVIDER=xai. ElevenLabs speech
additionally requires
ELEVENLABS_API_KEY, CLANKY_ELEVENLABS_API_KEY, or a stored
/elevenlabs-login API key.
This bridge uses realtime WebSocket/event transports intentionally. OpenAI
recommends WebRTC for browser or mobile clients where the client owns a
microphone track. Here, clankvox already terminates Discord RTP and exposes
PCM frames to Node, so WebSocket audio-buffer events map directly onto Discord
input/output without adding another media peer connection.
In group voice, barge-in is intentionally gated. While Clanky has spoken audio
buffered or active external TTS, ordinary speaker transcripts are recorded but
not forwarded to the response session. A speaker must explicitly say Clanky
or a configured/known STT alias such as clank, clanker, clonky,
clunky, or planky to interrupt; that path stops TTS playback, cancels the
active Realtime response, and forwards the addressed transcript. Add
CLANKY_DISCORD_VOICE_WAKE_NAMES or CLANKY_DISCORD_WAKE_NAMES for local
nicknames that should also count.
Music And Video Media
Realtime is the live voice front-end, but Pi remains the reasoning/tool/skill layer. The realtime voice bridge exposes small URL-first media controls:
play_music_url: play a resolved http(s) URL into Discord voice audio.play_video_url: start Discord Go Live for a resolved http(s) video URL and, by default, play the audio into voice too.start_music_visualizer: publish a Go Live visualizer for current music or a resolved music URL.media_pause,media_resume,media_stop,media_status: live playback controls.
For search-like requests such as "play Minecraft music" or "put on this video",
Realtime should call ask_pi first and let Pi use skills/tools to resolve the
playable URL. The media tools intentionally do not carry the old search,
queueing, and disambiguation stack.
play_music_url works with bot-token voice because it is normal Discord voice
audio. play_video_url and start_music_visualizer need a user-token
credential because they use Discord Go Live publish.
Bot-Token Audio And Tool Check
This checks voice join, Discord input audio, Realtime session acceptance, Realtime output audio, Realtime tool calling, Pi delegation, and Realtime socket health. Join the configured voice channel yourself and speak briefly during the run.
CLANKY_DISCORD_VOICE_ENABLED=1 \
CLANKY_DISCORD_VOICE_GUILD_ID=... \
CLANKY_DISCORD_VOICE_CHANNEL_ID=... \
OPENAI_API_KEY=... \
CLANKY_DISCORD_VOICE_LIVE_MS=90000 \
CLANKY_DISCORD_VOICE_REQUIRE_INPUT_AUDIO=1 \
CLANKY_DISCORD_VOICE_REQUIRE_REALTIME_SESSION=1 \
CLANKY_DISCORD_VOICE_REQUIRE_OUTPUT_AUDIO=1 \
CLANKY_DISCORD_VOICE_REQUIRE_TOOL_CALL=1 \
CLANKY_DISCORD_VOICE_REQUIRE_ASK_PI=1 \
CLANKY_DISCORD_VOICE_FAIL_ON_REALTIME_ERROR=1 \
CLANKY_DISCORD_VOICE_RESULT_PATH=tmp/discord-voice-bot-result.json \
CLANKY_DISCORD_VOICE_SCRIPTED_PROMPT="Use ask_pi to ask Pi for a one-sentence status check, then reply out loud." \
pnpm voice:live
gpt-realtime-2 is the default realtime agent model. The bridge sends
reasoning.effort=low by default for that model; override it with
CLANKY_OPENAI_REALTIME_REASONING_EFFORT=minimal|low|medium|high|xhigh when
you want to trade latency for deeper reasoning.
CLANKY_DISCORD_VOICE_REQUIRE_OUTPUT_AUDIO=1 follows the active speech
provider. For the default OpenAI speech path it requires OpenAI Realtime audio
deltas plus Discord output sends. For ElevenLabs speech it requires ElevenLabs
TTS output plus Discord output sends.
ElevenLabs Speech Check
This checks voice join, Discord input audio, Realtime session acceptance, Realtime tool calling, Pi delegation, ElevenLabs TTS streaming, Discord output sends, and Realtime socket health. Join the configured voice channel yourself and speak briefly during the run.
CLANKY_DISCORD_VOICE_ENABLED=1 \
CLANKY_DISCORD_VOICE_GUILD_ID=... \
CLANKY_DISCORD_VOICE_CHANNEL_ID=... \
OPENAI_API_KEY=... \
CLANKY_DISCORD_VOICE_TTS_PROVIDER=elevenlabs \
ELEVENLABS_API_KEY=... \
CLANKY_ELEVENLABS_VOICE_ID=... \
CLANKY_DISCORD_VOICE_LIVE_MS=90000 \
CLANKY_DISCORD_VOICE_REQUIRE_INPUT_AUDIO=1 \
CLANKY_DISCORD_VOICE_REQUIRE_REALTIME_SESSION=1 \
CLANKY_DISCORD_VOICE_REQUIRE_OUTPUT_AUDIO=1 \
CLANKY_DISCORD_VOICE_REQUIRE_TOOL_CALL=1 \
CLANKY_DISCORD_VOICE_REQUIRE_ASK_PI=1 \
CLANKY_DISCORD_VOICE_FAIL_ON_REALTIME_ERROR=1 \
CLANKY_DISCORD_VOICE_RESULT_PATH=tmp/discord-voice-elevenlabs-result.json \
CLANKY_DISCORD_VOICE_SCRIPTED_PROMPT="Use ask_pi to ask Pi for a one-sentence status check, then reply out loud." \
pnpm voice:live
Pass criteria for this path are the enabled checklist plus
externalTtsRequestCount > 0, externalTtsAudioBytes > 0,
externalTtsErrorCount = 0, and discordOutputAudioSendCount > 0 in the final
status/result JSON.
For group voice validation, add:
CLANKY_DISCORD_VOICE_REQUIRE_GROUP_AUDIO=1
Then have two voice participants overlap briefly. The status output should show
discordInputMaxConcurrentSpeakers greater than 1.
User-Token Go Live Check
Use this only with a burner Discord user account. The native screen-share path uses undocumented user-token/selfbot Discord gateway behavior.
Start a Discord Go Live stream in the configured voice channel, then run:
CLANKY_DISCORD_CREDENTIAL_KIND=user-token \
CLANKY_DISCORD_VOICE_ENABLED=1 \
CLANKY_DISCORD_VOICE_GUILD_ID=... \
CLANKY_DISCORD_VOICE_CHANNEL_ID=... \
OPENAI_API_KEY=... \
CLANKY_DISCORD_VOICE_LIVE_MS=120000 \
CLANKY_DISCORD_VOICE_REQUIRE_ALL=1 \
CLANKY_DISCORD_VOICE_FAIL_ON_REALTIME_ERROR=1 \
CLANKY_DISCORD_VOICE_RESULT_PATH=tmp/discord-voice-golive-result.json \
CLANKY_DISCORD_VOICE_SCRIPTED_PROMPT="List active screen shares, start watching the visible Go Live stream, take a snapshot, and describe what is on screen." \
pnpm voice:live
If you use stored /discord-login credentials instead of
CLANKY_DISCORD_TOKEN, make sure /discord-whoami reports user-token.
Result Artifact
Set CLANKY_DISCORD_VOICE_RESULT_PATH to write a JSON result file. The harness
writes it for preflight/startup failures and before final validation failures
are thrown, so it is useful for both passing and failing runs. Share that file
when reporting live results.
The result includes:
phase(preflight,error, orfinal)validation.passedvalidation.failuresvalidation.checkswith per-requirementid,observed,expected, andpassedvalues when bridge status is availablevalidation.requirements- bridge
statuswhen available - startup/runtime
errorwhen available - start/end timestamps and duration
Pass Criteria
The harness prints voice-live: validation PASS when all enabled requirements
are satisfied. For the full Go Live run, the important counters are:
discordInputAudioEventCount > 0discordInputMaxConcurrentSpeakers > 1when group audio is requiredrealtimeSessionUpdatedCount > 0realtimeAudioDeltaCount > 0when OpenAI speech is activeexternalTtsRequestCount > 0andexternalTtsAudioBytes > 0whenCLANKY_DISCORD_VOICE_TTS_PROVIDER=elevenlabsdiscordOutputAudioSendCount > 0realtimeFunctionCallCount > 0askPiCallCount > 0streamWatchConnectCount > 0decodedVideoFrameCount > 0- Realtime API/socket error counters stay at
0
Common Failures
Discord Go Live validation requires ... user-token: the run requested stream-watch or screen-frame validation with a bot token.expected Discord input audio events: nobody spoke in the configured voice channel, or the bot/user did not join the expected channel.expected overlapping Discord input from at least two speakers: group audio validation was enabled but the speakers did not overlap.expected OpenAI Realtime session.updated after session.update: the Realtime socket opened but OpenAI did not acknowledge the requested session configuration.expected voice output audio to be sent to Discord: the active speech provider produced output, but the bridge did not hand any audio toclankvox.expected ElevenLabs TTS output audio: ElevenLabs speech is active, but the bridge did not complete a TTS request; check the API key, voice id, model, andexternalTtsErrorCount.expected decoded Discord screen-share frames:STREAM_WATCHconnected, but no decodable frame reachedclankvox; keep the Go Live stream visible for longer and retry.- Realtime socket closes/errors: rerun with
CLANKY_DISCORD_VOICE_STATUS_MS=1000to capture the last status counters before failure.
