Voice Pipeline — Provider Abstraction and Stage Reference
Scope: current voice transport stack, runtime modes, and canonical settings surfaces.

Related documents:
- Shared attention model: ../architecture/presence-and-attention.md
- Activity model and knob map: ../architecture/activity.md
- Cross-cutting settings contract: ../reference/settings.md
- Capture and ASR details: voice-capture-and-asr-pipeline.md
- Reply orchestration: voice-client-and-reply-orchestration.md
- Output and barge-in: voice-output-and-barge-in.md
- Discord-native stream transport: discord-streaming.md
- Historical stream-watch rollout: ../archive/selfbot-stream-watch.md
- clankvox local docs: ../../src/voice/clankvox/README.md
This document describes the voice spoke under the shared attention contract: capture, transcription, admission, transport, output, and voice-side ambient delivery.
In this fork, the Discord voice transport is selfbot-owned. Bun owns the user-account gateway/session lifecycle and reply orchestration. clankvox is the Rust media plane that owns RTP, DAVE, Opus, mixer/output pacing, and the native Go Live stream-watch / stream-publish transport legs.
1. Canonical Settings Surface
Persistence, preset inheritance, dashboard envelope shape, and save/version semantics now live in ../reference/settings.md.
This document keeps the voice-local settings surfaces that matter for voice transport and stage behavior.
Voice configuration is split across these live surfaces:
- interaction.activity.*: shared reactive text/voice behavior axes
- agentStack.runtimeConfig.voice.*: runtime/provider transport config
- voice.conversationPolicy.*: reply-path and conversation behavior
- voice.admission.*: public reply-admission policy
- voice.transcription.*: ASR enablement and language hinting
- voice.channelPolicy.*: channel/user access control
- voice.sessionLimits.*: session duration and concurrency limits
- voice.soundboard.*: Discord soundboard capability and catalog selection
- initiative.voice.*: proactive voice-thought cadence
Preset resolution also matters:
- agentStack.preset
- agentStack.overrides.voiceAdmissionClassifier
2. Runtime Overview
The voice stack keeps transport and behavior separate:
- runtime mode chooses the realtime provider family
- reply path chooses how turns are planned
- admission decides whether a turn should reach generation
- generation and tools run either in the provider-native loop or the orchestrator loop
- output speaks through realtime or API TTS, then clankvoxClient paces generated PCM into the Rust mixer while preserving queued speech unless an interruption clears it
Shared continuity can inform this stack. Voice does not own the whole conversational mind; it owns how that continuity becomes audible in a live room.
Media-plane ownership:
- selfbot gateway/session: Discord control-plane identity, voice-state events, stream discovery dispatch
- clankvox: Discord media-plane transport, encryption/decryption, and frame/audio ingress/egress
- Bun voice runtime: turn lifecycle, tools, prompt assembly, commentary, and stream-watch decode/ingest after IPC
Current transport roles inside clankvox:
- voice: main bidirectional voice transport for audio send/receive
- stream_watch: inbound Go Live receive transport for native screen watch
- stream_publish: outbound Go Live sender transport for native self publish
Runtime mode values:
- openai_realtime
- voice_agent
- gemini_realtime
- elevenlabs_realtime
Reply-path values:
- native
- bridge
- brain
Base defaults from settingsSchema.ts:
- agentStack.runtimeConfig.voice.runtimeMode = "openai_realtime"
- voice.conversationPolicy.replyPath = "brain"
- voice.conversationPolicy.ttsMode = "realtime"
- voice.admission.mode = "generation_decides"
Voice runtime precedence is:
- explicit agentStack.runtimeConfig.voice.runtimeMode
- preset default
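A minimal sketch of that precedence, assuming a simplified settings shape; the types and helper name here are illustrative, not actual settingsSchema.ts exports:

```ts
type VoiceRuntimeMode =
  | "openai_realtime"
  | "voice_agent"
  | "gemini_realtime"
  | "elevenlabs_realtime";

// Hypothetical slice of the settings tree for illustration only.
interface SettingsTree {
  agentStack: {
    preset: string;
    runtimeConfig: { voice: { runtimeMode?: VoiceRuntimeMode } };
  };
}

function resolveRuntimeMode(
  settings: SettingsTree,
  presetDefault: VoiceRuntimeMode,
): VoiceRuntimeMode {
  // An explicit runtime mode always wins; otherwise fall back to the
  // default supplied by the active agentStack.preset.
  return settings.agentStack.runtimeConfig.voice.runtimeMode ?? presetDefault;
}
```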
3. Reply Paths
Native
The provider owns audio input, planning, tool calls, and audio output end to end.
Properties:
- lowest orchestration overhead
- provider-native tool loop when the runtime supports it
- no local text-generation stage
Native is available on runtimes that support provider-native planning, including openai_realtime and voice_agent.
Bridge
The runtime transcribes speech locally, then forwards labeled text to the realtime provider. The provider still owns response planning and provider-native tool calls.
Properties:
- text-mediated realtime turn handling
- classifier-first admission in practice
- provider-native tool loop when supported
Brain
The orchestrator owns text generation and tool calling. The realtime provider serves as the TTS transport (WebSocket streaming for OpenAI/xAI/Gemini/ElevenLabs), or the OpenAI Audio API serves as TTS when voice.conversationPolicy.ttsMode = "api".
Properties:
- works with all runtime/provider combinations
- shared text/voice tool loop behavior
- generation binding comes from agentStack.runtimeConfig.voice.generation
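As an illustrative sketch of how the brain path picks its output transport from voice.conversationPolicy.ttsMode; RealtimeTts and OpenAiAudioApiTts below are hypothetical stand-ins, not the real client classes:

```ts
type TtsMode = "realtime" | "api";

// Hypothetical common interface over both output transports.
interface TtsTransport {
  speak(text: string): Promise<void>;
}

function selectTtsTransport(
  ttsMode: TtsMode,
  realtime: TtsTransport, // WebSocket streaming TTS (OpenAI/xAI/Gemini/ElevenLabs)
  audioApi: TtsTransport, // OpenAI Audio API TTS
): TtsTransport {
  // "api" overrides the default realtime transport; everything upstream
  // (generation, tools) is identical either way.
  return ttsMode === "api" ? audioApi : realtime;
}
```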
4. Stage Visibility Matrix
| Stage | Native | Bridge | Brain |
|---|---|---|---|
| Audio capture | yes | yes | yes |
| Transcription | provider-native or bypassed | yes | yes |
| Noise rejection / promotion gates | yes | yes | yes |
| Deterministic admission | yes | yes | yes |
| Classifier admission | no text classifier path | effectively yes | optional |
| Provider-native planning | yes | yes | no |
| Orchestrator text generation | no | no | yes |
| Realtime output transport | yes | yes | yes |
| API TTS override | no | no | yes |
| Voice thought engine | yes | yes | yes |
5. Stage Reference
Stage 1: Capture And Transcription
Canonical public ASR settings:
- voice.transcription.enabled
- voice.transcription.languageMode
- voice.transcription.languageHint
Canonical runtime transport/transcription settings:
- agentStack.runtimeConfig.voice.openaiRealtime.inputAudioFormat
- agentStack.runtimeConfig.voice.openaiRealtime.outputAudioFormat
- agentStack.runtimeConfig.voice.openaiRealtime.transcriptionMethod
- agentStack.runtimeConfig.voice.openaiRealtime.inputTranscriptionModel
- agentStack.runtimeConfig.voice.openaiRealtime.usePerUserAsrBridge
These runtime settings configure bridge and file-turn transcription behavior. OpenAI transport maps configured audio formats onto the nested realtime media descriptors used in session.update payloads: pcm16 becomes audio/pcm, g711_ulaw becomes audio/pcmu, and g711_alaw becomes audio/pcma.
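A sketch of that format mapping; the descriptor shape inside session.update is simplified here for illustration:

```ts
// Configured transport format -> realtime media type, per the mapping above.
const REALTIME_MEDIA_TYPES: Record<string, string> = {
  pcm16: "audio/pcm",
  g711_ulaw: "audio/pcmu",
  g711_alaw: "audio/pcma",
};

function toMediaDescriptor(format: string): { type: string } {
  const type = REALTIME_MEDIA_TYPES[format];
  if (!type) throw new Error(`unsupported realtime audio format: ${format}`);
  return { type };
}
```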
Stage 2: Turn Promotion And Noise Rejection
Before a turn reaches admission, the runtime applies:
- provisional capture promotion
- silence and short-clip filters
- bridge hallucination and ASR-confidence guards where applicable
Relevant modules:
- src/voice/captureManager.ts
- src/voice/turnProcessor.ts
- src/voice/voiceDecisionRuntime.ts
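A hypothetical pre-admission gate shaped after the filters listed above. The thresholds and turn shape are invented for illustration; the real values and types live in the capture/turn modules:

```ts
interface CapturedTurn {
  durationMs: number;
  peakRms: number;        // simple loudness proxy
  asrConfidence?: number; // present on bridge-transcribed turns
}

function passesNoiseGates(turn: CapturedTurn): boolean {
  if (turn.durationMs < 300) return false; // short-clip filter
  if (turn.peakRms < 0.01) return false;   // silence filter
  if (turn.asrConfidence !== undefined && turn.asrConfidence < 0.4) {
    return false;                          // ASR-confidence guard (bridge path)
  }
  return true;
}
```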
Stage 3: Reply Admission
The public admission surface is:
- voice.admission.mode
- voice.admission.musicWakeLatchSeconds
Canonical music playback / wake-latch semantics live in music.md.
This stage is the voice spoke's cost and floor gate. It is not a second conversational policy layer separate from the shared continuity contract. Its job is to decide when a voice turn is eligible to reach the main reply brain under live-room constraints.
Classifier binding is resolved through:
- preset defaults in src/settings/agentStack.ts
- agentStack.overrides.voiceAdmissionClassifier
Important runtime behavior:
- if replyPath = "bridge", the runtime always behaves as classifier-first
- if replyPath = "brain", the public admission mode preserves generation_decides or classifier_gate; generation_decides is the default and classifier_gate is an optional classifier-first cost gate before the main brain
- if replyPath = "native", the canonical public admission mode normalizes to generation_decides
- surviving brain turns are generation-owned by default and the model decides reply vs [SKIP]
- classifier_gate and generation_decides are the canonical public settings values
- internal labels like hard_classifier and generation_only are implementation details used by voiceReplyDecision.ts
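A minimal sketch of the normalization rules above; the helper name and return shape are illustrative, not the actual voiceReplyDecision.ts types:

```ts
type ReplyPath = "native" | "bridge" | "brain";
type PublicAdmissionMode = "generation_decides" | "classifier_gate";

function effectiveAdmission(
  replyPath: ReplyPath,
  mode: PublicAdmissionMode,
): PublicAdmissionMode {
  switch (replyPath) {
    case "bridge":
      return "classifier_gate";    // bridge is always classifier-first
    case "native":
      return "generation_decides"; // native normalizes away the classifier
    case "brain":
      return mode;                 // brain preserves the public setting
  }
}
```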
Stage 4: Generation And Tool Ownership
native and bridge use provider-native planning when the runtime supports it.
brain uses:
- agentStack.runtimeConfig.voice.generation
Tool ownership:
- canonical local tool definitions come from src/tools/toolRegistry.ts and src/tools/sharedToolSchemas.ts
- provider-native voice exports are assembled in src/voice/voiceToolCallToolRegistry.ts from the same local registry, so only tools with realtime executors are mounted
- execution is still centralized in src/voice/voiceToolCallDispatch.ts
- full-brain replies use the shared orchestrator tool loop instead of provider-native replanning, then pass the exact per-turn tool list into the voice prompt
- provider-native sessions emit realtime_tool_call_* events; brain/transport-only sessions emit voice_brain_* events
Turn-context parity:
- src/voice/voiceTurnContext.ts is the shared live-room context builder for both full-brain replies and provider-native realtime instruction refresh
- that shared context normalizes participant roster, recent membership/effect events, native Discord sharers, screen-watch capability, stream-watch notes, compacted session summary, music state, and recent tool outcomes into one prompt-facing shape
- src/voice/voiceMemoryContext.ts applies the same continuity and behavioral-memory loading policy to provider-native instruction refreshes and brain-path generation turns
- src/voice/voiceToolResultSummary.ts is the canonical compact tool-result summary shape for both brain and provider-native tool loops, so follow-up reasoning sees the same post-tool facts even when the transport differs
- provider-native tool completion schedules a live instruction refresh after execution, so the realtime model sees the updated tool outcome summary and room state instead of reasoning from stale pre-tool instructions
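A hypothetical shape for that compact tool-result summary; field names are illustrative, not the actual voiceToolResultSummary.ts contract:

```ts
interface VoiceToolResultSummary {
  toolName: string;
  ok: boolean;
  // One or two prompt-facing sentences describing what the tool did,
  // shared verbatim by brain turns and realtime instruction refreshes.
  summary: string;
  finishedAt: number; // epoch ms, used to order recent tool outcomes
}
```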
Voice tool continuation policy (voiceContinuationPolicy in sharedToolSchemas.ts):
Each tool declares whether the LLM gets a follow-up generation turn after the tool executes. This controls whether tool results are fed back to the LLM for a spoken follow-up.
| Policy | Behavior | Typical tools |
|---|---|---|
| always | Tool result is always fed back to the LLM for follow-up speech. The LLM sees the result (including errors) and can respond. | video_play, music_play, web_search, browser_browse, spawn_code_worker, memory_write |
| fire_and_forget | No follow-up generation. The tool is a silent side-effect; the LLM's speech from the same generation is the complete response. If the LLM needs to say something, it must include text alongside the tool call. | play_soundboard, music_skip, note_context, leave_voice_channel, start_screen_watch |
When speech is dispatched before tools execute (pre-tool flush or sentence streaming), fire_and_forget tools will not produce additional speech on failure. This is intentional — these tools are low-failure side effects where the preamble speech is the complete user-facing response.
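A sketch of the continuation decision after a tool executes. The policy values mirror the table above; the dispatch helper and result shape are hypothetical:

```ts
type VoiceContinuationPolicy = "always" | "fire_and_forget";

async function afterToolRun(
  policy: VoiceContinuationPolicy,
  result: { ok: boolean; summary: string },
  scheduleFollowUpGeneration: (summary: string) => Promise<void>,
): Promise<void> {
  if (policy === "fire_and_forget") {
    // Silent side effect: any speech came from the original generation,
    // so nothing more is said, even if the tool failed.
    return;
  }
  // "always": feed the result (including errors) back for follow-up speech.
  await scheduleFollowUpGeneration(result.summary);
}
```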
Stage 5: Output
Conversation-policy output knobs:
- voice.conversationPolicy.replyPath
- voice.conversationPolicy.ttsMode
- voice.conversationPolicy.streaming.*
API TTS config:
- agentStack.runtimeConfig.voice.openaiAudioApi.ttsModel
- agentStack.runtimeConfig.voice.openaiAudioApi.ttsVoice
- agentStack.runtimeConfig.voice.openaiAudioApi.ttsSpeed
Stage 6: Voice Thought Engine
Canonical cadence settings:
- initiative.voice.enabled
- initiative.voice.eagerness
- initiative.voice.minSilenceSeconds
- initiative.voice.minSecondsBetweenThoughts
This is the voice transport for ambient attention. It is the spoken counterpart to the ambient text cycle, not a separate behavioral system.
Implementation note:
- the thought generator resolves provider/model from the resolved voice-initiative binding (initiative.voice.execution)
Relevant modules:
- src/voice/thoughtEngine.ts
- src/voice/voiceThoughtGeneration.ts
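A sketch of the cadence gates implied by the initiative.voice.* settings; the state shape and helper are illustrative, not the thoughtEngine.ts internals:

```ts
interface ThoughtState {
  lastSpeechAt: number;  // epoch ms of the most recent room speech
  lastThoughtAt: number; // epoch ms of the last proactive thought attempt
}

function mayAttemptThought(
  now: number,
  state: ThoughtState,
  cfg: {
    eagerness: number;               // 0-100 probability gate
    minSilenceSeconds: number;
    minSecondsBetweenThoughts: number;
  },
): boolean {
  // Require sustained silence and minimum spacing before even rolling.
  if ((now - state.lastSpeechAt) / 1000 < cfg.minSilenceSeconds) return false;
  if ((now - state.lastThoughtAt) / 1000 < cfg.minSecondsBetweenThoughts) return false;
  // eagerness acts as a probability gate before generation is invoked.
  return Math.random() * 100 < cfg.eagerness;
}
```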
Stage 7: Soundboard Behavior
Canonical soundboard settings:
- voice.soundboard.eagerness
- voice.soundboard.enabled
- voice.soundboard.allowExternalSounds
- voice.soundboard.preferredSoundIds
Implementation note:
- voice.soundboard.eagerness is prompt context, not a hard gate. Lower values push the runtime toward restraint; higher values let it use Discord sound effects more playfully when the joke lands.
- play_soundboard is the canonical soundboard mechanism on provider-native native and bridge sessions. Those sessions should not emit [[SOUNDBOARD:...]] markup in spoken replies.
- The canonical precise timing mechanism on the brain path is inline [[SOUNDBOARD:<sound_ref>]] control markup in the model text. The runtime parses those directives into an ordered speech/soundboard sequence (see the parsing sketch after this list).
- Buffered brain playback routes the whole reply through that ordered sequencer.
- Streamed brain playback reuses the same ordered sequencer chunk-by-chunk. This supports speech -> soundboard -> speech timing inside streamed replies, but playback remains serialized rather than mixed.
- Normal streamed chunk emission waits for the configured minimum completed sentences per chunk before dispatch. maxBufferChars and the final flush still force delivery so long run-ons and short tails do not stall playback.
- The default brain streaming settings are intentionally prosody-biased, not minimum-latency-biased. minSentencesPerChunk=2 and a sentence-coherent first chunk exist so realtime exact-line playback sounds like one continuous thought instead of a run of tiny restarty utterances.
- If a deployment needs faster first-byte latency on slow model/tool turns, prefer a per-turn timeout fallback that relaxes chunking after a latency budget rather than lowering these defaults globally.
- In realtime streaming, any chunk that contains inline soundboard directives is treated as a strict output barrier. Earlier queued or buffered assistant speech must finish before that chunk continues, so the soundboard beat cannot jump ahead of the speech it belongs to.
- When a streamed realtime speech step precedes an inline soundboard beat, the completion wait is request-scoped. Tail flags like botTurnOpen do not hold the beat after that specific utterance has already finished draining.
- Parsing inline refs out of provider-native output transcripts remains a compatibility fallback, not the primary timing path.
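A minimal sketch of parsing inline [[SOUNDBOARD:<sound_ref>]] markup into the ordered speech/soundboard sequence described above, assuming a simple regex pass; the step shape and helper name are illustrative, not the actual sequencer types:

```ts
type SequenceStep =
  | { kind: "speech"; text: string }
  | { kind: "soundboard"; soundRef: string };

function parseSoundboardMarkup(reply: string): SequenceStep[] {
  const steps: SequenceStep[] = [];
  const pattern = /\[\[SOUNDBOARD:([^\]]+)\]\]/g;
  let cursor = 0;
  for (const match of reply.matchAll(pattern)) {
    const idx = match.index ?? 0;
    // Speech text before the directive becomes its own ordered step.
    const before = reply.slice(cursor, idx).trim();
    if (before) steps.push({ kind: "speech", text: before });
    steps.push({ kind: "soundboard", soundRef: match[1] });
    cursor = idx + match[0].length;
  }
  const tail = reply.slice(cursor).trim();
  if (tail) steps.push({ kind: "speech", text: tail });
  return steps;
}
```

The key property is ordering: each directive splits the surrounding speech so the sequencer can serialize speech -> soundboard -> speech exactly as the model wrote it.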
6. Settings Reference
Shared Activity Axes
| Setting | Default | Meaning |
|---|---|---|
| interaction.activity.responseWindowEagerness | 55 | How strongly recent engagement is framed for voice follow-up prompting/classification; the core voice recency window is still runtime-owned |
| interaction.activity.reactivity | 40 | Shared tendency for emoji beats and other lightweight reactions |
Conversation Policy
| Setting | Default | Meaning |
|---|---|---|
| voice.conversationPolicy.ambientReplyEagerness | 50 | Ambient voice reply willingness when not directly addressed |
| voice.conversationPolicy.commandOnlyMode | false | Restrict replies toward command/wake interactions |
| voice.conversationPolicy.allowNsfwHumor | true | Voice tone guardrail input |
| voice.conversationPolicy.textOnlyMode | false | Disable voice output while still processing turns |
| voice.conversationPolicy.defaultInterruptionMode | "speaker" | Default barge-in target |
| voice.conversationPolicy.replyPath | "brain" | native, bridge, or brain |
| voice.conversationPolicy.ttsMode | "realtime" | realtime or api output |
| voice.conversationPolicy.thinking | "disabled" | Brain-path thinking mode (disabled, enabled, think_aloud) |
| voice.conversationPolicy.thinkingBudgetTokens | 1024 | Token budget for Anthropic/Claude thinking when thinking is enabled |
| voice.conversationPolicy.streaming.enabled | true | Enables streamed speech chunks on brain path |
| voice.conversationPolicy.streaming.minSentencesPerChunk | 2 | Minimum completed sentences before a normal streamed chunk emits |
| voice.conversationPolicy.streaming.eagerFirstChunkChars | 30 | Minimum buffered chars before the first streamed chunk can emit eagerly |
| voice.conversationPolicy.streaming.maxBufferChars | 300 | Forced break size when streaming text grows too large without a clean chunk boundary |
Dashboard placement note:
- Voice thinking controls are shown in Voice Mode -> Output -> Brain, and only when the active Brain provider resolves to an Anthropic/Claude-style lane (anthropic, claude-oauth, ai_sdk_anthropic).
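As an illustrative reading of the streaming knobs above; the helper and its naive sentence detection are a sketch, not the real brain-path chunker:

```ts
interface StreamingConfig {
  minSentencesPerChunk: number; // default 2
  eagerFirstChunkChars: number; // default 30
  maxBufferChars: number;       // default 300
}

function shouldEmitChunk(
  buffer: string,
  isFirstChunk: boolean,
  cfg: StreamingConfig,
): boolean {
  // Crude sentence count: terminators followed by whitespace or end of buffer.
  const sentences = buffer.match(/[.!?](\s|$)/g)?.length ?? 0;
  if (buffer.length >= cfg.maxBufferChars) return true; // forced break
  if (isFirstChunk && buffer.length >= cfg.eagerFirstChunkChars && sentences >= 1) {
    return true; // eager but sentence-coherent first chunk
  }
  return sentences >= cfg.minSentencesPerChunk; // normal prosody-biased cadence
}
```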
Soundboard Policy
| Setting | Default | Meaning |
|---|---|---|
| voice.soundboard.eagerness | 40 | How readily the bot should use Discord soundboard beats when they fit |
| voice.soundboard.enabled | true | Enable Discord soundboard playback in live voice sessions |
| voice.soundboard.allowExternalSounds | false | Allow refs that target sounds from another guild |
| voice.soundboard.preferredSoundIds | [] | Preferred refs to expose before falling back to the live guild catalog |
Admission
| Setting | Default | Meaning |
|---|---|---|
| voice.admission.mode | "generation_decides" | Public admission mode |
| voice.admission.musicWakeLatchSeconds | 30 | Wake follow-up window during music playback |
Classifier provider/model are resolved from preset defaults or agentStack.overrides.voiceAdmissionClassifier.
Transcription
| Setting | Default | Meaning |
|---|---|---|
| voice.transcription.enabled | true | Master ASR toggle |
| voice.transcription.languageMode | "auto" | Auto or fixed language mode |
| voice.transcription.languageHint | "en" | Language hint for fixed/biased transcription |
Voice Runtime Config
| Setting | Default | Meaning |
|---|---|---|
| agentStack.runtimeConfig.voice.runtimeMode | "openai_realtime" | Realtime runtime family |
| agentStack.runtimeConfig.voice.openaiRealtime.model | "gpt-realtime" | OpenAI realtime model |
| agentStack.runtimeConfig.voice.openaiRealtime.voice | "ash" | OpenAI realtime voice |
| agentStack.runtimeConfig.voice.openaiRealtime.inputAudioFormat | "pcm16" | OpenAI realtime input transport format |
| agentStack.runtimeConfig.voice.openaiRealtime.outputAudioFormat | "pcm16" | OpenAI realtime output transport format |
| agentStack.runtimeConfig.voice.openaiRealtime.transcriptionMethod | "realtime_bridge" | Bridge vs file-turn transcription mode |
| agentStack.runtimeConfig.voice.openaiRealtime.inputTranscriptionModel | "gpt-4o-mini-transcribe" | Realtime ASR model |
| agentStack.runtimeConfig.voice.openaiRealtime.usePerUserAsrBridge | true | Per-speaker bridge mode |
| agentStack.runtimeConfig.voice.openaiAudioApi.ttsModel | "gpt-4o-mini-tts" | API TTS model |
| agentStack.runtimeConfig.voice.openaiAudioApi.ttsVoice | "alloy" | API TTS voice |
| agentStack.runtimeConfig.voice.openaiAudioApi.ttsSpeed | 1 | API TTS speed |
| agentStack.runtimeConfig.voice.generation | dedicated model policy | Brain-path text generation binding |
Session Limits
| Setting | Default | Meaning |
|---|---|---|
| voice.sessionLimits.maxSessionMinutes | 30 | Max session duration |
| voice.sessionLimits.inactivityLeaveSeconds | 300 | Auto-leave inactivity timer |
| voice.sessionLimits.maxSessionsPerDay | 120 | Daily session cap |
| voice.sessionLimits.maxConcurrentSessions | 3 | Concurrency cap |
Voice Thought Engine
| Setting | Default | Meaning |
|---|---|---|
| initiative.voice.enabled | true | Enable proactive voice thoughts |
| initiative.voice.eagerness | 50 | Probability gate before thought generation |
| initiative.voice.minSilenceSeconds | 45 | Required silence before a thought attempt |
| initiative.voice.minSecondsBetweenThoughts | 60 | Minimum spacing between thought attempts |
7. Provider Capabilities
Current runtime families:
| Runtime | Typical provider | Notes |
|---|---|---|
| openai_realtime | OpenAI | Supports native, bridge, and brain transports |
| voice_agent | xAI | Shipped native path via grok_native_agent preset |
| gemini_realtime | Gemini | Realtime transport/runtime family |
| elevenlabs_realtime | ElevenLabs | Full-brain runtime with WebSocket streaming TTS (ElevenLabsRealtimeClient), shared ASR bridge, and optional file-turn transcription |
Provider differences live in thin adapters. The higher-level product behavior stays in shared orchestration, prompts, and tool execution.
8. Source Files
- src/settings/agentStack.ts
- src/settings/settingsSchema.ts
- src/voice/voiceConfigResolver.ts
- src/voice/voiceReplyDecision.ts
- src/voice/turnProcessor.ts
- src/voice/sessionLifecycle.ts
- src/voice/voiceToolCallDispatch.ts
- src/voice/voiceThoughtGeneration.ts
- src/voice/elevenLabsRealtimeClient.ts
