Voice Capture and ASR Pipeline
Scope: Per-user audio capture lifecycle and ASR transcription — from Discord speaking event through promotion, ASR bridge transcription, and handoff to the turn processor.

Related documents:
- Voice pipeline stages: voice-provider-abstraction.md
- Output and barge-in: voice-output-and-barge-in.md
- Reply orchestration: voice-client-and-reply-orchestration.md
- Cross-cutting settings contract: ../reference/settings.md
Part 1: Audio Capture
Persistence, preset inheritance, dashboard envelope shape, and save/version semantics live in ../reference/settings.md. This document keeps the capture lifecycle, promotion thresholds, ASR handoff, and related voice-local settings scoped to the audio pipeline itself.
This part defines the per-user audio capture state machine. Each user who speaks in a voice session gets an independent CaptureState that tracks their audio from the first Discord speaking event through promotion, finalization, and handoff to the turn processor.
1. Source of Truth
The VoiceSession owns the authoritative capture state in session.userCaptures: Map<string, CaptureState>.
- A user has an active capture if and only if they have an entry in this map.
- There is no separate `speakingUsers` set — `userCaptures` is the single source of truth for who is currently being captured. `session.userCaptures.size` is the active capture count.
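The map-as-source-of-truth invariant can be sketched in a few lines (helper names here are illustrative, not from the codebase):

```typescript
// Hypothetical sketch: userCaptures doubles as the "who is speaking" set.
// CaptureState is reduced to the single field needed for this illustration.
interface CaptureState {
  promotedAt: number; // 0 = provisional, > 0 = promoted
}

const userCaptures = new Map<string, CaptureState>();

// A user is "being captured" iff they have a map entry — no parallel set.
function hasActiveCapture(userId: string): boolean {
  return userCaptures.has(userId);
}

function activeCaptureCount(): number {
  return userCaptures.size;
}

userCaptures.set("user-a", { promotedAt: 0 });
```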
External systems provide signals but do not own capture state:
- The `clankvox` subprocess provides `speakingStart`, `speakingEnd`, `userAudio`, and `userAudioEnd` IPC events.
- OpenAI ASR provides `server_vad` speech detection confirmations.
- The capture manager derives one canonical lifecycle from those signals.
Code:
- `src/voice/captureManager.ts` — capture lifecycle, audio ingestion, promotion, finalization
- `src/voice/sessionLifecycle.ts` — speaking event handlers, timer management
- `src/voice/voiceSessionTypes.ts` — `CaptureState` and `VoiceSession` type definitions
- `src/voice/voiceAudioAnalysis.ts` — PCM signal analysis functions
2. Phases
Each CaptureState progresses through a linear lifecycle:
| Phase | Meaning | Key Indicator |
|---|---|---|
provisional | Audio is being buffered but speech has not been confirmed | promotedAt === 0 |
promoted | Speech confirmed by server VAD or strong local signal; turn is "real" | promotedAt > 0 |
finalizing | Speaking ended; awaiting finalization timer or stream end | speakingEndFinalizeTimer !== null |
finalized | Capture complete; PCM handed to turn processor | Entry removed from userCaptures |
A capture that never promotes is silently discarded — no ASR call, no LLM cost.
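Because `finalized` captures are simply removed from the map, the phase can be derived from state fields rather than stored. A minimal sketch, assuming the indicator fields from the table above:

```typescript
// Sketch: derive the lifecycle phase from CaptureState indicators.
// Field names come from the phase table; the helper itself is illustrative.
interface CaptureStateView {
  promotedAt: number;                      // 0 while provisional
  speakingEndFinalizeTimer: object | null; // non-null while finalizing
}

type CapturePhase = "provisional" | "promoted" | "finalizing";

// "finalized" never appears: that capture's entry is already gone
// from userCaptures, so there is nothing to derive a phase from.
function deriveCapturePhase(c: CaptureStateView): CapturePhase {
  if (c.speakingEndFinalizeTimer !== null) return "finalizing";
  return c.promotedAt > 0 ? "promoted" : "provisional";
}
```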
3. Authoritative vs Heuristic Signals
| Signal | Role | Notes |
|---|---|---|
captureState.promotedAt | authoritative | 0 = provisional, > 0 = promoted. The single source of truth for promotion status. |
captureState.bytesSent | authoritative | Total PCM bytes accumulated. Used for minimum clip duration checks. |
captureState.signalPeakAbs | authoritative | Peak absolute sample value (monotonic max). Promotion and barge-in input. |
captureState.signalActiveSampleCount / signalSampleCount | authoritative | Active sample ratio = activeSampleCount / sampleCount. Core promotion metric. |
captureState.signalSumSquares | authoritative | RMS computation input. |
captureState.pcmChunks | authoritative | Raw PCM buffer array. Concatenated on finalization. |
asrState.speechDetectedUtteranceId | signal (from ASR) | Server VAD confirmation. Contributes to server_vad_confirmed promotion. Must match captureState.asrUtteranceId. See Part 2, Section 13. |
asrState.speechDetectedAt | signal (from ASR) | Timestamp of server VAD speech detection. |
speakingEndFinalizeTimer | lifecycle timer | Adaptive delay between Discord speakingEnd and capture finalization. Not a state indicator. |
idleFlushTimer | lifecycle timer | Fires when no audio arrives for a threshold period. |
maxFlushTimer | lifecycle timer | Hard cap at CAPTURE_MAX_DURATION_MS (8s). Prevents unbounded captures. |
session.lastInboundAudioAt | derived | Updated on promotion and subsequent audio. Used by reply decision for silence timing. Not part of capture state. |
4. Promotion Signals
Promotion is evaluated on every incoming audio chunk in onUserAudio. Two independent signals can trigger promotion:
| Signal | Criteria | Constants |
|---|---|---|
server_vad_confirmed | Server VAD fired for this utterance (speechDetectedUtteranceId === captureState.asrUtteranceId) AND activeSampleRatio >= 0.02 AND peak >= 0.016 AND bytesSent >= minPromotionBytes | VOICE_TURN_PROMOTION_ACTIVE_RATIO_MIN (0.02), VOICE_TURN_PROMOTION_PEAK_MIN (0.016), VOICE_TURN_PROMOTION_MIN_CLIP_MS (420) |
strong_local_audio | activeSampleRatio >= 0.14 AND peak >= 0.06 AND rms >= 0.008 AND bytesSent >= minPromotionBytes | VOICE_TURN_PROMOTION_STRONG_LOCAL_ACTIVE_RATIO_MIN (0.14), VOICE_TURN_PROMOTION_STRONG_LOCAL_PEAK_MIN (0.06), VOICE_TURN_PROMOTION_STRONG_LOCAL_RMS_MIN (0.008) |
The hybrid design is deliberate:
- Server VAD rejects ambient noise (TV, room noise) better than fixed local thresholds
- Local fallback ensures clearly strong speech promotes even if server VAD is delayed
- `server_vad_confirmed` has lower local thresholds because the server already validated speech
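The two-signal evaluation can be sketched as follows, using the constants from the table; the input shape and function name are assumptions for illustration:

```typescript
// Sketch of the per-chunk promotion check. Thresholds mirror the
// constants table; the PromotionInput shape is hypothetical.
const ACTIVE_RATIO_MIN = 0.02;        // VOICE_TURN_PROMOTION_ACTIVE_RATIO_MIN
const PEAK_MIN = 0.016;               // VOICE_TURN_PROMOTION_PEAK_MIN
const STRONG_ACTIVE_RATIO_MIN = 0.14; // ..._STRONG_LOCAL_ACTIVE_RATIO_MIN
const STRONG_PEAK_MIN = 0.06;         // ..._STRONG_LOCAL_PEAK_MIN
const STRONG_RMS_MIN = 0.008;         // ..._STRONG_LOCAL_RMS_MIN

interface PromotionInput {
  serverVadMatched: boolean; // speechDetectedUtteranceId === asrUtteranceId
  activeSampleRatio: number;
  peak: number;
  rms: number;
  bytesSent: number;
  minPromotionBytes: number; // derived from VOICE_TURN_PROMOTION_MIN_CLIP_MS
}

type PromotionReason = "server_vad_confirmed" | "strong_local_audio" | null;

function evaluatePromotion(s: PromotionInput): PromotionReason {
  if (s.bytesSent < s.minPromotionBytes) return null;
  // Server VAD already validated speech, so local thresholds are lower.
  if (s.serverVadMatched && s.activeSampleRatio >= ACTIVE_RATIO_MIN && s.peak >= PEAK_MIN) {
    return "server_vad_confirmed";
  }
  // Local fallback for clearly strong speech when server VAD is delayed.
  if (s.activeSampleRatio >= STRONG_ACTIVE_RATIO_MIN && s.peak >= STRONG_PEAK_MIN && s.rms >= STRONG_RMS_MIN) {
    return "strong_local_audio";
  }
  return null;
}
```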
Promotion side effects:
- Does not hold, cancel, or supersede pending pre-audio assistant speech by itself
- Begins the shared ASR utterance (in shared ASR mode) and flushes buffered PCM
- Updates `session.lastInboundAudioAt`
- Emits a `voice_activity_started` log event
5. Transition Rules
| Event | From | To |
|---|---|---|
clankvox speakingStart | (no capture) | provisional |
| Audio chunk with promotion criteria met | provisional | promoted |
clankvox speakingEnd (arms finalize timer) | promoted | finalizing |
clankvox speakingStart again (same user, clears timer) | finalizing | promoted |
New userAudio during finalize timer (clears timer) | finalizing | promoted |
| Finalize timer fires | finalizing | finalized → turn processor |
clankvox userAudioEnd | promoted / finalizing | finalized → turn processor |
| Idle flush timer fires | promoted | finalized → turn processor |
| Max duration timer fires (8s) | any active | ASR commit → transcript banked (no generation) |
| Near-silence early abort (age >= 1s, signal below threshold) | provisional | discarded |
clankvox speakingEnd timer fires on unpromotable capture | provisional | discarded |
Explicit abort (abortActiveInboundCaptures) | any active | discarded |
clankvox clientDisconnect | any active | finalized (if promoted) or discarded |
clankvox speakingEnd (no promotion, discard) | provisional | discarded |
Discard conditions (never reaches turn processor)
| Condition | Log Event |
|---|---|
| Never promoted (signal too weak for any promotion signal) | voice_turn_dropped_provisional_capture |
| Zero bytes sent (no audio data received) | voice_turn_skipped_empty_capture |
| Silence gate (aggregated PCM fails RMS/peak/activeRatio thresholds) | voice_turn_dropped_silence_gate |
| Near-silence early abort (age >= 1s, very low signal) | voice_turn_dropped_provisional_capture with near_silence_early_abort reason |
The silence gate's VOICE_SILENCE_GATE_MIN_CLIP_MS (280ms) and promotion's VOICE_TURN_PROMOTION_MIN_CLIP_MS (420ms) serve different purposes: the silence gate asks "is this audio at all?" and drops pure silence before wasting an ASR call, while promotion asks "is this speech worth processing as a turn?" A 300ms clip of faint noise correctly passes the silence gate (it's not silent) but correctly fails promotion (it's not a real utterance).
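That distinction can be made concrete with a small sketch. Only the two MIN_CLIP_MS constants come from this document; the per-gate signal floor is an illustrative assumption:

```typescript
// Sketch of the two-gate distinction. The clip-length constants are from
// the document; ILLUSTRATIVE_PEAK_FLOOR is an assumption for the example.
const VOICE_SILENCE_GATE_MIN_CLIP_MS = 280;
const VOICE_TURN_PROMOTION_MIN_CLIP_MS = 420;

// "Is this audio at all?" — drops pure silence before any ASR call is made.
function passesSilenceGate(clipMs: number, peak: number): boolean {
  const ILLUSTRATIVE_PEAK_FLOOR = 0.005; // assumed, not from the source
  return clipMs >= VOICE_SILENCE_GATE_MIN_CLIP_MS && peak >= ILLUSTRATIVE_PEAK_FLOOR;
}

// "Is this speech worth a turn?" — the stricter duration requirement.
function meetsPromotionClipLength(clipMs: number): boolean {
  return clipMs >= VOICE_TURN_PROMOTION_MIN_CLIP_MS;
}

// The example from the text: a 300ms clip of faint noise is not silent,
// but it is also not a real utterance.
const clipMs = 300;
const passesGate = passesSilenceGate(clipMs, 0.02);  // not silent
const promotable = meetsPromotionClipLength(clipMs); // too short for a turn
```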
6. Cross-Domain State Reads
The capture subsystem reads state from other subsystems at these points:
| Subsystem | State Read | Where | Purpose |
|---|---|---|---|
| Assistant Output | assistantOutput.phase via getOutputChannelState | onSpeakingStart (suppression check) | Don't start captures while a web lookup keeps the output channel busy |
| Assistant Output | botTurnOpen, botTurnOpenAt | bargeInController.shouldBargeIn | Echo guard — don't barge in within 1500ms of bot speech start |
| Assistant Output | hasRecentAssistantAudioDelta, hasBufferedTtsPlayback | bargeInController.shouldBargeIn | Check if bot is actively streaming audio |
| Music | musicActive via getOutputChannelState | bargeInController.isBargeInInterruptTargetActive | Don't trigger barge-in on music-only output |
| ASR Bridge | speechDetectedUtteranceId, speechDetectedAt | hasCaptureServerVadSpeech | Server VAD confirmation for promotion. See Part 2, Section 13. |
| Session Lifecycle | session.ending | All capture operations | Bail out when session is ending |
| Session Identity | client.user?.id | onSpeakingStart | Ignore bot's own speaking events |
| Barge-In | bargeInSuppressionUntil | isBargeInOutputSuppressed | Suppress outbound audio after barge-in |
Timing-sensitivity notes
All signal metrics (bytesSent, signalSampleCount, signalActiveSampleCount, signalPeakAbs, signalSumSquares) are updated synchronously in the onUserAudio hot path. This is essential because:
- Promotion checks run on the same tick as audio accumulation
- Barge-in signal assertions read these metrics synchronously
- Multiple users can have overlapping audio chunks on the same event loop tick
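The synchronous accumulation described above can be sketched as one pass over each PCM chunk (the per-sample activity floor here is an assumed value, not a documented constant):

```typescript
// Sketch of the onUserAudio hot-path metric accumulation. Field names
// mirror the signal table; ACTIVE_SAMPLE_THRESHOLD is illustrative.
interface SignalMetrics {
  bytesSent: number;
  signalSampleCount: number;
  signalActiveSampleCount: number;
  signalPeakAbs: number; // monotonic max
  signalSumSquares: number;
}

const ACTIVE_SAMPLE_THRESHOLD = 0.01; // assumed per-sample activity floor

// A 16-bit PCM chunk is folded into the running metrics on the same tick,
// so a promotion check immediately after sees up-to-date values.
function accumulateChunk(m: SignalMetrics, pcm: Int16Array): void {
  m.bytesSent += pcm.length * 2; // 2 bytes per 16-bit sample
  for (const sample of pcm) {
    const v = Math.abs(sample) / 32768; // normalize to [0, 1]
    m.signalSampleCount += 1;
    if (v >= ACTIVE_SAMPLE_THRESHOLD) m.signalActiveSampleCount += 1;
    if (v > m.signalPeakAbs) m.signalPeakAbs = v;
    m.signalSumSquares += v * v;
  }
}

function activeRatio(m: SignalMetrics): number {
  return m.signalSampleCount === 0 ? 0 : m.signalActiveSampleCount / m.signalSampleCount;
}

function rms(m: SignalMetrics): number {
  return m.signalSampleCount === 0 ? 0 : Math.sqrt(m.signalSumSquares / m.signalSampleCount);
}
```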
7. The Turn Output
When a promoted capture finalizes, the concatenated PCM buffer is routed based on session mode:
Realtime Session With File ASR Override
turnProcessor.queueFileAsrTurn({ session, userId, pcmBuffer, captureReason })
Realtime Mode (with ASR bridge)
`captureManager.runAsrBridgeCommit()` → `commitAsrUtterance()` → `queueRealtimeTurnFromAsrBridge()` → `turnProcessor.queueRealtimeTurn()` with transcript overrides
Per-user ASR keeps the provider's committed realtime item_id bound to the utterance object that issued the commit. Late final transcript events therefore stay attached to the correct committed turn even if a fresh provisional capture starts before the provider finishes streaming the transcript.
If assistant speech is already active and transcript-overlap interrupts are enabled, queueRealtimeTurnFromAsrBridge() does not always forward the finalized turn immediately. Instead it may stage that ASR result behind an overlap burst decision:
- `pending` burst decision: keep the finalized bridge turn in a per-utterance staging map
- `interrupt`: cut assistant output, then flush the staged turn into the normal realtime queue
- `ignore`: drop the staged turn entirely so laughter/backchannel does not become a user turn
See Part 2: ASR Bridge for the full commit and transcript resolution flow.
Realtime Mode (without ASR bridge)
`turnProcessor.queueRealtimeTurn({ session, userId, pcmBuffer, captureReason })` (turn processor runs its own ASR)
The RealtimeQueuedTurn contains the PCM buffer, capture metadata, and (if ASR bridge was active) pre-computed transcript, logprobs, and timing data.
max_duration as Chunking (Not Turn Boundary)
max_duration finalization commits the ASR audio buffer to get a transcript back, but does NOT push the turn into the generation queue. Instead, the transcript is banked and merged with subsequent chunks until a real speech-end signal arrives.
Flow:
```
max_duration fires
  → ASR commit → transcript banked in accumulator
  → DO NOT queue for generation
  → wait for next finalization event

next finalization (stream_end / speaking_end / another max_duration)
  → if stream_end or speaking_end:
    → merge accumulated transcripts + this chunk's transcript
    → turnProcessor.queueRealtimeTurn() with merged transcript
  → if another max_duration:
    → bank this chunk too, keep waiting
```
Key behaviors:
- `max_duration` commits to ASR are still sent — the OpenAI buffer needs to be committed so transcription can run. This is the "chunking" role.
- Banked transcripts are stored per-user on the capture or ASR bridge state. Each chunk's transcript is appended in order.
- Only a real speech-end signal (`stream_end`, `speaking_end`, `userAudioEnd`) triggers generation with the merged transcript.
- If the user disconnects or the session ends, banked transcripts are flushed as a final turn.
- The `captureReason` on the queued turn should reflect the final trigger (e.g., `stream_end`), not `max_duration`.
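The banking behavior can be sketched as a small per-user accumulator (names here are illustrative; the real accumulator lives on capture or ASR bridge state):

```typescript
// Sketch of max_duration transcript banking. A max_duration chunk banks
// its transcript; a real speech-end merges everything into one turn.
type FinalizeReason = "max_duration" | "stream_end" | "speaking_end" | "idle_flush" | "disconnect";

interface QueuedTurn {
  transcript: string;
  captureReason: FinalizeReason;
}

const banked = new Map<string, string[]>(); // per-user banked chunk transcripts

// Returns the turn to queue for generation, or null if the chunk was banked.
function onChunkFinalized(
  userId: string,
  transcript: string,
  reason: FinalizeReason,
): QueuedTurn | null {
  const prior = banked.get(userId) ?? [];
  if (reason === "max_duration") {
    // Chunking, not a turn boundary: bank and keep waiting.
    banked.set(userId, [...prior, transcript]);
    return null;
  }
  // Real speech-end: merge banked chunks with this one and queue generation.
  banked.delete(userId);
  return {
    transcript: [...prior, transcript].filter(Boolean).join(" "),
    captureReason: reason, // the final trigger, never max_duration
  };
}
```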
Edge cases:
- User speaks for 20s (two max_duration chunks + stream_end): three ASR commits, three partial transcripts banked, one merged turn queued on stream_end.
- User speaks for 8s and goes silent (max_duration, then idle flush): max_duration banks, idle flush triggers generation with accumulated content.
- User speaks for 8s and disconnects: max_duration banks, disconnect flushes as final turn.
Motivation: Without this, a user mid-sentence at the 8s cap gets a reply to an incomplete utterance:
```
01:06:36 voice_turn_finalized reason=max_duration transcript="Um, can you play me..."
01:06:37 voice_turn_addressing allow=true (generation starts on incomplete sentence)
01:06:40 voice_turn_finalized reason=stream_end transcript="On eBay" (continuation arrives)
01:06:40 realtime_reply_requested replyText="oh yeah what do you want me to throw on"
(bot replies to fragment, ignores real intent)
```
Room-Coalesce Flush on Capture Cleanup
When cleanupCapture() removes a user from userCaptures, it checks whether that was the last active capture in the session. If the pending realtime turn queue has held turns (from room-aware coalescing — see voice-client-and-reply-orchestration.md), the flush fires immediately via flushHeldRoomCoalesceTurns().
max_duration exception: captures that finalize due to hitting the 8s CAPTURE_MAX_DURATION_MS cap do NOT trigger the room-coalesce flush. The user is still speaking — a new capture will start on the next speakingStart event and eventually finalize with a real speech-end reason. Flushing held turns at the 8s boundary would defeat the purpose of room-aware coalescing by processing turns without the full room context.
The `cleanupCapture(reason)` function passes the finalization reason through and skips the flush when `reason === "max_duration"`.
8. Speaking End Debounce
The speakingEndFinalizeTimer uses an adaptive delay (resolveSpeakingEndFinalizeDelayMs) that scales based on system load:
- More active captures → longer delay (avoids premature finalization during multi-speaker crosstalk)
- Turn backlog → longer delay (system is busy processing previous turns)
- Base delay is short for responsive single-speaker interaction
If speakingStart fires again for the same user, or new userAudio arrives during the timer window, the timer is cleared and the capture continues accumulating audio.
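A sketch of this adaptive delay is below. Only the behavior (load lengthens the delay, base stays short) comes from this document; every constant and the scaling shape are assumptions:

```typescript
// Illustrative sketch of an adaptive speaking-end finalize delay.
// All numeric values are assumed; the real resolveSpeakingEndFinalizeDelayMs
// may scale differently.
function resolveSpeakingEndFinalizeDelayMsSketch(
  activeCaptureCount: number,
  turnBacklog: number,
): number {
  const BASE_MS = 300;        // short for responsive single-speaker turns
  const PER_CAPTURE_MS = 150; // crosstalk: more speakers, more patience
  const PER_BACKLOG_MS = 200; // busy system: avoid piling on more turns
  const MAX_MS = 1500;        // keep the delay bounded
  const extraCaptures = Math.max(0, activeCaptureCount - 1);
  return Math.min(
    MAX_MS,
    BASE_MS + extraCaptures * PER_CAPTURE_MS + turnBacklog * PER_BACKLOG_MS,
  );
}
```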
9. Incident Debugging
When a user speaks but no turn reaches the brain:
- Check for `voice_turn_dropped_provisional_capture` — the capture never promoted. Look at promotion thresholds vs actual signal metrics.
- Check for `voice_turn_dropped_silence_gate` — the aggregated PCM was too quiet.
- Check for `voice_turn_skipped_empty_capture` — no audio data was received from the subprocess.
- If a promoted turn still shows `voice_realtime_transcription_empty`, inspect `trackedUtteranceId`, `activeUtteranceId`, `finalSegmentCount`, and `partialChars` on that event before blaming provisional capture. A later provisional drop can belong to a different weak follow-on capture.
- If none of the above, check `voice_activity_started` for promotion confirmation, then look downstream at noise rejection gates (logprob confidence, bridge fallback hallucination).
When captures are too aggressive (noise triggers turns):
- Check the promotion reason — `strong_local_audio` with low actual signal suggests threshold tuning is needed.
- Check whether server VAD is active — `server_vad_confirmed` should catch most ambient noise.
- Check the near-silence early abort — if it is not firing, the thresholds may need lowering.
10. Regression Tests
These cases should remain covered:
- Provisional captures that never promote are silently discarded without ASR cost
- `server_vad_confirmed` promotion requires both a server VAD match AND local signal thresholds
- `strong_local_audio` promotion fires without server VAD when the signal is clearly strong
- `speakingEnd` → `speakingStart` within the debounce window continues the same capture
- Max duration timer (8s) commits ASR but banks the transcript without queuing generation
- Banked transcripts merge correctly on a subsequent stream_end
- Multiple max_duration chunks accumulate and merge in order
- Idle flush after max_duration triggers generation with banked content
- Disconnect after max_duration flushes banked content
- Near-silence early abort fires at 1s for very weak signal
- Promoted captures that fail the silence gate are still dropped (redundant safety net)
- `abortActiveInboundCaptures` cleanly tears down all active captures
- System speech (thoughts) is cancelled on capture promotion
Current coverage:
- `src/voice/voiceAudioAnalysis.test.ts` (signal analysis functions)
- `src/voice/voiceSessionManager.lifecycle.test.ts` (integration scenarios)
Part 2: ASR Bridge
This part defines the ASR (Automatic Speech Recognition) bridge state machine. Each ASR bridge session (AsrBridgeState) manages a WebSocket connection to OpenAI's Realtime Transcription API, buffers audio during connection delays, and resolves transcripts for finalized captures.
11. Source of Truth
Each VoiceSession owns its ASR state through:
- `session.openAiAsrSessions: Map<string, AsrBridgeState>` — per-user ASR sessions (one WebSocket per active speaker)
- `session.openAiSharedAsrState: AsrBridgeState | null` — single shared ASR session (one WebSocket for all speakers)
- `session.perUserAsrEnabled: boolean` — snapshot at join time
- `session.sharedAsrEnabled: boolean` — snapshot at join time
The AsrBridgeState is the core per-session state object. It tracks the WebSocket lifecycle, audio buffering, utterance state, and transcript resolution.
Code:
- `src/voice/voiceAsrBridge.ts` — ASR session lifecycle, audio streaming, commit/transcript resolution
- `src/voice/openaiRealtimeTranscriptionClient.ts` — WebSocket client for the OpenAI Realtime Transcription API
- `src/voice/voiceConfigResolver.ts` — ASR mode resolution from settings
- `src/voice/captureManager.ts` — integration between capture lifecycle and ASR
12. Phases
Each AsrBridgeState has a phase field tracking its WebSocket lifecycle:
| Phase | Meaning |
|---|---|
idle | No WebSocket connection. Initial state and post-teardown state. |
connecting | WebSocket is opening. Audio is buffered in pendingAudioChunks. |
ready | WebSocket is open and accepting audio. Pending audio flushed on transition. |
committing | A transcript commit is in progress (commitInputAudioBuffer sent, awaiting response). |
closing | WebSocket is being torn down. |
Phase query helpers: `asrPhaseCanAcceptAudio` (connecting or ready), `asrPhaseIsConnected` (ready or committing), `asrPhaseCanCommit` (ready), `asrPhaseIsCommitting` (committing), `asrPhaseIsClosing` (closing).
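The helpers map directly onto the phase table; a sketch of their predicates (the real helpers may take the bridge state rather than a bare phase string):

```typescript
// Sketch of the phase query predicates from the table above.
type AsrPhase = "idle" | "connecting" | "ready" | "committing" | "closing";

const asrPhaseCanAcceptAudio = (p: AsrPhase) => p === "connecting" || p === "ready";
const asrPhaseIsConnected = (p: AsrPhase) => p === "ready" || p === "committing";
const asrPhaseCanCommit = (p: AsrPhase) => p === "ready";
const asrPhaseIsCommitting = (p: AsrPhase) => p === "committing";
const asrPhaseIsClosing = (p: AsrPhase) => p === "closing";
```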
13. Authoritative vs Heuristic Signals
| Signal | Role | Notes |
|---|---|---|
asrState.phase | authoritative | Canonical WebSocket lifecycle phase. Guards all operations. |
asrState.userId | authoritative (shared mode) | Active user lock. Only one user at a time can use the shared bridge. Null when unlocked. |
asrState.client | authoritative | The OpenAiRealtimeTranscriptionClient instance. Null when idle. |
asrState.utterance | authoritative | Current utterance state: finalSegments, partialText, lastEventAt. Updated by WebSocket transcript events. |
asrState.speechDetectedUtteranceId | signal | Server VAD confirmation. Read by capture promotion logic (see Section 4). Must match captureState.asrUtteranceId. |
asrState.speechDetectedAt | signal | Timestamp of server VAD speech detection. |
asrState.pendingAudioChunks | buffer | Audio queued during connecting phase. Flushed on transition to ready. Capped at 10s (480,000 bytes). |
asrState.pendingAudioBytes | buffer metric | Total bytes in pending buffer. Used for overflow trimming. |
asrState.committingUtteranceId | guard | Ensures audio flush targets the correct utterance during commit. |
asrState.connectPromise | deduplication | Prevents concurrent connect attempts. Multiple callers await the same promise. |
asrState.consecutiveEmptyCommits | heuristic | Circuit breaker: after 3 consecutive empty commits with >1s audio, force-close and reconnect. |
asrState.idleTimer | lifecycle timer | Closes the WebSocket after idle TTL expires. Cleared on new utterance begin. |
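The `consecutiveEmptyCommits` circuit breaker can be sketched as follows; the reset-on-success behavior and the handling of short empty clips are assumptions beyond what the table states:

```typescript
// Sketch of the empty-commit circuit breaker. The limit and the >1s audio
// condition come from the table; function and field names are illustrative.
const EMPTY_COMMIT_BREAKER_LIMIT = 3;
const MIN_AUDIO_MS_FOR_BREAKER = 1000;

interface BreakerState {
  consecutiveEmptyCommits: number;
}

// Returns true when the bridge should force-close and reconnect.
function recordCommitResult(
  s: BreakerState,
  transcript: string,
  audioMs: number,
): boolean {
  if (transcript.trim() !== "") {
    s.consecutiveEmptyCommits = 0; // any real transcript resets the breaker
    return false;
  }
  if (audioMs <= MIN_AUDIO_MS_FOR_BREAKER) return false; // short clips don't count
  s.consecutiveEmptyCommits += 1;
  return s.consecutiveEmptyCommits >= EMPTY_COMMIT_BREAKER_LIMIT;
}
```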
14. Transition Rules
| Event | From | To |
|---|---|---|
ensureAsrSessionConnected called | idle | connecting |
| WebSocket opens, session.update sent | connecting | ready |
commitAsrUtterance called | ready | committing |
| Transcript resolved (or timeout) | committing | ready |
closePerUserAsrSession / closeSharedAsrSession called | any | closing |
| WebSocket closed, cleanup complete | closing | idle |
| Idle TTL timer fires | ready | closing → idle |
| Circuit breaker (3 consecutive empty commits) | committing | closing → idle → connecting → ready (reconnect) |
ASR bridge session updates use `session.update` with `session.type = "transcription"` and nested `audio.input` fields for format, noise reduction, turn detection, and transcription. Configured `g711_ulaw` and `g711_alaw` inputs are mapped to OpenAI's `audio/pcmu` and `audio/pcma` media descriptors.
15. Per-User vs Shared Mode
Per-User Mode (perUserAsrEnabled)
- One `AsrBridgeState` per active speaker in the `openAiAsrSessions` map
- Audio streams immediately from capture start (provisional audio is included)
- Each user's ASR session is independent — no contention
- Idle sessions are closed after `OPENAI_ASR_SESSION_IDLE_TTL_MS`
- Sessions are eagerly pre-connected on session start for the initial speaker
Shared Mode (sharedAsrEnabled)
- Single `AsrBridgeState` in `openAiSharedAsrState`
- User locking: `asrState.userId` acts as a mutex — only one user at a time
- Audio streaming begins only after capture promotion (not during the provisional phase)
- Commit uses a different path: `commitInputAudioBuffer` → `waitForSharedAsrCommittedItem` (promise-based waiter) → `waitForSharedAsrTranscriptByItem` (polls `finalTranscriptsByItemId`)
- After commit, `releaseSharedAsrActiveUser` unlocks, then `tryHandoffSharedAsr` checks for other promoted captures waiting
- Handoff replays buffered PCM chunks from the waiting capture
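The user-lock portion of shared mode reduces to a mutex over `asrState.userId`. A minimal sketch (function names here are hypothetical stand-ins for the real acquire/release paths):

```typescript
// Sketch of the shared-bridge user lock: asrState.userId as a mutex.
interface SharedAsrStateSketch {
  userId: string | null; // null = unlocked
}

// Acquire the shared bridge for a user; idempotent for the current holder.
function tryAcquireSharedAsr(s: SharedAsrStateSketch, userId: string): boolean {
  if (s.userId !== null && s.userId !== userId) return false; // held by another speaker
  s.userId = userId;
  return true;
}

// Unlock; a handoff scan would then pick the next waiting promoted capture.
function releaseSharedAsrActiveUserSketch(s: SharedAsrStateSketch): void {
  s.userId = null;
}
```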
Mode Selection (voiceConfigResolver.ts)
Per-user requires ALL of: session active, provider supports perUserAsr, OpenAI API key, not text-only mode, transcriptionMethod === "realtime_bridge", reply path is text-mediated (bridge or brain), usePerUserAsrBridge === true.
Shared requires ALL of: session active, provider supports sharedAsr (all providers), OpenAI API key, not text-only mode, transcriptionMethod === "realtime_bridge", reply path is text-mediated (bridge or brain), per-user is NOT enabled.
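The two requirement sets above share most conditions and differ only at the end, which suggests a resolution shape like the following sketch (the input struct and function name are assumptions; see `voiceConfigResolver.ts` for the real logic):

```typescript
// Sketch of ASR mode resolution from the two "requires ALL of" lists.
interface AsrModeInput {
  sessionActive: boolean;
  providerSupportsPerUserAsr: boolean;
  hasOpenAiApiKey: boolean;
  textOnlyMode: boolean;
  transcriptionMethod: string;
  replyPathTextMediated: boolean; // bridge or brain
  usePerUserAsrBridge: boolean;
}

function resolveAsrModeSketch(i: AsrModeInput): "per_user" | "shared" | "none" {
  // Conditions common to both modes.
  const base =
    i.sessionActive &&
    i.hasOpenAiApiKey &&
    !i.textOnlyMode &&
    i.transcriptionMethod === "realtime_bridge" &&
    i.replyPathTextMediated;
  if (!base) return "none";
  if (i.providerSupportsPerUserAsr && i.usePerUserAsrBridge) return "per_user";
  return "shared"; // all providers support sharedAsr
}
```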
16. Audio Buffering
Audio arrives via appendAudioToAsr on every onUserAudio chunk:
- If the phase is `connecting`: queue as an `AsrPendingAudioChunk` with the utterance ID, capped at 10s
- If the phase is `ready`: attempt a flush via `flushPendingAsrAudio`, then send directly
- If the phase is `committing`: queue for the next utterance (an utterance ID mismatch guard prevents mixing)
flushPendingAsrAudio sends all pending chunks to the WebSocket client, matching utterance IDs. Chunks for stale utterances are skipped.
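The connecting-phase buffer with its 10s (480,000-byte) cap can be sketched as a queue that trims oldest-first on overflow, matching the regression expectation that the newest audio survives:

```typescript
// Sketch of the pendingAudioChunks buffer with overflow trimming.
// The cap comes from the signals table; the struct shapes are illustrative.
const PENDING_AUDIO_CAP_BYTES = 480_000; // ~10s of audio

interface PendingChunk {
  utteranceId: string;
  pcm: Uint8Array;
}

interface PendingBuffer {
  chunks: PendingChunk[];
  bytes: number; // pendingAudioBytes
}

function queuePendingAudio(b: PendingBuffer, chunk: PendingChunk): void {
  b.chunks.push(chunk);
  b.bytes += chunk.pcm.length;
  // Overflow: drop the oldest chunks, never the newest.
  while (b.bytes > PENDING_AUDIO_CAP_BYTES && b.chunks.length > 1) {
    const dropped = b.chunks.shift()!;
    b.bytes -= dropped.pcm.length;
  }
}
```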
17. Transcript Resolution
Per-User Commit Flow
1. `commitAsrUtterance` called with the finalized capture's PCM
2. Phase: `ready` → `committing`
3. Flush remaining pending audio
4. Call `client.commitInputAudioBuffer()`
5. `waitForAsrTranscriptSettle`: poll `utterance.finalSegments` until stable or timeout
6. Build `AsrCommitResult` with transcript, timing, model info, logprobs
7. Phase: `committing` → `ready`
8. Schedule the idle close timer
If the commit times out empty but the same utterance produces a late final segment shortly after, the capture manager still watches that committed utterance object during the late-recovery window. A new provisional utterance for the same speaker does not cancel recovery of the older committed transcript.
If that late-recovery window also ends empty, both bridge empty-drop paths treat the newer speech as noise or abandonment rather than replaying older assistant audio deterministically. If the same speaker had just committed a live barge-in, the runtime hands interruption context back to the normal voice brain on the next real turn. Otherwise the runtime simply clears the no-turn capture and lets any still-valid queued assistant output drain naturally.
Malformed provider transcripts that contain OpenAI reserved control-token syntax such as <|...|>, vq_*_audio_*, audio_future*, or end_of_task are dropped at the ASR bridge boundary and again at the bridge-turn handoff if needed. Punctuation-only bridge results such as "?", "...", or similar prosody noise are also treated as empty ASR. These malformed or punctuation-only transcripts are treated the same as empty bridge results for recovery and interruption handoff, and they never enter realtime turn context, memory lookup, or admitted user turns. This guard is ASR-only: assistant directives such as [[TO:...]] and [[SOUNDBOARD:...]] remain valid on assistant-generation paths.
If that late transcript revises a turn that has already been admitted but has not started audio yet, the turn processor replaces the older queued turn in place and replays the corrected revision with a fresh reply scope. The corrected utterance is treated as the same turn becoming more complete, not as stale newer work that should be dropped.
Per-user item association follows the committed item_id first. When OpenAI server VAD auto-commits a turn before local capture finalization enters committing, the bridge still binds that item_id to the current active utterance. This prevents a final transcript such as "stop music" from being misattached to an older turn through previous_item_id.
Shared Commit Flow
1. Validate the user lock matches
2. Register a `pendingCommitRequest` with the user ID
3. Call `client.commitInputAudioBuffer()`
4. `waitForSharedAsrCommittedItem`: await a promise resolved by the `input_audio_buffer.committed` event
5. `waitForSharedAsrTranscriptByItem`: poll `finalTranscriptsByItemId` for the committed item's transcript
6. Build the `AsrCommitResult`
7. Release the user lock
8. `tryHandoffSharedAsr`: scan for other promoted captures waiting
The AsrCommitResult
```ts
{
  transcript: string;
  asrStartedAtMs: number;
  asrCompletedAtMs: number;
  transcriptionModelPrimary: string;
  transcriptionModelFallback: string | null;
  transcriptionPlanReason: string;
  usedFallbackModel: boolean;
  captureReason: string;
  transcriptLogprobs: Array<{ token, logprob, bytes }> | null;
}
```
This flows through queueRealtimeTurnFromAsrBridge into the turn processor as *Override fields on RealtimeQueuedTurn, skipping the turn processor's own ASR. See Section 7 for the full routing.
Canonical policy note:
- Raw PCM transcription plan selection is shared across realtime turn processing, file-ASR turns, and music-command interception
- `gpt-4o-mini-transcribe` keeps the short-clip no-fallback optimization only for `openai_realtime`
- Otherwise the mini model gets a single full-model fallback to `whisper-1`
18. Cross-Domain State Reads
| Subsystem | State Read | Where | Purpose |
|---|---|---|---|
| Capture Manager | capture.promotedAt | tryHandoffSharedAsr | Only hand off to promoted captures |
| Capture Manager | capture.sharedAsrBytesSent | tryHandoffSharedAsr | Skip captures that already sent shared ASR audio |
| Capture Manager | capture.pcmChunks | tryHandoffSharedAsr | Replay buffered audio during handoff |
| Capture Manager | capture.bytesSent | tryHandoffSharedAsr | Skip captures with no audio |
| Session Lifecycle | session.ending | All ASR operations | Abort on session teardown |
| Session Config | session.realtimeInputSampleRateHz | commitAsrUtterance | PCM duration estimation |
| Settings | voiceRuntime.openaiRealtime.* | resolveAsrModelParams | Model, language, prompt configuration |
| App Config | appConfig.openaiApiKey | ensureAsrSessionConnected | API key for WebSocket auth |
19. Client Events
The OpenAiRealtimeTranscriptionClient emits:
| Event | Handler | Effect on ASR State |
|---|---|---|
transcript | wireClientEvents | Updates utterance.finalSegments / partialText, sets lastTranscriptAt. Shared mode: populates finalTranscriptsByItemId. |
speech_started | wireClientEvents | Sets speechDetectedAt, speechDetectedUtteranceId, speechActive = true. Used by capture promotion (see Section 4) and, in transcript-overlap sessions, arms a pending same-speaker interrupt sustain window for the currently authorized interrupter. Before assistant audio starts, that same authorized speech_started does not hold or cancel a generation_only reply by itself. The runtime keeps re-checking the same assertive acoustic gate used for raw barge-in while that utterance stays active, so an early under-threshold speech_started can still mature into a real interrupt once assistant output is actually live. |
speech_stopped | wireClientEvents | Sets speechActive = false. In transcript-overlap sessions this also releases an uncommitted pending same-speaker interrupt so the staged turn can flush normally. |
error_event | wireClientEvents | Logs error. May trigger session close depending on severity. |
socket_closed | wireClientEvents | Transitions phase to idle. Clears client reference. |
20. Incident Debugging
When ASR produces no transcript for audible speech:
- Check `phase` — was the session in the `ready` state? If `connecting`, audio may have overflowed the 10s buffer.
- Check `committingUtteranceId` — did it match the current utterance? A stale utterance ID means audio was sent to the wrong commit.
- Check `consecutiveEmptyCommits` — the circuit breaker may have fired, triggering a reconnect.
- Check logprob confidence — a transcript may have been produced but dropped by `VOICE_ASR_LOGPROB_CONFIDENCE_THRESHOLD`.
When shared ASR hangs:
- Check `asrState.userId` — is the user lock stuck? A capture that failed to release would block all subsequent users.
- Check `pendingCommitResolvers` — are there unresolved promises waiting for `committed` events?
- Check `tryHandoffSharedAsr` — did the handoff scan find the waiting capture?
21. Regression Tests
These cases should remain covered:
- Audio buffered during the `connecting` phase is flushed on the transition to `ready`
- Buffer overflow at the 10s cap drops the oldest chunks, not the newest
- Per-user sessions close after the idle TTL
- The shared mode user lock prevents concurrent access
- Shared mode handoff replays buffered PCM to the next user
- The circuit breaker reconnects after 3 consecutive empty commits
- `speechDetectedUtteranceId` only confirms promotion for the matching capture
- Session teardown closes all ASR sessions cleanly
- Logprob confidence gating drops low-confidence transcripts downstream
Current coverage:
- `src/voice/voiceConfigResolver.test.ts` (mode resolution)
- `src/voice/voiceSessionManager.lifecycle.test.ts` (integration scenarios)
- `src/voice/voiceAsrBridge.test.ts` (per-user/server-VAD item binding and bridge lifecycle)
