Runtime Logs and Local Loki
This project emits structured runtime action logs as JSON lines and can stream them into a local Loki stack for debugging.
Structured log output
Runtime action logs are produced from Store.onActionLogged and include:
tsinstance— unique instance identifier fromCLANKER_INSTANCE_IDenv var (defaults to"default")sourcelevelkindeventagentguild_idchannel_idmessage_iduser_idusd_costcontentmetadata
Sensitive metadata keys are redacted (apiKey, token, accessToken, refreshToken, authorization, secret, etc.). Operational counters such as tokens and input_tokens remain visible.
Prompt logs for turn-level debugging are attached under metadata.replyPrompts
and stay hidden by default in the dashboard Action Stream.
Pretty stdout rules:
- actual delivered speech/transcripts stay emphasized in pretty stdout when the log carries
metadata.transcriptormetadata.incomingTranscript - output delivery stays labeled as
saidwhenmetadata.transcriptSource=output - request-time
metadata.replyTextstays visible in pretty stdout, but it is not rendered with the same bold speech emphasis as delivered transcripts llm_callruntime lines now summarize the returned model text the same way, plustoolNames,toolCallCount,responseChars,responseShape, andstopReasonwhen the provider reports onemetadata.rawContentSummaryandmetadata.responseDiagnosticscapture compact provider response shape for silent-output debugging: output item types, content part types, tool-call argument presence/counts, streamed delta/done text character counts, streamed function-call item/delta counts, recovered streamed tool-call counts/names, final output item count, and final provider status. These fields intentionally avoid logging full raw provider payloads.- normal-operation reply admission / bridge handoff logs now carry transcript character counts instead of duplicating full text; use ASR segment logs and final assistant transcript logs when you need exact wording
Canonical prompt-log coverage:
- text reply turns:
sent_reply,sent_message,reply_skipped - voice classifier decisions:
voice_turn_addressing - full-brain voice generation turns:
realtime_reply_requested,realtime_reply_skipped - full-brain voice tool loop:
voice_brain_tool_call,voice_brain_generation_failed - provider-native realtime tool loop:
realtime_tool_call_started,realtime_tool_call_completed,realtime_tool_call_failed - realtime bridge/native prompt refresh and forwarded turns:
openai_realtime_instructions_updated,openai_realtime_text_turn_forwarded
metadata.replyPrompts uses one shared shape:
systemPromptinitialUserPromptfollowupUserPromptsfollowupSteps
The Voice tab complements these per-event logs with a live SSE snapshot view:
promptState.instructionsshows the current realtime/system instructions the VC path is running with.promptState.classifier,promptState.generation, andpromptState.bridgeshow the latest captured classifier, full-brain, and bridge-forward prompt bundles for the active session.- Screen-share state in the Voice tab separates recent note-loop output (
streamWatch.visualFeed) from the prompt note buffer the VC brain sees (streamWatch.notePayload).
Environment settings
Add to .env:
CLANKER_INSTANCE_ID=default
RUNTIME_STRUCTURED_LOGS_ENABLED=true
RUNTIME_STRUCTURED_LOGS_STDOUT=true
RUNTIME_STRUCTURED_LOGS_FILE_PATH=data/logs/runtime-actions.ndjson
DASHBOARD_SETTINGS_SAVE_DEBUG=false
CLANKER_INSTANCE_ID distinguishes multiple bot instances in Loki/Grafana. Set a unique value per instance (e.g. clanky, clanky-2). Promoted to a first-class Loki label {instance="..."}.
RUNTIME_STRUCTURED_LOGS_FILE_PATH is the file Promtail tails into Loki.
DASHBOARD_SETTINGS_SAVE_DEBUG=true enables the otherwise-silent dashboard settings save success log.
Log pipeline coverage
All structured runtime events flow through Store.onActionLogged → ndjson file → Promtail → Loki.
| Source | Loki? | How |
|---|---|---|
store.logAction() calls (text runtime, voice runtime, LLM, tools, dashboard) | Yes | Direct pipeline |
ClankvoxClient Bun-side events (clankvox_voice_server_update, clankvox_gateway_op4_forwarded, etc.) | Yes | logAction callback injected at voice join |
Clankvox Rust tracing events (clankvox_voice_ready, clankvox_first_video_frame_forwarded, DAVE decrypt, H264 decoder, etc.) | Yes | IPC log message → _handleMessage → logAction |
MusicPlayer events (music_player_ytdlp_resolved, music_player_ytdlp_resolution_failed, etc.) | Yes | logAction callback injected at session init |
VoiceMusicPlayback events (voice_music_search_complete, voice_music_request_complete) | Yes | Direct store.logAction |
Dashboard settings events (settings_save_rejected_stale, settings_save_success, etc.) | Yes | Direct store.logAction |
Bot lifecycle events (bot_logged_in, slash_command_browse_error, etc.) | Yes | Direct store.logAction |
LLM OAuth events (claude_oauth_init_failed, claude_oauth_warmup_completed, etc.) | Yes | Direct store.logAction |
| Shutdown fallback console lines | No | Last-resort fallback when store may be closing |
| Fatal startup errors (before store is created) | No | console.error — store does not exist yet |
| Golden harness CLI output | No | Intentional CLI report printer |
Clankvox Rust tracing events are bridged via a custom tracing::Layer that serializes
info/warn/error events as OutMsg::Log IPC messages. These flow through the TLV
control channel to clankvoxClient._handleMessage(), which routes them to logAction.
The existing stderr fmt layer is preserved for developer terminal visibility.
Loki ingestion delay
After stopping a session, the most recent events may not be queryable in Grafana for 15–30 seconds while the Loki ingester flushes its in-memory chunks. If you need immediate access to the latest logs, query the ndjson file directly:
rg '<sessionId>' data/logs/runtime-actions.ndjson | rg '"event":"<event_name>"'
Local Loki stack
The repository includes:
docker-compose.loki.ymlops/loki/loki-config.ymlops/loki/promtail-config.ymlops/loki/grafana/provisioning/datasources/loki.yml
Start:
bun run logs:loki:up
Stop:
bun run logs:loki:down
Multi-instance log ingestion
Promtail uses a glob path (/var/log/clanker/*/runtime-actions.ndjson) so it picks up logs from any mounted instance directory. Each instance's log directory is mounted as a subdirectory under /var/log/clanker/ in docker-compose.loki.yml. The instance Loki label (from CLANKER_INSTANCE_ID) distinguishes instances in Grafana queries.
See multi-instance.md for full setup instructions.
Usage flow
- Start Loki stack (
bun run logs:loki:up). - Start bot (
bun run start). - Open Grafana at
http://localhost:3000(admin/admin). - In Explore, use Loki datasource and query:
{job="clanker_runtime"}
Filter examples:
{job="clanker_runtime",kind="voice_runtime"}
{job="clanker_runtime",agent="voice",level="error"}
{job="clanker_runtime",instance="clanky-2"}
Incident analysis and replay
Loki/Grafana is the raw event substrate, not the final debugging experience.
For day-to-day operator work, the highest-leverage observability surface is a turn debugger that reconstructs one incident from the underlying structured events.
Prioritize this workflow over generic metrics dashboards:
- open any sent message, reply, error, or suspicious runtime event
- reconstruct the full turn timeline from correlated logs
- show the raw JSON beside a human-readable replay
- let operators pivot into broader log search only after the local turn makes sense
Minimum replay payload for a text or multimodal turn:
- trigger message and
triggerMessageIds - ingress admission decision (
reply_admission_decision) - queue-side drop reasons (
reply_queue_gate_rejected) when a queued turn never reaches generation - reply-pipeline boundary (
reply_pipeline_gate) when a queued turn reaches the pipeline but stops before generation - tool availability snapshot (
reply_tool_availability) before the model decides whether to call tools - coalesced recent-message window
- addressing / admission decision
- prompt bundle (
metadata.replyPrompts) - LLM calls in order, including
toolNames,toolCallCount,stopReason, and transcript/output previews - tool loop steps and tool results (voice brain tool calls carry
toolResultSummarywith structured result data including search result titles, IDs, and selected tracks) - tool failures such as
browser_browse_failed, including runtime, provider/model, URL, and error details - attachment artifacts, image lookup results, fetched pages, or other retrieved context
- final delivered action: reaction, sent message, sent reply, skip, or failure
- memory side effects such as retrieval, embedding, and reply ingestion
- cost and latency breakdown (
performance.*,usd_cost)
For startup downtime incidents, include the startup catchup boundary events too:
startup_catchup_channel_scan_completestartup_catchup_beginstartup_catchup_channel_scannedwhen any channel actually queues catchup repliesstartup_catchup_complete
Correlation keys to preserve across the UI:
metadata.botIdmetadata.deploymentmetadata.triggerMessageIdmetadata.sessionIdmetadata.turnIdmetadata.sourcemetadata.stagemetadata.allowmetadata.reasonmessage_idmetadata.triggerMessageIdmetadata.triggerMessageIdsmetadata.sessionIdguild_idchannel_iduser_id
Text reply incident workflow
When a text message appears to be ignored before any prompt or LLM call happens, start with the ingress and queue boundaries:
reply_admission_decisionreply_queue_gate_rejectedreply_pipeline_gatereply_tool_availability
Suggested query:
{job="clanker_runtime",kind="text_runtime"} |= "reply_"
Inspect these metadata fields together:
botIddeploymenttriggerMessageIdsessionIdturnIdallowreasonaddressSignalattentionModeattentionReasonrecentReplyWindowActivequeueDepthforceRespondsendBudgetAllowedtalkNowAllowedctxBuiltincludedToolsexcludedTools
Interpretation notes:
reply_admission_decisionis the text ingress decision before the message is ever queued for generation.reason=hard_addressmeans the text path admitted the turn immediately because it was a direct address or exact bot-name hit.reason=recent_reply_windowmeans the turn stayed in the active follow-up window from a recent bot reply.reason=cold_ambient_llm_decidesmeans the turn was ambient and admitted to the LLM for it to decide whether to respond or[SKIP].reply_queue_gate_rejectedmeans the message was already queued, but the queue worker later dropped it because settings, identity, permissions, or duplicate-trigger state changed before send-time.reply_pipeline_gatemeans the queued turn reachedmaybeReplyToMessagePipeline().reason=readyis the successful boundary crossing; other reasons explain why the pipeline exited before any LLM call.reply_tool_availabilityrecords the tool set that was actually exposed to the model for that text turn, including excluded tool names with collapsed reasons likesettings_disabled,budget_blocked,provider_unconfigured, andno_history_images.- These decision-boundary events should all carry the same debugger anchor bundle:
botId,deployment,triggerMessageIdorsessionId,turnId,source,stage,allow, andreason.
Design rule:
- analytics charts are secondary
- incident reconstruction, replay, and search are primary
If a weird behavior requires an operator to manually stitch together five log lines across captioning, tool calls, memory, and final delivery, the logs are present but the debugging product is still incomplete.
Voice LLM call correlation
Voice brain generation calls emit llm_call events under kind="llm_call",
not kind="voice_runtime". To join them with voice pipeline events, filter
by sessionId (which both event families carry) or by
source="voice_realtime_generation".
Suggested query:
{job="clanker_runtime",kind="llm_call"} |~ "voice_realtime_generation"
Or by sessionId:
{job="clanker_runtime",kind="llm_call"} |~ "<sessionId>"
Key llm_call metadata for voice latency triage:
provider,model— which LLM handled the turnusage.inputTokens,usage.outputTokens— token countsusage.cacheReadTokens,usage.cacheWriteTokens— prompt cache hit rateresponseChars— output lengthresponseShape— compact stdout-friendly provider response shape summary, for exampleraw=array[0] delta=0 done=0 extracted=0 finalOut=0rawContentSummary/responseDiagnostics— structured provider output-shape diagnostics used when token usage is nonzero but the parsed reply text or tool calls are empty. OpenAI Responses streaming also records streamed function-call deltas; when a completed response omitsoutput[], the runtime can recover named streamed calls into the normal tool-call path and reports that asstreamRecoveredToolCallCount.stopReason—end_turn,tool_use,max_tokensstreaming— whether streaming transport was usedsessionId— voice session correlation key
Voice prompt size triage
realtime_reply_requested carries per-turn prompt size summary fields for
quick bloat detection without parsing the full replyPrompts bundle:
systemPromptChars— static system prompt lengthuserPromptChars— per-turn dynamic user prompt lengthtoolCount— number of tool definitions exposedtoolDefinitionChars— total JSON chars of tool schemastotalPromptChars— sum of system + user + context + tool charscontextTurnsSent— conversation context messages includedcontextCharsSent— total chars in context messages
Use these to spot prompt growth over a session without parsing JSON.
Voice memory attribution
When voice feels redundant or expensive, attribute memory work before changing prompting behavior.
Start with these events:
voice_generation_memory_loadedvoice_realtime_instruction_memory_loadedmemory_embedding_call
Suggested query:
{job="clanker_runtime",kind=~"voice_runtime|memory_embedding_call"}
Inspect these metadata fields together:
memorySourcetraceSourcecontinuityLoadMsbehavioralMemoryLoadMstotalLoadMsusedCachedBehavioralFactsuserFactCountrelevantFactCountbehavioralFactCount
Interpretation notes:
voice_generation_memory_loadedis the orchestrator brain path that feeds Claude/OpenAI/etc. generationvoice_realtime_instruction_memory_loadedis the separate OpenAI/Gemini/xAI realtime instruction refresh pathmemory_embedding_call.traceSourceattributes the embedding API call to the caller:voice_realtime_behavioral_memory:generation— behavioral fact retrieval from the brain/generation path (voiceReplies.ts)voice_realtime_behavioral_memory:instruction_refresh— behavioral fact retrieval from the realtime instruction refresh path (instructionManager.ts)voice_realtime_instruction_context— conversation continuity from instruction refreshvoice_realtime_generation— conversation continuity from generationmemory_query— explicit memory search tool call
- in transport_only sessions (brain mode), the instruction refresh path skips memory retrieval entirely — only the generation path should produce embedding calls
- if both
:generationand:instruction_refreshtraces appear on the same turn, the session is provider_native and both paths are expected
Voice tool ownership attribution
When you need to distinguish provider-native realtime tools from the full-brain/orchestrator path, use event family plus ownership metadata together.
Start with these events:
voice_brain_tool_callvoice_brain_generation_failedrealtime_tool_call_startedrealtime_tool_call_completedrealtime_tool_call_failed
Interpretation notes:
voice_brain_*events come from the full-brain reply path. They should carrymetadata.replyPath="brain"andmetadata.realtimeToolOwnership="transport_only".- Failed
voice_brain_tool_callevents also carrymetadata.errorwith the tool-returned error summary when one is available, so you do not need to infer the failure frommetadata.isErroralone. voice_brain_tool_callcarriesmetadata.toolResultSummarywith a structured summary of what the tool returned. For search/play tools this includes result titles, IDs, platforms, and selected track info. For browser tools it includes session ID and completion status. For other tools it falls back to a truncated raw string. This eliminates the need to cross-reference tool results from follow-up LLM call logs.realtime_tool_call_*events come from provider-native tool execution. These happen only when the session owns provider tools directly (realtimeToolOwnership="provider_native").session.modestill tells you which realtime runtime carried audio. It does not, by itself, tell you who owned planning or tools.
Native Screen Watch Fallback Attribution
When screen watch falls back to the share link and you need to know why native Discord watch did not bind, inspect these events together:
voice_screen_watch_capabilityscreen_watch_native_start_failedscreen_watch_started_nativestream_watch_reused_programmaticscreen_watch_link_fallback_startedscreen_watch_link_reused
Interpretation notes:
voice_screen_watch_capabilityis the per-turn capability snapshot from the full-brain voice path. It records whether screen watch was supported, enabled, currently available, what the native-specific reason/status were, and whetherstart_screen_watchwas actually exposed to the model on that turn.voice_screen_watch_capabilitycan report native watch as ready from discovered Go Live state before the active-sharer roster has frame evidence or full stream credentials.stream_discovery_go_live_bootstrap_seededmarks the moment Bun turnsVOICE_STATE_UPDATE.self_stream=trueinto provisional per-session Go Live state, beforeSTREAM_CREATEorSTREAM_SERVER_UPDATEarrive.stream_discovery_go_live_bootstrap_clearedmarks provisional bootstrap state being cleared afterVOICE_STATE_UPDATE.self_stream=falsewhen Discord never promoted that share into full discovered credentials.screen_watch_native_start_failedis the authoritative native-start failure event before fallback.screen_watch_started_nativemeans the Bun runtime actually bound a native Discord share target. Inspectmetadata.reusedandmetadata.frameReadyto tell whether it attached to an existing watch and whether usable pixels were already available.stream_watch_reused_programmaticis the manager-level reuse breadcrumb. It means a same-target native watch stayed active and Bun re-subscribed without resetting frame context.screen_watch_link_fallback_startedandscreen_watch_link_reusednow carrymetadata.nativeFailureReasonso the fallback line still explains why native did not win.
Useful metadata fields:
toolExposedsupportedenabledavailablenativeSupportednativeEnablednativeAvailablenativeStatusreasonnativeReasonfallbackrequestedTargetUserIdselectionReasonnativeActiveSharerCountnativeActiveSharerUserIdsgoLiveStreamUserIdgoLiveStreamCredentialsReadystreamKeynativeDecoderSupportedruntimeModevoiceChannelIdreusedframeReady
When you need to locate the failure layer for native Discord screen watch, follow the cross-process breadcrumb trail in order:
- Rust voice-conn observed Discord state:
clankvox_discord_video_state_observed - Rust saw an OP18 payload that looked video-adjacent but did not match the current video-state shape:
clankvox_voice_ws_unclassified_op18 - Rust capture supervisor applied state:
clankvox_native_video_state_received - Rust emitted state over IPC:
clankvox_native_video_state_emitted - Bun session applied active-sharer state:
native_discord_screen_share_state_updated - Rust accepted the watch subscription:
clankvox_native_video_subscribe_requested - Rust refreshed Discord sink wants:
clankvox_video_sink_wants_updated - Rust forwarded the first subscribed frame:
clankvox_first_video_frame_forwarded - Bun ingested a frame for model context:
stream_watch_frame_ingested - Bun failed to decode a forwarded frame before ingest:
native_discord_video_decode_failed
Interpretation notes:
- if
screen_watch_native_start_failedsaysrequested_target_not_actively_sharingand noclankvox_discord_video_state_observedappears, the Rust process never observed native share state from Discord - if
clankvox_voice_ws_unclassified_op18appears, Discord sent OP18 data that the current Rust parser did not recognize as video state - if you see
ignoring video state payload without user_id, Discord sent video-like state without the user identity field our Rust parser requires - if
clankvox_discord_video_state_observedappears withoutnative_discord_screen_share_state_updated, the issue is between Rust IPC emission and Bun session-state application - if state-update logs appear but
clankvox_first_video_frame_forwardednever appears, native share discovery worked but no subscribed frame made it through - if
native_discord_video_decode_failedappears withmetadata.timedOut=trueor contentffmpeg_decode_timeout, Bun killed the per-frameffmpegdecode watchdog before a JPEG reached stream-watch ingest - if
clankvox_video_sink_wants_skipped_no_connectionappears, a subscription request arrived before the voice connection was ready; latervoice_readyor another state update should trigger a retry path
Voice output incident workflow
When a voice turn is transcribed correctly but the bot does not answer, use the assistant output state machine doc first:
Start with these events:
voice_turn_addressingvoice_barge_in_gatevoice_direct_address_interruptbot_audio_startedopenai_realtime_response_doneopenai_realtime_active_response_cleared_stalerealtime_assistant_utterance_queuedrealtime_assistant_utterance_drain_blockedrealtime_assistant_utterance_queue_drained
Important interpretation rule:
- top-level
reason="output_locked"is a coarse public deny label outputLockReasonis the authoritative blocker for reply/output lock incidents
Suggested query:
{job="clanker_runtime",kind=~"voice_runtime|voice_turn_out"} |= "voice_turn_addressing"
Inspect these metadata fields together:
allowreasonoutputLockReasonblockerspendingResponseRequestIdpendingResponseSourcebackpressureActivestagesourcemsSinceAssistantReplyretryAfterMs
Common blockers:
outputLockReason=bot_audio_bufferedoutputLockReason=pending_responseoutputLockReason=openai_active_responseoutputLockReason=awaiting_tool_outputs
Operator notes:
voice_barge_in_gateis promotion-scoped, not chunk-scoped. It captures the first summarized interruption decision for a promoted user capture when some assistant output context is active.realtime_assistant_utterance_queuedtells you why spoken playback was queued instead of sent immediately. Look atblockers,outputLockReason,pendingResponse*, andttsBufferedSamples. Active user captures are no longer a blocker — the bot speaks when it has something to say, even while humans are talking.realtime_assistant_utterance_drain_blockedfires when queued spoken playback tried to drain but some blocker still held the floor. It is deduped by blocker signature, so a new line usually means the wedge changed state rather than simple polling noise.- if
voice_barge_in_gateshowsallow=false, treatreasonas the primary interruption blocker andoutputLockReasonas supporting context about what the assistant was doing. - if
voice_barge_in_gateshowsallow=false reason=music_only_playback, that means music was the only remaining output lock. Buffered or live bot speech should not resolve to this reason. - if
voice_barge_in_gateshowsallow=false reason=transcript_overlap_interrupts_enabled, the system intentionally waited for transcript-burst interruption logic instead of cutting audio from raw PCM. - if
voice_barge_in_gateshowsallow=false reason=local_only_promotion_pending_server_vad, the capture promoted from strong local audio but had not yet been confirmed by server VAD for live interruption. voice_interrupt_speech_started_pending initialReason=insufficient_capture_bytesmeans the authorized same-speaker overlap is still live and the runtime will keep checking whether it matures into a valid interrupt; the first under-threshold chunk is no longer a permanent miss by itself. IfeventType=local_capture_overlap, that pending sustain loop came from promoted local capture rather than providerinput_audio_buffer.speech_started.voice_interrupt_speech_started_retry_scheduledmeans a pending same-speaker overlap stayed active but still had not crossed the raw cut gate on the last sustain check; inspectcaptureBytesSent,minCaptureBytes, and signal metrics before blaming policy or music. Note:echo_guard_activeis a terminal denial — if the sustain recheck hits the echo guard, the pending interrupt is released rather than rescheduled, preventing the bot from cutting its own audio via echo.voice_interrupt_on_speech_started_sustain speechStillActiveSource=local_capturemeans the runtime committed the cut from an actively growing promoted capture even though providerspeech_startednever arrived in time;provider_speech_startedmeans the sustain path was driven by provider VAD as usual.realtime_reply_superseded_newer_inputnow includesreplyUserIdandpendingOtherSpeakerQueueDepth. Only turns from the reply target speaker count as interrupting; chatter from other participants is tracked but does not trigger supersede.- on
voice_barge_in_interruptorvoice_output_lock_interrupt, readinterruptAcceptanceModeandinterruptAcceptedseparately fromresponseCancelSucceeded/truncateSucceeded; async-confirmation providers can accept a locally committed cut before the provider emits its later interruption/completion event. - if
providerInterruptConfirmationPending=true, the runtime already cut local playback and accepted the interrupt, but the provider did not give an immediate ack for that cut path. - if
outputLockReason=bot_audio_bufferedpersists for more than a couple seconds afteropenai_realtime_response_done, suspect staleclankvoxplayback telemetry rather than real remaining speech - if
voice_turn_addressingshowsreason=bot_turn_openbut a same-momentvoice_direct_address_interruptfollows, the turn cut through output lock because it was an allowed wake-word / bot-name interruption - if a deferred turn keeps rescheduling, inspect whether
voice_activity_startedis followed byvoice_turn_dropped_silence_gate; silence-only captures should not be treated the same as real live speech
Bot utterance playback telemetry
When speech sounds cut off mid-sentence or you need to verify how much of the model's intended text was actually spoken aloud, use the per-utterance playback telemetry.
Start with this event:
bot_utterance_completed
This fires when the TTS provider signals it has finished generating audio for a
given utterance (ElevenLabs isFinal, OpenAI response.done, etc.). It
captures the full audio delivery picture for that single utterance.
Suggested query:
{job="clanker_runtime",kind="voice_runtime"} |= "bot_utterance_completed"
Inspect these metadata fields together:
utteranceTextLength— characters of text sent to the TTS providerutteranceTextPreview— first 120 characters of the intended speechhadAudio— did any audio reach clankvox at allaudioDeliveredBytes— PCM bytes actually forwarded to clankvox for playbackaudioDeliveredChunks— number of audio_delta chunks forwardedaudioSuppressedBytes— PCM bytes dropped by barge-in suppressionaudioSuppressedChunks— number of audio_delta chunks dropped by barge-in suppressionestimatedPlaybackMs— estimated playback duration from delivered bytes (PCM 16-bit mono at output sample rate)firstAudioLatencyMs— time from utterance request to first audio chunk delivereddeliveryRatio— percentage of total audio bytes that were delivered vs suppressed (100 = all delivered, 0 = all suppressed)ttsBufferedSamplesAtDone— clankvox TTS buffer depth when the provider signalled done; high values mean audio is still in the pipelinettsProviderEndReason— why the TTS provider ended (e.g.completed,cancelled)hadBargeSuppression— whether barge-in suppression was active when this response finished
Interpretation notes:
- if
deliveryRatio < 100andaudioSuppressedBytes > 0, a barge-in interrupt caused some TTS audio to be silently dropped. The bot was generating speech but the user interrupted, and the remaining audio was suppressed. - if
estimatedPlaybackMsis much shorter than expected forutteranceTextLength(rough rule: ~150 WPM, or ~4 chars per 100ms), the TTS stream may have ended prematurely. - if
hadAudio=falseandutteranceTextLength > 0, the TTS provider returned no audio at all for that text. Check ElevenLabs idle-timeout reconnect events orvoice_text_utterance_failed. - if
ttsBufferedSamplesAtDoneis high (> 48000 = ~1 second), audio is still in the clankvox pipeline waiting to be played. This is normal if the response just finished; it is suspicious if the bot goes silent while buffer samples remain. - pair with
realtime_assistant_utterance_queue_clearedto see how many queued sentence chunks from the same brain turn were dropped before they could be spoken. ThedroppedSourcesfield shows what kinds of utterances were lost anddroppedTextCharsshows the total unspoken text. - for end-to-end speech reconstruction, correlate
bot_utterance_completedwithrealtime_reply_requested(text generated) andvoice_turn_finalized(user speech) usingsessionIdandrequestId.
Video transport and DAVE decrypt workflow
When the screen watch pipeline fails to produce frames, start with the transport and DAVE layers before looking at Bun-side decode failures.
Start with these events:
clankvox_transport_decrypt_failedclankvox_h264_frame_nal_diagnosticclankvox_first_video_frame_forwardednative_discord_video_decode_failedstream_watch_frame_ingested
Interpretation notes:
clankvox_transport_decrypt_failedfires once per unique payload type when transport-level (AES-GCM / XChaCha20) decryption fails. If this fires for audio PT (111) or video PTs (103/105), the transport crypto AAD computation may be wrong. Inbound RTCP packets (PT 72-76) are filtered before decrypt and should never appear here.clankvox_h264_frame_nal_diagnosticlogs NAL types, keyframe status, and frame byte count at frame 1-5 and every 100th frame.video_keyframe_count=0after hundreds of frames means no SPS/IDR has arrived.nal_types=[7, 8, 1]means SPS+PPS were prepended to a P-slice from the depacketizer cache (normal for Go Live).- DAVE video decrypt works at near 100% on the main voice connection after the RTP padding strip fix. DAVE video passthrough warnings (
DAVE: video frame from user ... appears unencrypted) are expected only on Go Live streams, where thedaveycrate still misclassifies encrypted frames asUnencryptedWhenPassthroughDisabled. These frames are validated bylooks_like_valid_h264and dropped when they appear encrypted. Audio passthrough is still allowed on the stream watch transport since that audio channel is not used for ASR. native_discord_video_decode_failedwithffmpeg_decode_timeoutapplies only to VP8 frames (H264 is decoded in-process by clankvox's persistent OpenH264 decoder and does not use ffmpeg). This means ffmpeg hung waiting for EOF on a VP8 decode.native_discord_video_decode_failedwithdeblocking_filter_idc out of rangeorreference count overflowon frames that haveheaderHexstarting with valid SPS bytes means DAVE-encrypted video was forwarded as cleartext. The DAVE video drop fix should prevent this. This applies only to VP8 ffmpeg decode; H264 decode errors surface through clankvox-side log events (see below).native_discord_video_decode_failedwithh264_missing_parameter_setsis a legacy VP8/ffmpeg path error. H264 parameter set management is handled internally by the persistent OpenH264 decoder in clankvox.
H264 persistent decoder (clankvox)
The persistent OpenH264 decoder in clankvox emits its own diagnostic events:
clankvox_persistent_h264_decoder_created— a new per-user OpenH264 decoder instance was created for H264 decode. Logged once per user per session.clankvox_first_h264_frame_decoded— the first successful H264 frame decode for a user session. Indicates the decoder has valid reference state and is producing JPEG output.clankvox_openh264_decode_error— an individual H264 frame failed to decode. Includes error detail from OpenH264. Sporadic errors are normal (especially after mid-stream attach); sustained errors trigger a decoder reset.clankvox_openh264_decoder_reset— the decoder was destroyed and recreated after 50 consecutive decode errors. This clears all accumulated reference frame state. A fresh IDR keyframe is needed to resume decode.clankvox_decoder_reset_requesting_pli— after a decoder reset, clankvox sends PLI to Discord requesting a fresh keyframe for decoder re-initialization.clankvox_frame_diff_periodic— periodic scene-cut metrics emitted alongside decoded frames. Includes frame difference scores used for scene-change detection in the stream-watch pipeline.
Interpretation notes:
- If
clankvox_first_h264_frame_decodednever appears afterclankvox_persistent_h264_decoder_created, the decoder is not receiving decodable frames. Check DAVE decrypt status and whether IDR keyframes are arriving. - If
clankvox_openh264_decoder_resetfires repeatedly, the stream may lack IDR keyframes entirely or DAVE decrypt is failing on keyframes. Cross-reference withclankvox_h264_frame_nal_diagnosticto confirm keyframe presence. - The decoder processes ALL H264 frames (IDR + P-frames) for reference state accumulation, but JPEG emission is rate-limited to
nativeDiscordMaxFramesPerSecond. Decode errors on P-frames before the first IDR are expected and not actionable.
Useful metadata fields:
codeckeyframertpTimestampframeBytespendingH264AccessUnitCounttimedOutheaderHex
Voice input / VAD workflow
When the bot appears to ignore speech or ambient audio keeps opening turns, inspect the provisional-capture path before looking at reply admission.
Start with these events:
voice_activity_startedvoice_turn_dropped_provisional_captureopenai_realtime_asr_speech_startedopenai_realtime_asr_speech_stoppedvoice_realtime_transcription_emptyfile_asr_transcription_emptyopenai_realtime_asr_bridge_empty_dropped
Important interpretation rules:
voice_activity_startednow means a provisional capture promoted to a real turnpromotionReason=server_vad_confirmedmeans OpenAI Realtime transcription VAD confirmed speech for that utterancepromotionReason=strong_local_audiomeans the local fallback promoted without waiting for VADvoice_turn_dropped_provisional_capturemeans the capture never became a real turn and was discarded before normal reply admissionvoice_realtime_transcription_emptyincludestrackedUtteranceId,activeUtteranceId,finalSegmentCount, andpartialCharsso you can tell whether the commit went empty while a newer live utterance was already activefile_asr_transcription_emptymeans the local file-turn transcription path returned no transcript before admission/generationopenai_realtime_asr_bridge_empty_droppedmeans the bridge never forwarded any transcript into turn processing, so no downstream LLM generation happened for that utterance; punctuation-only ASR fragments such as"?"collapse into this pathvoice_interrupt_unclear_turn_handoff_requestedshould appear only when that exact bridge utterance already committed a real interrupt; stale interruption context alone should not synthesize this recovery eventvoice_interrupt_unclear_turn_handoff_skippedmeans empty/unclear ASR tried to enter interruption recovery, but the bridge utterance did not have valid committed interrupt context; inspectskipReasonbefore blaming ASR quality alonevoice_turn_addressingandopenai_realtime_text_turn_forwardednow report transcript length metadata rather than the full utterance text; inspectopenai_realtime_asr_final_segmentfor the exact user wording in realtime bridge sessions
Suggested query:
{job="clanker_runtime",kind="voice_runtime"} |= "voice_activity_started"
Inspect these metadata fields together:
promotionReasonpromotionServerVadConfirmedpromotionBytespromotionPeakpromotionRmspromotionActiveSampleRatio
Ambient-noise triage:
- repeated
voice_activity_startedfollowed byvoice_realtime_transcription_empty,file_asr_transcription_empty, oropenai_realtime_asr_bridge_empty_droppedusually means local promotion is still too permissive for the room - repeated
voice_turn_dropped_provisional_capturemeans the new provisional gate is working and the noise is being rejected before it becomes a turn - a promoted turn followed immediately by
voice_turn_dropped_provisional_captureoften means the drop belongs to a second weak follow-on capture, not the already-promoted utterance - if
openai_realtime_asr_speech_startednever appears for a promoted turn andpromotionReason=strong_local_audio, the fallback path promoted without server VAD confirmation - if promoted turns keep interrupting ambient bot speech, check
voice_system_speech_cancelled_for_user_speechtogether withrealtime_reply_skippedto confirm the thought path was preempted before audio
