docs/log-dives/2026-03-15-vc-session-155583c8.md

Voice Session Log Dive — 2026-03-15

  • Session: 155583c8-1072-41b9-86f2-8981a232431f
  • Time: 20:27 – 20:35 UTC
  • Mode: elevenlabs_realtime (brain path with ElevenLabs TTS)
  • Participants: 2 (vuhlp + clanky)
  • Turns: 17 spoken replies, 11 [SKIP] decisions

Conversation Timeline

Annotations: (ignored) = barge-in denied, [SKIP] = clanky chose silence, TRUNCATED = hit 520-char hard limit, SKIP-LEAK / NOTE-LEAK = directive spoken aloud by TTS.

20:27:26 [user]: Let's get it.
20:27:37 [clanky]: yooo what's good, just rolled up! what we getting into?
20:27:38 [clanky]: yooo what's good conks! just slid into the function, ready
         for whatever chaos y'all got brewing
20:27:49 [user]: Yo, can you, uh, can you take a look at TittyConk's stream?
20:27:55 [clanky]: oh bet, let me peep what's happening on their screen
20:28:16 [clanky]: yooo Terraria! looks like you're in some spooky corruption
         biome, I see that purple nastiness spreading. and bruh you got a whole
         underground base setup going, that's clean
20:28:43 [clanky] -> [SKIP]
20:28:46 [user]: I'm really glad I don't have to do this, but it sounds like
         this is a devastating.
20:28:51 [clanky] -> [SKIP]
20:28:51 [user]: I got that.
20:28:56 [clanky] -> [SKIP]
20:28:58 [user]: Yeah, he just took a note. He said player is exploring
         underground near corruption biome.
20:29:04 [clanky] -> [SKIP]
20:29:04 [user]: Has demonic Shadowflame knife equipped? Oh yeah, you can read.
20:29:12 [clanky]: yep I can see everything! that shadowflame knife is clean,
         and yo you're getting some solid damage numbers there [SKIP]
         ***SKIP-LEAK***
20:29:12 [user] (ignored): That's right. Tell me about the
20:29:24 [user]: That's right. Tell me about the
20:29:27 [user]: Yep.
20:29:34 [clanky]: oh yo I see that Life Fruit notification popping up! that's
         some good endgame loot right there, boosts your max health past 400.
         and damn you're deep in the jungle caverns now, this lighting is giving
         me serious spelunking vibes
20:29:44 [clanky] (to titty conk): oh word, you want me to break down what's
         happening here? You're deep in the caverns right now, looks like you
         found yourself a nice underground jungle area with some corruption
         creeping in. I can see you got that chlorophyte detection going off
         whic ***TRUNCATED***
20:29:45 [user] (ignored): Please help. What is this, Jeff?
20:29:50 [user]: Yeah, time to go for a short.
20:29:56 [clanky] -> [SKIP]
20:30:06 [user]: context: ### Language hint: en. ...
20:30:08 [clanky]: yo that's a sick jungle temple setup! I see you got yourself
         a whole underground base going with that Lihzahrd brick architecture.
         that's some serious endgame building right there, and I peep that
         chlorophyte detection still pinging nearby
20:30:13 [clanky] -> [SKIP]
20:30:35 [user]: Clinky, bro. That's just who Clinky is.
20:30:42 [clanky] (to donky conk): ayy that's facts though, I do be doing my
         own thing fr
20:30:42 [user] (ignored): He's an old-ass wizard.
20:30:52 [user]: This is speed up the TBS, though. It's a little slow.
20:31:01 [clanky]: oh snap you got the Chlorophyte Visor highlighted! that's
         some solid endgame gear right there - plus 2 defense and increases your
         minion cap by 1, that's pretty clutch for a summoner build
20:31:19 [user]: Was? Das Volk?
20:31:26 [user]: I don't know what the fuck you're talking about, but yeah.
20:31:30 [clanky]: bro did you just randomly drop some German on us? "was das
         volk" - what the people? you having some existential moment while
         looking at that Pretty Mirror or what?
20:31:34 [user]: And try me. Do you have a summoner build? Is that a thing?
20:31:55 [clanky]: oh hell yeah summoner builds are absolutely a thing! you got
         the Chlorophyte Mask right there which is perfect for it - that set
         bonus lets you summon a crystal leaf that shoots at enemies, plus it
         boosts your minion count. summoner's honestly one of
         ***TRUNCATED NOTE-LEAK***
20:31:55 [user] (ignored): Do you have a summoner build? Is that a thing? ...
20:32:36 [user]: What are you?
20:32:38 [user]: What are you?
20:32:53 [clanky]: yo I'm clanky, your friendly neighborhood ancient wizard
         who's been around since like the stone age but picked up all this
         zoomer slang along the way - kinda cringe but it's whatever also peep
         that "Silver detected nearby!" notification - that's some
         ***TRUNCATED NOTE-LEAK***
20:32:53 [user] (ignored): Yes, I saw them reading his notes he's not supposed
         to do. but I feel like the old man voice is just the least annoying,
         honestly.
20:33:43 [user]: I'm like, bro, why you gotta keep saying that?
20:33:51 [clanky] -> [SKIP]
20:33:51 [user]: Yeah.
20:33:58 [clanky] -> [SKIP]
20:33:58 [user]: Ich liebe dieses Lifestyle. Actually, I'll be crying.
20:34:03 [clanky] -> [SKIP]
20:34:03 [user]: Okay there, he did a good job there. That's all.
20:34:08 [clanky] -> [SKIP]
20:34:09 [user]: Yeah, because I did a downgrade to Sonnet 4. Thank you.
20:34:15 [clanky] -> [SKIP]
20:34:15 [user]: Hello.
20:34:22 [clanky]: yo what's good Conrad
20:34:22 [user] (ignored): And like the other ones are just too slow. It's even
         a little too slow with this one.
20:34:25 [user]: But I don't, I want to use the smart model, you know?
20:34:35 [clanky]: yeah the real-time struggle is real with AI models,
         especially when you want the smart one but it's chunky as hell. like
         you want the big brain energy but then you're sitting there waiting for
         it to think
20:34:55 [user]: What's good?
20:34:56 [user]: Exactly.
20:35:03 [clanky]: oh hey Conrad, just chillin watching you navigate these spicy
         underground caverns
20:35:05 [clanky]: fr exactly, it's like you're stuck between wanting the
         galaxy brain takes but also not wanting to wait three business days for
         a response

Issues Reported

  1. Slow response / latency
  2. Mid-sentence cutoffs
  3. Directives ([SKIP], [[NOTE:...]]) read aloud
  4. Possible prompt bloat

1. Latency Breakdown

The gen2reply stage (LLM generation) dominated every turn.

| Metric | Range | Typical |
|---|---|---|
| generationToReplyRequestMs | 4,037 – 13,725ms | ~5,000–7,500ms |
| Memory load (totalLoadMs) | 295 – 1,865ms | ~310ms |
| ASR to generation start | 2 – 4,994ms | ~80ms |
| Queue wait | 0 – 3,029ms | ~0–300ms |
| Total end-to-end | 4,143 – 18,151ms | ~6,000–10,000ms |

Worst turn: 20:31:55 — 18.2 seconds total (13.7s in generation alone).

Key latency observations

  • Queue wait spikes (up to 3,029ms) occur when a new turn arrives while a prior generation is still in-flight. This is backpressure, not a bug.
  • Memory load spikes — two turns had 1,632ms and 1,865ms continuity embedding loads.
  • Model-bound — the brain path serializes: memory load → context build → LLM generation → sentence accumulation → TTS dispatch. Generation time is the primary bottleneck.
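Because the stages run strictly in sequence, the end-to-end time is a plain sum of the stage timings. A minimal sketch of that decomposition, using the worst turn (20:31:55) as an example; the field names are illustrative, and the contextBuildMs and ttsDispatchMs figures are assumptions chosen to account for the logged remainder, not separately logged values:

```typescript
// Stage timings (ms) for one turn of the serialized brain path:
// queue wait -> memory load -> context build -> LLM generation
// -> sentence accumulation + TTS dispatch.
interface TurnTimings {
  queueWaitMs: number;
  memoryLoadMs: number;
  contextBuildMs: number;  // assumed; not logged as a separate metric
  gen2replyMs: number;     // LLM generation, the dominant term
  ttsDispatchMs: number;   // sentence accumulation + TTS dispatch (remainder)
}

// Stages run strictly in sequence, so the total is a plain sum.
function endToEndMs(t: TurnTimings): number {
  return (
    t.queueWaitMs +
    t.memoryLoadMs +
    t.contextBuildMs +
    t.gen2replyMs +
    t.ttsDispatchMs
  );
}

// Rough decomposition of the worst turn (20:31:55): only the 13,725ms
// generation time and the ~18,151ms total come from the logs.
const worstTurn: TurnTimings = {
  queueWaitMs: 0,
  memoryLoadMs: 310,
  contextBuildMs: 100,
  gen2replyMs: 13725,
  ttsDispatchMs: 4016,
};
```

This makes the "model-bound" claim concrete: even zeroing every other stage would leave the turn at 13.7 seconds.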

Per-turn latency timeline

20:27:37 | src=realtime                          | total~4143ms  | gen2reply=4037ms
20:27:38 | src=bot_join_greeting                 | total~4790ms  | gen2reply=4790ms
20:27:55 | src=realtime                          | total~4249ms  | gen2reply=4151ms
20:28:16 | src=stream_watch_brain_turn           | total~9360ms  | gen2reply=9360ms
20:28:43 | src=stream_watch_brain_turn           | total~5400ms  | gen2reply=5400ms  [SKIP]
20:28:51 | src=realtime                          | total~5403ms  | gen2reply=5322ms  [SKIP]
20:28:56 | src=realtime                          | total~9876ms  | gen2reply=5043ms  queueWait=2376ms
20:29:04 | src=realtime                          | total~5782ms  | gen2reply=5622ms  [SKIP]
20:29:12 | src=realtime                          | total~12920ms | gen2reply=6590ms  *** [SKIP] leak
20:29:34 | src=bot_turn_open_deferred_flush      | total~7247ms  | gen2reply=7247ms  TRUNCATED
20:29:44 | src=realtime                          | total~10341ms | gen2reply=10259ms TRUNCATED (520ch)
20:29:56 | src=realtime                          | total~6455ms  | gen2reply=6073ms  [SKIP]
20:30:08 | src=stream_watch_brain_turn           | total~7733ms  | gen2reply=7733ms  TRUNCATED
20:30:13 | src=realtime                          | total~8550ms  | gen2reply=7339ms  [SKIP]
20:30:42 | src=realtime                          | total~6374ms  | gen2reply=5307ms
20:31:01 | src=realtime                          | total~11589ms | gen2reply=7297ms  TRUNCATED
20:31:30 | src=bot_turn_open_deferred_flush      | total~8873ms  | gen2reply=8873ms
20:31:55 | src=realtime                          | total~18151ms | gen2reply=13725ms *** [[NOTE:]] leak, TRUNCATED
20:32:53 | src=realtime                          | total~10441ms | gen2reply=8086ms  *** [[NOTE:]] leak, TRUNCATED
20:33:51 | src=realtime                          | total~7460ms  | gen2reply=7303ms  [SKIP]
20:33:58 | src=realtime                          | total~6735ms  | gen2reply=6653ms  [SKIP]
20:34:03 | src=realtime                          | total~14795ms | gen2reply=5004ms  queueWait=3029ms [SKIP]
20:34:08 | src=realtime                          | total~10371ms | gen2reply=5305ms  [SKIP]
20:34:15 | src=realtime                          | total~8511ms  | gen2reply=5911ms  [SKIP]
20:34:22 | src=realtime                          | total~7222ms  | gen2reply=5806ms
20:34:35 | src=realtime                          | total~9207ms  | gen2reply=7560ms  TRUNCATED
20:35:03 | src=bot_turn_open_deferred_flush      | total~5838ms  | gen2reply=5838ms
20:35:05 | src=realtime                          | total~6468ms  | gen2reply=6388ms

2. Mid-Sentence Cutoffs

Two distinct causes identified:

A. Hard 520-character truncation (bug — fixed)

normalizeVoiceReplyText called sanitizeBotText(text, 520), which sliced the text to 519 chars and appended "…". Multiple replies hit this limit exactly:

| Time | Chars | Trailing text |
|---|---|---|
| 20:29:44 | 520 | "Pretty clean exploratio…" |
| 20:31:55 | 519 | "set bonus…" |
| 20:32:53 | 518 | "there are som…" |

The "…" was sent to ElevenLabs TTS and spoken as a trailing-off sentence.

Fix: Removed all maxLen: 520 from voice reply normalization. The model's full output is what gets spoken. Think-aloud prefix retains its intentional maxLen: 300.
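A minimal sketch of the bug and the fix, assuming a simplified sanitizeBotText signature (the real implementation is not shown in this log dive):

```typescript
// Sketch of the truncation bug: with a maxLen, the text is sliced to
// maxLen - 1 chars and an ellipsis is appended, which TTS then speaks
// as a trailing-off sentence.
function sanitizeBotText(text: string, maxLen?: number): string {
  const cleaned = text.trim();
  if (maxLen !== undefined && cleaned.length > maxLen) {
    return cleaned.slice(0, maxLen - 1) + "…"; // 519 chars + "…" at maxLen=520
  }
  return cleaned;
}

// Before the fix: voice replies passed maxLen = 520.
const beforeFix = (text: string) => sanitizeBotText(text, 520);
// After the fix: no maxLen, so the model's full output is spoken.
const afterFix = (text: string) => sanitizeBotText(text);
```

The table above matches this behavior: every truncated reply lands at 518–520 chars with a trailing "…".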

B. Output lock interrupts

Two voice_output_lock_interrupt events (20:29:50, 20:31:34) cut speech mid-sentence when the turn processor superseded a reply with a newer input. The second one cut the "German" reply at 20:31:30. This is expected behavior when new turns arrive during speech, not a barge-in bug.

Barge-in policy

26/26 voice_barge_in_gate events were denied with reason=interruption_policy_denied. This is the intended behavior — the bot doesn't get cut off by listeners unless they're the active speaker or use a wake word.
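The policy described above can be sketched as a simple gate; the field and function names here are invented for illustration, not the actual implementation:

```typescript
// Hypothetical barge-in gate: listeners cannot cut the bot off unless
// they are the active speaker or used a wake word.
interface BargeInAttempt {
  isActiveSpeaker: boolean;
  usedWakeWord: boolean;
}

function bargeInDecision(
  attempt: BargeInAttempt,
): "allowed" | "interruption_policy_denied" {
  return attempt.isActiveSpeaker || attempt.usedWakeWord
    ? "allowed"
    : "interruption_policy_denied";
}
```

Under this model, all 26 denials in the session correspond to attempts where neither condition held.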


3. Directive Leaks into Spoken Audio

[SKIP] inline leak

20:29:12 — "...damage numbers there [SKIP]"

The model appended [SKIP] at the end of otherwise-valid speech text (intending "skip the rest"). normalizeSkipSentinel strips trailing [SKIP] from the full text, but in the streaming path, the sentence accumulator had already dispatched the chunk to TTS before the post-generation strip ran.

Note: normalizeSkipSentinel in botHelpers.ts:1007 does handle trailing [SKIP] via regex, and the streaming path calls normalizeVoiceReplyText per chunk, which in turn calls normalizeSkipSentinel, so this specific case should have been caught. The logged replyText, however, went through a different normalization path (normalizeVoiceText) that does not call normalizeSkipSentinel, which explains why the sentinel appeared in the logs.
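A minimal sketch of trailing-[SKIP] handling; the regex and return shape are assumptions, not the actual botHelpers.ts implementation:

```typescript
// Sketch of [SKIP] sentinel handling: a bare [SKIP] means "stay silent",
// while a trailing [SKIP] after valid speech means "skip the rest" and
// must be stripped before the text reaches TTS.
function normalizeSkipSentinel(text: string): { text: string; skip: boolean } {
  if (text.trim() === "[SKIP]") {
    return { text: "", skip: true }; // whole reply suppressed
  }
  // Strip a trailing [SKIP] so it is never spoken aloud.
  const stripped = text.replace(/\s*\[SKIP\]\s*$/, "");
  return { text: stripped, skip: false };
}
```

The leak happened not because this logic is wrong, but because the chunk had already been dispatched to TTS by the time the strip ran in the path that was actually taken.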

[[NOTE:...]] leaks (bug — fixed)

20:31:55:

oh hell yeah summoner builds are absolutely a thing! ...summoner's honestly
one of the most fun classes once you get rolling with all your little guys
doing the work for you [[NOTE:Player has Chlorophyte Mask highlighted in
inventory tooltip showing stats - 20 defense, 16% increased melee damage...

20:32:53:

yo I'm clanky, your friendly neighborhood ancient wizard... also peep that
"Silver detected nearby!" notification [[NOTE:Player is in underground
jungle/cavern area with their elaborate base visible above...

extractNoteDirectives only ran post-generation in voiceReplyPipeline.ts:859 — too late for the streaming path where sentence chunks are dispatched to TTS incrementally. The sentence accumulator chunked the text at a sentence boundary that fell before the [[NOTE:...]] start, so the note content was included in a speech chunk sent to ElevenLabs.

Fix: normalizeVoiceReplyText now calls extractNoteDirectives as its first step, stripping notes from every chunk before it can reach TTS — both streaming and non-streaming paths.
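A sketch of what that first step might look like; the function bodies here are assumptions built from the behavior described above, not the real voiceReplies.ts code:

```typescript
// Sketch of note stripping as the first normalization step: [[NOTE:...]]
// directives are removed from every chunk before TTS dispatch, covering
// both the streaming and non-streaming paths.
function extractNoteDirectives(text: string): {
  speech: string;
  notes: string[];
} {
  const notes: string[] = [];
  const speech = text
    // Complete directives: capture the note body, drop it from speech.
    .replace(/\[\[NOTE:([^\]]*)\]\]/g, (_match, body: string) => {
      notes.push(body.trim());
      return "";
    })
    // Unterminated directive at end-of-chunk (the truncated-leak case).
    .replace(/\[\[NOTE:[^\]]*$/, "")
    .trim();
  return { speech, notes };
}

function normalizeVoiceReplyText(chunk: string): string {
  // First step: notes never reach ElevenLabs, even mid-stream.
  return extractNoteDirectives(chunk).speech;
}
```

Running the strip per chunk is what closes the streaming hole: a sentence boundary can no longer land a note inside a speech chunk.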


4. Prompt Size Analysis

| Metric | Start (20:27) | End (20:35) | Growth |
|---|---|---|---|
| System prompt | 7,558ch | 7,558ch | 0 (static) |
| User prompt | 4,174ch | 7,240ch | +73% |
| Context turns | 1 | 49 | +48 turns |
| Context chars | 33ch | 4,484ch | +4,451ch |
| Total prompt chars | ~11,765ch | ~19,122ch | +62% |
| Tool definitions (JSON) | 7,538ch | 7,538ch | 0 (static) |

Assessment: moderate bloat, not extreme

  • System prompt is stable at 7.5K chars — appropriately cached.
  • User prompt grows from ~4.2K to ~7.2K chars as screen watch context, membership events, and memory facts accumulate.
  • Context turns cap at 49 (sliding window is working).
  • 25 tools always exposed regardless of turn relevance. Tool JSON alone is 7.5K chars contributing to input token count on every turn.
  • Estimated per-turn input: ~19K prompt + ~7.5K tools = ~26.5K chars (~8-10K tokens) by session end.
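The token estimate above can be sanity-checked with back-of-envelope arithmetic, assuming ~3 chars per token for English chat text (a common rule of thumb, not a tokenizer measurement):

```typescript
// Order-of-magnitude token estimate from character counts. The 3
// chars/token ratio is an assumption, so treat the output as a range
// check, not a precise count.
function estimateTokens(chars: number, charsPerToken = 3): number {
  return Math.round(chars / charsPerToken);
}

const promptChars = 19_122; // total prompt chars at session end (table above)
const toolChars = 7_538;    // static tool-definition JSON
const totalChars = promptChars + toolChars; // sums to ~26.7K chars per turn
const approxTokens = estimateTokens(totalChars); // lands in the ~8-10K range
```

The exact sum is 26,660 chars, consistent with the ~26.5K figure cited and with the ~8-10K token estimate.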

Memory load timing

Steady ~300ms per turn with two spikes:

  • 20:27:23: 1,632ms (first turn, cold continuity cache)
  • 20:28:07: 1,865ms (continuity spike, 1,707ms in continuity alone)

All other loads: 295–480ms. Behavioral memory cached throughout.


Fixes Applied

| Issue | Root Cause | Fix | File |
|---|---|---|---|
| [[NOTE:]] spoken aloud | Notes not stripped in streaming path | extractNoteDirectives in normalizeVoiceReplyText | voiceReplies.ts |
| Mid-word cutoffs | sanitizeBotText(text, 520) hard truncation | Removed all maxLen: 520 from voice normalization | voiceReplies.ts |
| LLM calls not joinable with voice events | llm_call lacked sessionId | Added sessionId to LlmTrace type | serviceShared.ts, llm.ts, voiceReplies.ts |
| No prompt size summary | Had to parse replyPrompts JSON manually | Added systemPromptChars, userPromptChars, toolCount, toolDefinitionChars, totalPromptChars to realtime_reply_requested | voiceReplyPipeline.ts |

Commit: 867c5ed


Open Items / Future Considerations

  • Latency is model-bound. 4-13s generation times dominate. Levers: lighter model for ambient turns, dynamic tool pruning per turn, or prompt compression.
  • 25 always-exposed tools add 7.5K chars of tool JSON to every turn. A tool router that prunes irrelevant tools per-turn could reduce input tokens and model decision overhead.
  • User prompt grows +73% over 8 minutes. Screen watch context and memory facts are the main contributors. Worth monitoring if sessions get longer.
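The tool-router idea above can be sketched as a simple topic filter; every name here is invented for illustration, and a real router would need fallbacks for ambiguous turns:

```typescript
// Hypothetical per-turn tool router: expose only tools whose declared
// topics intersect the current turn's topics, instead of sending all
// 25 tool definitions on every turn.
interface ToolDef {
  name: string;
  topics: string[];       // coarse tags the router matches against
  definitionJson: string; // the JSON schema sent to the model
}

function pruneTools(tools: ToolDef[], turnTopics: Set<string>): ToolDef[] {
  return tools.filter(t => t.topics.some(topic => turnTopics.has(topic)));
}

// Chars of tool JSON that would actually hit the prompt after pruning.
function toolJsonChars(tools: ToolDef[]): number {
  return tools.reduce((sum, t) => sum + t.definitionJson.length, 0);
}
```

For a stream-watching turn, pruning everything without a "stream" tag would cut most of the 7.5K chars of tool JSON, at the cost of a (likely cheap) classification step per turn.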