Voice Session Log Dive — 2026-03-15
Session: 155583c8-1072-41b9-86f2-8981a232431f
Time: 20:27 – 20:35 UTC
Mode: elevenlabs_realtime (brain path with ElevenLabs TTS)
Participants: 2 (vuhlp + clanky)
Turns: 17 spoken replies, 11 [SKIP] decisions
Conversation Timeline
Annotations: (ignored) = barge-in denied, [SKIP] = clanky chose silence,
TRUNCATED = hit 520-char hard limit, SKIP-LEAK / NOTE-LEAK = directive
spoken aloud by TTS.
20:27:26 [user]: Let's get it.
20:27:37 [clanky]: yooo what's good, just rolled up! what we getting into?
20:27:38 [clanky]: yooo what's good conks! just slid into the function, ready
for whatever chaos y'all got brewing
20:27:49 [user]: Yo, can you, uh, can you take a look at TittyConk's stream?
20:27:55 [clanky]: oh bet, let me peep what's happening on their screen
20:28:16 [clanky]: yooo Terraria! looks like you're in some spooky corruption
biome, I see that purple nastiness spreading. and bruh you got a whole
underground base setup going, that's clean
20:28:43 [clanky] -> [SKIP]
20:28:46 [user]: I'm really glad I don't have to do this, but it sounds like
this is a devastating.
20:28:51 [clanky] -> [SKIP]
20:28:51 [user]: I got that.
20:28:56 [clanky] -> [SKIP]
20:28:58 [user]: Yeah, he just took a note. He said player is exploring
underground near corruption biome.
20:29:04 [clanky] -> [SKIP]
20:29:04 [user]: Has demonic Shadowflame knife equipped? Oh yeah, you can read.
20:29:12 [clanky]: yep I can see everything! that shadowflame knife is clean,
and yo you're getting some solid damage numbers there [SKIP]
***SKIP-LEAK***
20:29:12 [user] (ignored): That's right. Tell me about the
20:29:24 [user]: That's right. Tell me about the
20:29:27 [user]: Yep.
20:29:34 [clanky]: oh yo I see that Life Fruit notification popping up! that's
some good endgame loot right there, boosts your max health past 400.
and damn you're deep in the jungle caverns now, this lighting is giving
me serious spelunking vibes
20:29:44 [clanky] (to titty conk): oh word, you want me to break down what's
happening here? You're deep in the caverns right now, looks like you
found yourself a nice underground jungle area with some corruption
creeping in. I can see you got that chlorophyte detection going off
whic ***TRUNCATED***
20:29:45 [user] (ignored): Please help. What is this, Jeff?
20:29:50 [user]: Yeah, time to go for a short.
20:29:56 [clanky] -> [SKIP]
20:30:06 [user]: context: ### Language hint: en. ...
20:30:08 [clanky]: yo that's a sick jungle temple setup! I see you got yourself
a whole underground base going with that Lihzahrd brick architecture.
that's some serious endgame building right there, and I peep that
chlorophyte detection still pinging nearby
20:30:13 [clanky] -> [SKIP]
20:30:35 [user]: Clinky, bro. That's just who Clinky is.
20:30:42 [clanky] (to donky conk): ayy that's facts though, I do be doing my
own thing fr
20:30:42 [user] (ignored): He's an old-ass wizard.
20:30:52 [user]: This is speed up the TBS, though. It's a little slow.
20:31:01 [clanky]: oh snap you got the Chlorophyte Visor highlighted! that's
some solid endgame gear right there - plus 2 defense and increases your
minion cap by 1, that's pretty clutch for a summoner build
20:31:19 [user]: Was? Das Volk?
20:31:26 [user]: I don't know what the fuck you're talking about, but yeah.
20:31:30 [clanky]: bro did you just randomly drop some German on us? "was das
volk" - what the people? you having some existential moment while
looking at that Pretty Mirror or what?
20:31:34 [user]: And try me. Do you have a summoner build? Is that a thing?
20:31:55 [clanky]: oh hell yeah summoner builds are absolutely a thing! you got
the Chlorophyte Mask right there which is perfect for it - that set
bonus lets you summon a crystal leaf that shoots at enemies, plus it
boosts your minion count. summoner's honestly one of
***TRUNCATED NOTE-LEAK***
20:31:55 [user] (ignored): Do you have a summoner build? Is that a thing? ...
20:32:36 [user]: What are you?
20:32:38 [user]: What are you?
20:32:53 [clanky]: yo I'm clanky, your friendly neighborhood ancient wizard
who's been around since like the stone age but picked up all this
zoomer slang along the way - kinda cringe but it's whatever also peep
that "Silver detected nearby!" notification - that's some
***TRUNCATED NOTE-LEAK***
20:32:53 [user] (ignored): Yes, I saw them reading his notes he's not supposed
to do. but I feel like the old man voice is just the least annoying,
honestly.
20:33:43 [user]: I'm like, bro, why you gotta keep saying that?
20:33:51 [clanky] -> [SKIP]
20:33:51 [user]: Yeah.
20:33:58 [clanky] -> [SKIP]
20:33:58 [user]: Ich liebe dieses Lifestyle. [German: "I love this lifestyle."] Actually, I'll be crying.
20:34:03 [clanky] -> [SKIP]
20:34:03 [user]: Okay there, he did a good job there. That's all.
20:34:08 [clanky] -> [SKIP]
20:34:09 [user]: Yeah, because I did a downgrade to Sonnet 4. Thank you.
20:34:15 [clanky] -> [SKIP]
20:34:15 [user]: Hello.
20:34:22 [clanky]: yo what's good Conrad
20:34:22 [user] (ignored): And like the other ones are just too slow. It's even
a little too slow with this one.
20:34:25 [user]: But I don't, I want to use the smart model, you know?
20:34:35 [clanky]: yeah the real-time struggle is real with AI models,
especially when you want the smart one but it's chunky as hell. like
you want the big brain energy but then you're sitting there waiting for
it to think
20:34:55 [user]: What's good?
20:34:56 [user]: Exactly.
20:35:03 [clanky]: oh hey Conrad, just chillin watching you navigate these spicy
underground caverns
20:35:05 [clanky]: fr exactly, it's like you're stuck between wanting the
galaxy brain takes but also not wanting to wait three business days for
a response
Issues Reported
- Slow response / latency
- Mid-sentence cutoffs
- Directives ([SKIP], [[NOTE:...]]) read aloud
- Possible prompt bloat
1. Latency Breakdown
The gen2reply stage (LLM generation) dominated every turn.
| Metric | Range | Typical |
|---|---|---|
| generationToReplyRequestMs | 4,037 – 13,725ms | ~5,000–7,500ms |
| Memory load (totalLoadMs) | 295 – 1,865ms | ~310ms |
| ASR to generation start | 2 – 4,994ms | ~80ms |
| Queue wait | 0 – 3,029ms | ~0–300ms |
| Total end-to-end | 4,143 – 18,151ms | ~6,000–10,000ms |
Worst turn: 20:31:55 — 18.2 seconds total (13.7s in generation alone).
Key latency observations
- Queue wait spikes (up to 3,029ms) occur when a new turn arrives while a prior generation is still in-flight. This is backpressure, not a bug.
- Memory load spikes — two turns had 1,632ms and 1,865ms continuity embedding loads.
- Model-bound — the brain path serializes: memory load → context build → LLM generation → sentence accumulation → TTS dispatch (sketched below). Generation time is the primary bottleneck.
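To make the serialization concrete, here is a minimal sketch of the turn loop. Every name is illustrative; only the stage ordering and the timing comments come from the traces.

```typescript
// Illustrative sketch of the serialized brain path; not the actual service code.
type TurnInput = { sessionId: string; text: string };

declare function loadMemory(sessionId: string): Promise<unknown>;
declare function buildContext(input: TurnInput, memory: unknown): string;
declare function generateStreaming(prompt: string): AsyncIterable<string>;
declare function accumulateSentences(chunk: string): string[];
declare function dispatchToTts(sentence: string): Promise<void>;

async function runBrainTurn(input: TurnInput): Promise<void> {
  // Each stage awaits the previous one; nothing overlaps.
  const memory = await loadMemory(input.sessionId);      // ~300ms typical, ~1.9s cold
  const prompt = buildContext(input, memory);            // cheap relative to generation
  for await (const chunk of generateStreaming(prompt)) { // 4.0–13.7s in this session
    for (const sentence of accumulateSentences(chunk)) {
      await dispatchToTts(sentence);                     // per-sentence ElevenLabs dispatch
    }
  }
}
```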
Per-turn latency timeline
20:27:37 | src=realtime | total~4143ms | gen2reply=4037ms
20:27:38 | src=bot_join_greeting | total~4790ms | gen2reply=4790ms
20:27:55 | src=realtime | total~4249ms | gen2reply=4151ms
20:28:16 | src=stream_watch_brain_turn | total~9360ms | gen2reply=9360ms
20:28:43 | src=stream_watch_brain_turn | total~5400ms | gen2reply=5400ms [SKIP]
20:28:51 | src=realtime | total~5403ms | gen2reply=5322ms [SKIP]
20:28:56 | src=realtime | total~9876ms | gen2reply=5043ms queueWait=2376ms
20:29:04 | src=realtime | total~5782ms | gen2reply=5622ms [SKIP]
20:29:12 | src=realtime | total~12920ms | gen2reply=6590ms *** [SKIP] leak
20:29:34 | src=bot_turn_open_deferred_flush | total~7247ms | gen2reply=7247ms TRUNCATED
20:29:44 | src=realtime | total~10341ms | gen2reply=10259ms TRUNCATED (520ch)
20:29:56 | src=realtime | total~6455ms | gen2reply=6073ms [SKIP]
20:30:08 | src=stream_watch_brain_turn | total~7733ms | gen2reply=7733ms TRUNCATED
20:30:13 | src=realtime | total~8550ms | gen2reply=7339ms [SKIP]
20:30:42 | src=realtime | total~6374ms | gen2reply=5307ms
20:31:01 | src=realtime | total~11589ms | gen2reply=7297ms TRUNCATED
20:31:30 | src=bot_turn_open_deferred_flush | total~8873ms | gen2reply=8873ms
20:31:55 | src=realtime | total~18151ms | gen2reply=13725ms *** [[NOTE:]] leak, TRUNCATED
20:32:53 | src=realtime | total~10441ms | gen2reply=8086ms *** [[NOTE:]] leak, TRUNCATED
20:33:51 | src=realtime | total~7460ms | gen2reply=7303ms [SKIP]
20:33:58 | src=realtime | total~6735ms | gen2reply=6653ms [SKIP]
20:34:03 | src=realtime | total~14795ms | gen2reply=5004ms queueWait=3029ms [SKIP]
20:34:08 | src=realtime | total~10371ms | gen2reply=5305ms [SKIP]
20:34:15 | src=realtime | total~8511ms | gen2reply=5911ms [SKIP]
20:34:22 | src=realtime | total~7222ms | gen2reply=5806ms
20:34:35 | src=realtime | total~9207ms | gen2reply=7560ms TRUNCATED
20:35:03 | src=bot_turn_open_deferred_flush | total~5838ms | gen2reply=5838ms
20:35:05 | src=realtime | total~6468ms | gen2reply=6388ms
2. Mid-Sentence Cutoffs
Two distinct causes identified:
A. Hard 520-character truncation (bug — fixed)
normalizeVoiceReplyText called sanitizeBotText(text, 520), which sliced the
text to 519 characters and appended "…". Multiple replies hit the limit exactly:
| Time | Chars | Trailing text |
|---|---|---|
| 20:29:44 | 520 | Pretty clean exploratio… |
| 20:31:55 | 519 | set bonus… |
| 20:32:53 | 518 | there are som… |
The "…" was sent to ElevenLabs TTS and spoken as a trailing-off sentence.
Fix: Removed all maxLen: 520 from voice reply normalization. The model's
full output is what gets spoken. Think-aloud prefix retains its intentional
maxLen: 300.
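For reference, a reconstruction of the pre-fix slicing, assuming sanitizeBotText does nothing beyond the trim/slice/ellipsis described above (the real implementation may differ):

```typescript
// Reconstruction of the reported truncation behavior; body is an assumption.
function sanitizeBotText(text: string, maxLen?: number): string {
  const cleaned = text.trim();
  if (maxLen === undefined || cleaned.length <= maxLen) return cleaned;
  // Slice to maxLen - 1 and append "…"; the ellipsis itself was sent to
  // ElevenLabs and spoken as a trailing-off sentence.
  return cleaned.slice(0, maxLen - 1) + "…";
}

declare const fullReply: string;  // hypothetical model output
declare const thinkAloud: string; // hypothetical think-aloud prefix

const spoken = sanitizeBotText(fullReply);        // post-fix: no cap on voice replies
const thought = sanitizeBotText(thinkAloud, 300); // intentional cap retained
```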
B. Output lock interrupts
Two voice_output_lock_interrupt events (20:29:50, 20:31:34) cut speech
mid-sentence when the turn processor superseded a reply with a newer input.
The second one cut the reply about the German phrase at 20:31:30. This is
expected behavior when new turns arrive during speech, not a barge-in bug.
Barge-in policy
26/26 voice_barge_in_gate events were denied with
reason=interruption_policy_denied. This is the intended behavior — the bot
doesn't get cut off by listeners unless they're the active speaker or use a
wake word.
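The rule is simple enough to sketch. The shape below is an assumption; only the policy and the reason string come from the session logs.

```typescript
// Assumed shape of the barge-in gate; only the rule and the reason string
// are taken from the logs.
type BargeInRequest = { speakerId: string; usedWakeWord: boolean };
type GateState = { activeSpeakerId: string | null };

function gateBargeIn(req: BargeInRequest, state: GateState): { allowed: boolean; reason?: string } {
  if (req.usedWakeWord || req.speakerId === state.activeSpeakerId) {
    return { allowed: true };
  }
  return { allowed: false, reason: "interruption_policy_denied" };
}
```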
3. Directive Leaks into Spoken Audio
[SKIP] inline leak
20:29:12 — "...damage numbers there [SKIP]"
The model appended [SKIP] at the end of otherwise-valid speech text
(intending "skip the rest"). normalizeSkipSentinel strips trailing [SKIP]
from the full text, but in the streaming path, the sentence accumulator had
already dispatched the chunk to TTS before the post-generation strip ran.
Note: normalizeSkipSentinel in botHelpers.ts:1007 does strip a trailing
[SKIP] via regex, and the streaming path calls normalizeVoiceReplyText per
chunk, which in turn calls normalizeSkipSentinel — so this specific case
should have been caught before TTS. The logged replyText, however, goes
through a different normalization path (normalizeVoiceText) that never calls
normalizeSkipSentinel, which explains why the sentinel appears in the logs.
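A trailing-sentinel strip of the kind described would look roughly like this; the regex and return shape are assumptions, not the botHelpers.ts source:

```typescript
// Assumed reconstruction of normalizeSkipSentinel (botHelpers.ts:1007).
function normalizeSkipSentinel(text: string): { text: string; isSkip: boolean } {
  // A reply that is nothing but the sentinel means "stay silent".
  if (/^\s*\[SKIP\]\s*$/.test(text)) return { text: "", isSkip: true };
  // Otherwise strip a sentinel the model tacked onto the end of real
  // speech, as it did at 20:29:12.
  return { text: text.replace(/\s*\[SKIP\]\s*$/, ""), isSkip: false };
}
```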
[[NOTE:...]] leaks (bug — fixed)
20:31:55:
oh hell yeah summoner builds are absolutely a thing! ...summoner's honestly
one of the most fun classes once you get rolling with all your little guys
doing the work for you [[NOTE:Player has Chlorophyte Mask highlighted in
inventory tooltip showing stats - 20 defense, 16% increased melee damage...
20:32:53:
yo I'm clanky, your friendly neighborhood ancient wizard... also peep that
"Silver detected nearby!" notification [[NOTE:Player is in underground
jungle/cavern area with their elaborate base visible above...
extractNoteDirectives only ran post-generation in voiceReplyPipeline.ts:859
— too late for the streaming path, where sentence chunks are dispatched to TTS
incrementally. The sentence accumulator split the text at a sentence boundary
just before the [[NOTE:...]] opener, so the directive landed in a subsequent
speech chunk that was sent to ElevenLabs before the post-generation strip ran.
Fix: normalizeVoiceReplyText now calls extractNoteDirectives as its
first step, stripping notes from every chunk before they can reach TTS — in
both the streaming and non-streaming paths.
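A sketch of what that first step has to handle per chunk. The function name is from the report; the body, including the unterminated-directive handling, is an assumption (a real streaming implementation would buffer the held-back tail until the closing "]]" arrives rather than drop it):

```typescript
// Chunk-level note stripping; body is an assumption, not voiceReplies.ts.
function extractNoteDirectives(chunk: string): { speech: string; notes: string[] } {
  const notes: string[] = [];
  // Remove every complete [[NOTE:...]] directive and collect its body.
  let speech = chunk.replace(/\[\[NOTE:([\s\S]*?)\]\]/g, (_match, body: string) => {
    notes.push(body.trim());
    return "";
  });
  // A streamed chunk can end mid-directive; hold back everything from the
  // opener onward so "[[NOTE:" never reaches TTS.
  const open = speech.indexOf("[[NOTE:");
  if (open !== -1) speech = speech.slice(0, open);
  return { speech, notes };
}
```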
4. Prompt Size Analysis
| Metric | Start (20:27) | End (20:35) | Growth |
|---|---|---|---|
| System prompt | 7,558ch | 7,558ch | 0 (static) |
| User prompt | 4,174ch | 7,240ch | +73% |
| Context turns | 1 | 49 | +49 turns |
| Context chars | 33ch | 4,484ch | +4,451ch |
| Total prompt chars | ~11,765ch | ~19,122ch | +62% |
| Tool definitions (JSON) | 7,538ch | 7,538ch | 0 (static) |
Assessment: moderate bloat, not extreme
- System prompt is stable at 7.5K chars — appropriately cached.
- User prompt grows from ~4K to ~7.5K as screen watch context, membership events, and memory facts accumulate.
- Context turns cap at 49 (sliding window is working).
- 25 tools are always exposed regardless of turn relevance. The tool JSON alone is 7.5K chars, contributing to the input token count on every turn.
- Estimated per-turn input: ~19K prompt + ~7.5K tools = ~26.5K chars (~8–10K tokens) by session end.
Memory load timing
Steady ~300ms per turn with two spikes:
- 20:27:23: 1,632ms (first turn, cold continuity cache)
- 20:28:07: 1,865ms (continuity spike, 1,707ms in continuity alone)
All other loads: 295–480ms. Behavioral memory cached throughout.
Fixes Applied
| Issue | Root Cause | Fix | File |
|---|---|---|---|
| [[NOTE:]] spoken aloud | Notes not stripped in streaming path | extractNoteDirectives in normalizeVoiceReplyText | voiceReplies.ts |
| Mid-word cutoffs | sanitizeBotText(text, 520) hard truncation | Removed all maxLen: 520 from voice normalization | voiceReplies.ts |
| LLM calls not joinable with voice events | llm_call lacked sessionId | Added sessionId to LlmTrace type | serviceShared.ts, llm.ts, voiceReplies.ts |
| No prompt size summary | Had to parse replyPrompts JSON manually | Added systemPromptChars, userPromptChars, toolCount, toolDefinitionChars, totalPromptChars to realtime_reply_requested | voiceReplyPipeline.ts |
Commit: 867c5ed
Open Items / Future Considerations
- Latency is model-bound. Generation times of 4–13s dominate. Levers: a lighter model for ambient turns, dynamic tool pruning per turn, or prompt compression.
- 25 always-exposed tools add 7.5K chars of tool JSON to every turn. A tool router that prunes irrelevant tools per-turn could reduce input tokens and model decision overhead (see the sketch below).
- User prompt grows +73% over 8 minutes. Screen watch context and memory facts are the main contributors. Worth monitoring if sessions get longer.
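A sketch of the kind of tool router meant above. It is entirely hypothetical, with naive keyword scoring standing in for whatever relevance signal would actually be used:

```typescript
// Hypothetical per-turn tool router: expose only tools relevant to the
// current turn instead of all 25 definitions (~7.5K chars) every time.
type ToolDef = { name: string; description: string; keywords: string[] };

function routeTools(turnText: string, tools: ToolDef[], max = 6): ToolDef[] {
  const text = turnText.toLowerCase();
  return tools
    .map((tool) => ({ tool, score: tool.keywords.filter((k) => text.includes(k)).length }))
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, max)
    .map((entry) => entry.tool);
}
```

An embedding-based relevance score would be sturdier than keyword matching, but even this cut would shrink the per-turn tool JSON from ~7.5K chars to a handful of definitions.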
