Unified Stream Watch Pipeline
Status: proposed
References:
- docs/voice/screen-share-system.md
- docs/voice/voice-client-and-reply-orchestration.md
- src/voice/voiceStreamWatch.ts: frame ingestion, vision triage, commentary triggers
- src/voice/voiceReplyPipeline.ts: generation pipeline, frame/note injection
- src/voice/clankvox/src/video_decoder.rs: H264 decode, scene-cut metrics, change scores
Problem
The screen watch system has two mutually exclusive modes (direct and context_brain) that each solve half the problem:
- context_brain: a cheap vision model accumulates notes on a steady interval, but the main brain only fires on "high urgency", which is rarely triggered in practice. Notes are always injected, but the brain almost never gets proactive commentary turns.
- direct: the main brain sees raw frames on every proactive turn and writes [[NOTE:...]] inline, but note accumulation is blocked by the same gates that block commentary (audio quiet window, hasQueuedVoiceWork, isStreamWatchPlaybackBusy). During active conversation, notes go stale.
Observed issues from log analysis (Mar 16 2026 sessions)
- Vision diff never triggers. Change scores during a Terraria boss fight maxed at 0.016 against a threshold of 0.15. All 31 commentary turns across 3 sessions were interval-timer fires (changeTriggered: false). The clankvox diffing works (scores vary with screen activity), but the threshold is ~10x too high.
- Notes go stale during conversation. The audio quiet window (2.2s) and the hasQueuedVoiceWork gate block both commentary AND note accumulation in direct mode. During a 60-second stretch of active gaming chatter, Clanky's visual memory freezes.
- Voice replies carry unnecessary image weight. Every voice turn during screen share attached the raw JPEG (~1500-2000 tokens), even conversational turns unrelated to the screen. Fixed in commit 52e35b6: voice replies now only get rolling notes, not the image.
- Two modes, redundant settings. Operators choose between two pipelines that have separate model configs, interval settings, and entry limits. The mental model is confusing and the modes don't compose.
Design
Merge the two modes into a single pipeline with two decoupled loops:
┌─────────────────────────────────────────────────────┐
│ FRAME INGESTION │
│ clankvox → DecodedVideoFrame IPC → ingestStreamFrame│
│ (2fps, rate-limited by maxFramesPerMinute) │
└──────────────┬──────────────────────────┬────────────┘
│ │
▼ ▼
┌──────────────────────────┐ ┌──────────────────────┐
│ NOTE-TAKER LOOP │ │ LATEST FRAME STORE │
│ │ │ (always updated) │
│ Adaptive interval: │ └──────────┬───────────┘
│ │ │
│ score >= changeThreshold│ │
│ → fire now (cooldown │ │
│ permitting) │ │
│ score < staticFloor │ │
│ → idle interval (30s) │ │
│ otherwise │ │
│ → normal interval(10s)│ │
│ scene cut │ │
│ → fire immediately │ │
│ │ │
│ NOT gated by: │ │
│ - Audio quiet window │ │
│ - hasQueuedVoiceWork │ │
│ - isStreamWatchPlayback │ │
│ │ │
│ Throttled only by: │ │
│ - Vision model latency │ │
│ (natural backpressure)│ │
│ - Change cooldown (2s) │ │
│ │ │
│ Produces: │ │
│ - Rolling notes │ │
│ - Stored in │ │
│ brainContextEntries │ │
│ - Never speaks │ │
└──────────┬───────────────┘ │
│ │
▼ │
┌──────────────────────────────────────────▼──────────┐
│ VOICE BRAIN │
│ │
│ On user-speech turns: │
│ - Sees rolling notes (always fresh) │
│ - No image attached │
│ - Responds to conversation naturally │
│ │
│ On commentary turns (proactive): │
│ - Sees rolling notes + current frame │
│ - Gated by audio quiet window + interval │
│ - Decides to speak or [SKIP] │
│ - Writes additional [[NOTE:...]] inline │
│ │
│ On direct-address about screen: │
│ - Sees rolling notes + current frame (re-attached) │
│ - "what's on screen?" / "what do you see?" │
└─────────────────────────────────────────────────────┘
Note-taker loop
A standalone async loop that runs independently of the voice reply pipeline.
Adaptive interval based on screen activity:
every frame arrival:
    if scene cut:
        fire immediately (cooldown permitting)
    else if changeScore >= changeThreshold (0.01):
        fire immediately (cooldown permitting)
    else if changeScore < staticFloor (0.005):
        use idle interval: noteIdleIntervalSeconds (30s)
    else:
        use normal interval: noteIntervalSeconds (10s)
The interval adapts to what's on screen. An active boss fight ticks every 10s (or faster on big visual changes). A game lobby with ambient particle effects stretches to 30s. A completely static screen also uses 30s. The static floor filters out ambient motion (cursor blinks, subtle animations, screensaver-style effects) that produce nonzero change scores but aren't worth a vision call when the last note is still fresh.
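The decision rule above can be sketched as a pure function. This is a minimal illustration, not the implementation: the type names, the function name decideNoteTick, and the secondsSinceLastChangeTick field are hypothetical, and it assumes that a change-triggered tick blocked by the cooldown falls back to the interval rule. Only the default values come from this document.

```typescript
// Sketch only: names are hypothetical, defaults are the ones proposed in this doc.
interface NoteTickConfig {
  changeThreshold: number;
  staticFloor: number;
  noteIntervalSeconds: number;
  noteIdleIntervalSeconds: number;
  changeMinIntervalSeconds: number;
}

const DEFAULT_NOTE_TICK_CONFIG: NoteTickConfig = {
  changeThreshold: 0.01,
  staticFloor: 0.005,
  noteIntervalSeconds: 10,
  noteIdleIntervalSeconds: 30,
  changeMinIntervalSeconds: 2,
};

interface FrameSignal {
  changeScore: number;
  sceneCut: boolean;
  secondsSinceLastNote: number;       // since any note-taker tick
  secondsSinceLastChangeTick: number; // since the last scene-cut or change-triggered tick
}

type NoteTickReason = "scene_cut" | "change" | "idle_interval" | "normal_interval" | "waiting";

interface NoteTickDecision {
  fire: boolean;
  reason: NoteTickReason;
}

function decideNoteTick(
  frame: FrameSignal,
  cfg: NoteTickConfig = DEFAULT_NOTE_TICK_CONFIG,
): NoteTickDecision {
  // "Cooldown permitting": early fires are limited by changeMinIntervalSeconds.
  const cooldownOk = frame.secondsSinceLastChangeTick >= cfg.changeMinIntervalSeconds;
  if (frame.sceneCut && cooldownOk) return { fire: true, reason: "scene_cut" };
  if (frame.changeScore >= cfg.changeThreshold && cooldownOk) return { fire: true, reason: "change" };

  // Otherwise pick the interval: idle below the static floor, normal above it.
  const idle = frame.changeScore < cfg.staticFloor;
  const interval = idle ? cfg.noteIdleIntervalSeconds : cfg.noteIntervalSeconds;
  if (frame.secondsSinceLastNote >= interval) {
    return { fire: true, reason: idle ? "idle_interval" : "normal_interval" };
  }
  return { fire: false, reason: "waiting" };
}
```

With the proposed defaults, a frame scoring 0.016 (the observed boss-fight peak) fires as soon as the 2s cooldown has passed, while a frame scoring 0.001 waits for the 30s idle interval.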
What it does NOT care about:
- Whether anyone is talking (no audio quiet window)
- Whether the bot is speaking or generating (no hasQueuedVoiceWork)
- Whether playback is active (no isStreamWatchPlaybackBusy)
What throttles it:
- Natural backpressure: await the vision model call before allowing the next one. It can't fire faster than the model responds.
- Change-triggered cooldown (changeMinIntervalSeconds, default 2s) prevents rapid-fire on sustained high-change content
- maxFramesPerMinute still applies at the frame ingestion layer
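One way to get both throttles is a small single-flight gate around the vision call. The sketch below is an assumption about shape, not existing code: NoteTakerGate, runGatedVisionCall, and the describeFrame callback are hypothetical names, and it assumes the change cooldown applies only to change-triggered ticks, not to interval ticks.

```typescript
// Sketch only: a single-flight gate plus change cooldown. All names are hypothetical.
class NoteTakerGate {
  private inFlight = false;
  private lastChangeFireMs = Number.NEGATIVE_INFINITY;

  constructor(private readonly changeMinIntervalMs: number = 2000) {}

  // Returns true if a vision call may start now, and marks it as in flight.
  tryAcquire(nowMs: number, changeTriggered: boolean): boolean {
    if (this.inFlight) return false; // natural backpressure: one call at a time
    if (changeTriggered && nowMs - this.lastChangeFireMs < this.changeMinIntervalMs) {
      return false; // change cooldown has not elapsed yet
    }
    this.inFlight = true;
    if (changeTriggered) this.lastChangeFireMs = nowMs;
    return true;
  }

  release(): void {
    this.inFlight = false;
  }
}

// Awaits the vision call before releasing the gate, so ticks that arrive
// while the model is still responding are dropped rather than queued.
async function runGatedVisionCall(
  gate: NoteTakerGate,
  nowMs: number,
  changeTriggered: boolean,
  describeFrame: () => Promise<string>,
): Promise<string | null> {
  if (!gate.tryAcquire(nowMs, changeTriggered)) return null;
  try {
    return await describeFrame();
  } finally {
    gate.release();
  }
}
```

Dropping ticks instead of queueing them keeps the note-taker working on the most recent frame once the model returns.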
What it produces:
- A short observation note appended to brainContextEntries
- No urgency classification needed: the note-taker doesn't decide whether to trigger commentary
- No speech, no output lock interaction, no playback
Note lifecycle (already implemented):
When notes exceed maxNoteEntries, the oldest entries are evicted into pendingCompactionNotes. The existing context compaction system (voiceContextCompaction.ts) folds these into the running compactedContextSummary alongside conversation turns. So temporal continuity is preserved even as the rolling buffer turns over — the summary retains the arc ("started in lobby, fought Eye of Cthulhu, respecced to mage") while recent notes keep granular detail ("Duke Fishron at 43k HP, dodging tornado").
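The eviction step described above behaves like a bounded queue that hands overflow to a second list. The sketch below illustrates that behavior only; it is not the code in voiceContextCompaction.ts, and the RollingNotes class, StreamNote type, and drainPendingCompaction method are hypothetical.

```typescript
// Sketch only: illustrates count-based eviction into a compaction backlog.
interface StreamNote {
  text: string;
  atMs: number;
}

class RollingNotes {
  readonly brainContextEntries: StreamNote[] = [];
  readonly pendingCompactionNotes: StreamNote[] = [];

  constructor(private readonly maxNoteEntries: number = 12) {}

  append(note: StreamNote): void {
    this.brainContextEntries.push(note);
    // Oldest entries leave the rolling buffer first and wait for compaction.
    while (this.brainContextEntries.length > this.maxNoteEntries) {
      const evicted = this.brainContextEntries.shift();
      if (evicted !== undefined) this.pendingCompactionNotes.push(evicted);
    }
  }

  // Hands the backlog to the compaction step and clears it.
  drainPendingCompaction(): StreamNote[] {
    return this.pendingCompactionNotes.splice(0);
  }
}
```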
Model: Configurable separately (noteProvider, noteModel). Should be cheap and fast — haiku-class or flash-class. The prompt is the same context_brain triage prompt minus the urgency field: just "describe what you see in one line."
Commentary loop
The existing proactive commentary mechanism, simplified:
When it fires:
- Steady interval (commentaryIntervalSeconds, default 15s)
- Change-triggered early fire (using the same change scores, with a separate commentary cooldown)
- First frame (share_start)
Gated by (same as today):
- Audio quiet window (2.2s since last inbound audio)
- No pending voice work (hasQueuedVoiceWork)
- No active playback (isStreamWatchPlaybackBusy)
- autonomousCommentaryEnabled toggle
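The four gates above can be read as a single predicate that reports the first reason a commentary turn is blocked. This is a sketch under assumptions: the CommentaryGateInputs shape, the reason strings, and the order in which gates are checked are hypothetical. Only the gate names and the 2.2s window come from this document.

```typescript
// Sketch only: hypothetical input shape and reason labels.
interface CommentaryGateInputs {
  autonomousCommentaryEnabled: boolean;
  msSinceLastInboundAudio: number;
  hasQueuedVoiceWork: boolean;
  isStreamWatchPlaybackBusy: boolean;
}

type CommentaryBlockReason =
  | "commentary_disabled"
  | "audio_not_quiet"
  | "queued_voice_work"
  | "playback_busy";

const AUDIO_QUIET_WINDOW_MS = 2200; // 2.2s since last inbound audio

// Returns null when commentary may fire, otherwise the first blocking gate.
function commentaryBlockReason(s: CommentaryGateInputs): CommentaryBlockReason | null {
  if (!s.autonomousCommentaryEnabled) return "commentary_disabled";
  if (s.msSinceLastInboundAudio < AUDIO_QUIET_WINDOW_MS) return "audio_not_quiet";
  if (s.hasQueuedVoiceWork) return "queued_voice_work";
  if (s.isStreamWatchPlaybackBusy) return "playback_busy";
  return null;
}
```

Returning a reason rather than a boolean makes it easier to see in logs which gate is delaying commentary.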
What it sees:
- Fresh rolling notes from the note-taker (always up to date, even during active conversation)
- Current raw frame (image attached for commentary turns only)
- Full conversation context
What it produces:
- Spoken commentary or [SKIP]
- Additional [[NOTE:...]] inline observations (stored alongside note-taker notes)
Voice reply turns (user speech)
No change from the current post-52e35b6 behavior:
- Rolling notes injected via streamWatchBrainContext (always fresh now thanks to the decoupled note-taker)
- No image attached
- If the user directly asks about the screen ("what's on screen?", "what do you see?"), re-attach the current frame
The screen-question detection can be a simple heuristic or left to the model's reasoning — the notes should be sufficient for most cases, and the frame re-attach is a nice-to-have optimization.
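The rule for which turn types see the raw frame can be summarized in a few lines. The sketch below is illustrative only: VoiceTurnKind, TurnInputs, and buildTurnInputs are hypothetical names, and how a turn gets classified as a screen question is left open.

```typescript
// Sketch only: every turn gets rolling notes; only two turn kinds get the frame.
type VoiceTurnKind = "user_speech" | "commentary" | "screen_question";

interface TurnInputs {
  rollingNotes: readonly string[];
  frame: Uint8Array | null;
}

function buildTurnInputs(
  kind: VoiceTurnKind,
  rollingNotes: readonly string[],
  latestFrame: Uint8Array | null,
): TurnInputs {
  const attachFrame = kind === "commentary" || kind === "screen_question";
  // latestFrame may be null if no frame has been ingested yet.
  return { rollingNotes, frame: attachFrame ? latestFrame : null };
}
```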
Settings Consolidation
Removed settings (after migration)
- brainContextMode: no longer two modes
- brainContextEnabled: notes always run when screen watch is active
- directMinIntervalSeconds: replaced by commentaryIntervalSeconds
- directMaxEntries: replaced by maxNoteEntries
- directChangeThreshold: replaced by unified changeThreshold
- directChangeMinIntervalSeconds: replaced by changeMinIntervalSeconds
- brainContextMinIntervalSeconds: replaced by noteIntervalSeconds
- brainContextMaxEntries: replaced by maxNoteEntries
- minCommentaryIntervalSeconds: replaced by commentaryIntervalSeconds
New unified settings
| Setting | Default | Description |
|---|---|---|
| noteProvider | "claude-oauth" | LLM provider for note-taker vision calls |
| noteModel | "claude-haiku-3-5" | Model for note-taker (cheap/fast) |
| noteIntervalSeconds | 10 | Normal interval for note-taker ticks (3-120) |
| noteIdleIntervalSeconds | 30 | Interval when screen is static / ambient motion only (10-120) |
| staticFloor | 0.005 | Change scores below this are treated as static (0.001-0.05) |
| maxNoteEntries | 12 | Max rolling notes kept in brain context (1-24) |
| changeThreshold | 0.01 | Visual change score that triggers immediate note-taker tick (0.005-1.0). Based on observed data: Terraria boss fights peak at ~0.016, idle screens ~0.001. Start low and tune up if too chatty. |
| changeMinIntervalSeconds | 2 | Cooldown between change-triggered note ticks (1-30) |
| commentaryIntervalSeconds | 15 | Min seconds between proactive commentary turns (5-120) |
| commentaryProvider | (inherit voice) | LLM provider for commentary brain turns |
| commentaryModel | (inherit voice) | Model for commentary brain turns |
| autonomousCommentaryEnabled | true | Master toggle for proactive commentary |
Retained unchanged: enabled, maxFramesPerMinute, maxFrameBytes, keyframeIntervalMs, nativeDiscordMaxFramesPerSecond, nativeDiscordPreferredQuality, nativeDiscordPreferredPixelCount, nativeDiscordPreferredStreamType, sharePageMaxWidthPx, sharePageJpegQuality.
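The defaults and ranges in the table lend themselves to a small clamping helper. The sketch below is one possible shape, not the schema in src/settings/settingsSchema.ts: the STREAM_WATCH_NUMERIC_SETTINGS constant and normalizeNumericSetting function are hypothetical, and falling back to the default on non-numeric input is an assumption.

```typescript
// Sketch only: defaults and ranges copied from the table above.
const STREAM_WATCH_NUMERIC_SETTINGS = {
  noteIntervalSeconds: { default: 10, min: 3, max: 120 },
  noteIdleIntervalSeconds: { default: 30, min: 10, max: 120 },
  staticFloor: { default: 0.005, min: 0.001, max: 0.05 },
  maxNoteEntries: { default: 12, min: 1, max: 24 },
  changeThreshold: { default: 0.01, min: 0.005, max: 1.0 },
  changeMinIntervalSeconds: { default: 2, min: 1, max: 30 },
  commentaryIntervalSeconds: { default: 15, min: 5, max: 120 },
} as const;

type NumericSettingKey = keyof typeof STREAM_WATCH_NUMERIC_SETTINGS;

// Falls back to the default for missing or non-numeric input, then clamps to the range.
function normalizeNumericSetting(key: NumericSettingKey, raw: unknown): number {
  const spec = STREAM_WATCH_NUMERIC_SETTINGS[key];
  const value = typeof raw === "number" && isFinite(raw) ? raw : spec.default;
  return Math.min(spec.max, Math.max(spec.min, value));
}
```

For example, a stored noteIntervalSeconds of 1 would be raised to the minimum of 3, and a missing maxNoteEntries would resolve to 12.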
Implementation Plan
Phase 1: Decouple note-taker from commentary gates
- Extract the note-taker into its own async loop function (runNoteTakerLoop) that:
  - Runs on a setInterval / frame-driven timer
  - Calls the vision model with the current frame
  - Appends the result to brainContextEntries
  - Has no interaction with the output lock, voice work queue, or audio quiet window
  - Awaits each vision call before allowing the next (natural backpressure)
  - Skips calls when change score is near zero and interval hasn't elapsed
- Start the note-taker loop when screen watch activates, stop it when screen watch ends.
- Remove the note-accumulation responsibility from maybeTriggerDirectStreamWatchBrainTurn and maybeTriggerStreamWatchCommentary.
Phase 2: Lower vision diff threshold
- Change the directChangeThreshold default from 0.15 to 0.04
- Validate against log data from live sessions. The Terraria boss fight scores (0.001-0.016) suggest 0.04 would catch scene transitions and major gameplay changes while ignoring minor character animations
Phase 3: Simplify commentary triggers
- Remove the brainContextMode switch: one pipeline, always
- Commentary fires on interval + change trigger, gated by audio quiet window + voice work gates
- Commentary always attaches the current frame (it's the only turn type that does)
- Remove the urgency classification from the note-taker — it just takes notes
Phase 4: Settings migration and documentation
- Map old settings to new settings with backward compat normalization
- Update dashboard UI to reflect unified pipeline
- Update docs/voice/screen-share-system.md to document the note lifecycle end-to-end: note-taker → rolling buffer (12 recent) → eviction → pendingCompactionNotes → compaction into session summary → summary injected into all prompts. This is currently undocumented in the canonical doc.
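The backward-compat normalization could look roughly like the rename pass below. This is a sketch, not the code in src/store/normalize/voice.ts: migrateStreamWatchSettings and both constant tables are hypothetical. It assumes brainContextMinIntervalSeconds maps onto noteIntervalSeconds (the name used in the unified settings table), that an explicitly set new key wins over a legacy one, and that when two legacy keys map to the same new key the one listed first wins.

```typescript
// Sketch only: legacy-to-unified renames, in precedence order.
const LEGACY_RENAMES: ReadonlyArray<readonly [string, string]> = [
  ["directMinIntervalSeconds", "commentaryIntervalSeconds"],
  ["minCommentaryIntervalSeconds", "commentaryIntervalSeconds"],
  ["directMaxEntries", "maxNoteEntries"],
  ["brainContextMaxEntries", "maxNoteEntries"],
  ["directChangeThreshold", "changeThreshold"],
  ["directChangeMinIntervalSeconds", "changeMinIntervalSeconds"],
  ["brainContextMinIntervalSeconds", "noteIntervalSeconds"],
];

// Removed outright: there is no mode switch in the unified pipeline.
const DROPPED_KEYS: readonly string[] = ["brainContextMode", "brainContextEnabled"];

function migrateStreamWatchSettings(old: Record<string, unknown>): Record<string, unknown> {
  const legacyNames = LEGACY_RENAMES.map(([from]) => from).concat(DROPPED_KEYS);
  const out: Record<string, unknown> = {};

  // Copy every key that is not legacy (new and retained settings pass through).
  for (const key of Object.keys(old)) {
    if (legacyNames.indexOf(key) === -1) out[key] = old[key];
  }
  // Fill in new keys from legacy values only where nothing is set yet.
  for (const [from, to] of LEGACY_RENAMES) {
    if (old[from] !== undefined && out[to] === undefined) out[to] = old[from];
  }
  return out;
}
```

A plain rename carries an old directChangeThreshold of 0.15 forward unchanged, so whether stored values at the old default should be reset to the new one is a separate decision this sketch does not make.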
Phase 5: Screen-question frame re-attach (optional)
- Detect when a directly-addressed voice turn is asking about the screen
- Re-attach the current frame for that specific reply
- This is a nice-to-have — the rolling notes should cover most cases
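If detection is done with a simple heuristic, a phrase match on the normalized transcript is about the smallest version. The sketch below uses the example phrases mentioned in this document; the function name looksLikeScreenQuestion and the normalization steps are assumptions, and a broad phrase like "look at" will produce false positives.

```typescript
// Sketch only: keyword heuristic for screen questions.
const SCREEN_QUESTION_PHRASES: readonly string[] = [
  "what's on screen",
  "what do you see",
  "look at",
];

function looksLikeScreenQuestion(transcript: string): boolean {
  const normalized = transcript
    .toLowerCase()
    .replace(/[\u2018\u2019]/g, "'") // curly apostrophes from speech-to-text
    .replace(/\s+/g, " ")
    .trim();
  return SCREEN_QUESTION_PHRASES.some((phrase) => normalized.indexOf(phrase) !== -1);
}
```

A false positive here only costs one extra image attachment, so the heuristic can err on the permissive side.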
Key Source Files to Change
| File | Changes |
|---|---|
| src/voice/voiceStreamWatch.ts | New runNoteTakerLoop, simplify commentary triggers, remove mode switch |
| src/voice/voiceReplyPipeline.ts | Already done (image drop). Optional: screen-question frame re-attach |
| src/settings/settingsSchema.ts | New unified settings, deprecate old mode-specific settings |
| src/store/normalize/voice.ts | Migration normalization for old settings |
| src/voice/voiceSessionManager.ts | Start/stop note-taker loop on watch lifecycle |
| docs/voice/screen-share-system.md | Update to reflect unified pipeline |
| dashboard/src/ | Update stream watch settings UI |
Open Questions
- Note-taker prompt: Should it be the same as the current context_brain triage prompt (one-line observation), or should we give it more structure (e.g. "what changed since last note")?
- Note deduplication: The current system evicts old entries by count. Should the note-taker also skip storing a note if it's semantically identical to the most recent one? (Cheap heuristic: exact string match after normalization.)
- Commentary interval tuning: Current direct mode fires every ~8-10s (though gates delay it). With fresh notes always available, commentary can be less frequent. 15-20s default feels right for "reacting to what's happening" without being chatty, but this needs live tuning.
- Frame re-attach for screen questions: Is simple keyword detection ("what's on screen", "what do you see", "look at") sufficient, or should the addressing classifier handle this?
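For the note deduplication question, the cheap heuristic (exact string match after normalization) could be as small as the sketch below. It is one possible reading, not a decision: normalizeNote and isDuplicateOfLatest are hypothetical names, and the normalization (lowercase, collapse anything that is not a letter or digit) is an assumption.

```typescript
// Sketch only: skip a note if it matches the most recent one after normalization.
function normalizeNote(text: string): string {
  return text.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

function isDuplicateOfLatest(candidate: string, existing: readonly string[]): boolean {
  const latest = existing[existing.length - 1];
  if (latest === undefined) return false; // nothing stored yet
  return normalizeNote(candidate) === normalizeNote(latest);
}
```

Because digits survive normalization, a note whose only change is a number (for example a boss HP value) is still stored as new.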
