docs/voice/music.md

Voice Music

Scope: Canonical music interaction rules for voice sessions — playback phases, output lock interaction, wake-word pause/resume behavior, wake latch semantics, and duck/unduck behavior while the bot speaks. Voice output state machine: voice-output-and-barge-in.md Voice reply admission/orchestration: voice-client-and-reply-orchestration.md Provider/runtime pipeline: voice-provider-abstraction.md Cross-cutting settings contract: ../reference/settings.md


1. Purpose

Persistence, preset inheritance, dashboard envelope shape, and save/version semantics live in ../reference/settings.md. This document only covers music-local voice behavior and the settings that shape that behavior.

Music behavior crosses several voice subsystems:

  • playback lifecycle
  • reply output locking
  • admission gating while music is active
  • wake-word interruption
  • bot-speech ducking

Slash music commands that imply VC playback are also voice-entry points. /music play, /music add, and /music next treat the command itself as an explicit invitation to join the requester's current voice channel when no voice session exists yet. If join policy, permissions, or runtime prerequisites block that bootstrap, playback does not start.

This document is the canonical source of truth for the music-specific rules. Other voice docs should summarize only the part that matters to their own state machine and link back here.

Music Conversation State

Read the diagram as the shortest canonical mental model:

  • paused_wake_word is still the explicit floor-taking state
  • once music resumes and the wake latch is open, ordinary conversation belongs to the main reply brain again
  • the dedicated music brain only exists to short-circuit compact music-side control/disambiguation turns

2. Canonical Concepts

Two separate mechanisms matter:

  • MusicPlaybackPhase: whether music is conceptually present, audible, paused, or idle
  • music wake latch: whether the bot should keep listening for short follow-ups without requiring another wake word

They are related but not the same:

  • music can be playing while the wake latch is open or closed
  • music can be paused_wake_word while the bot is taking the floor
  • the wake latch affects admission behavior, not the playback transport itself

Music is an overlay on top of shared attention, not a third attention mode. Clanker can be ACTIVE or AMBIENT while music is present.

3. Playback Phases

Important phase meanings:

  • loading: a play request has already been accepted and playback startup is in progress; command-only behavior and the music output lock are active even before first PCM arrives
  • playing: music is audible; output lock and command-only music behavior are active
  • paused_wake_word: music was auto-paused because someone explicitly addressed the bot
  • paused: music was paused intentionally, not by wake-word handoff
  • idle: no active music session

Important implications:

  • output lock for music is tied to active playback intent, including the loading phase before audio becomes audible
  • ducking is relevant only while music is playing
  • paused_wake_word is a handoff state, not just a generic pause
  • a resume request does not flip phase back to playing until clankvox confirms actual playback via playerState=playing
  • if a paused phase has no known resumable track state, resume is rejected and the stale music phase collapses back to idle

4. Wake-Word Handoff

Fresh wake while music is actively playing:

  • explicit bot-name / wake-word address pauses music immediately
  • phase becomes paused_wake_word
  • this creates clean conversational floor-taking instead of talking over full-volume music

Wake-word-paused music resume:

  • music does not resume at response.done
  • it resumes only after the assistant reply has actually drained from clankvox
  • the short botTurnOpen guard must also clear before resume happens
  • phase stays paused_wake_word until clankvox confirms that music is really playing again
  • if the user barges in with a new live capture while music is paused_wake_word, auto-resume waits until that interrupting capture clears instead of restarting music into the user's next turn

This means audible UX is anchored to real playback completion, not model completion.

5. Wake Latch

The wake latch is a short follow-up window during music playback.

Public control:

  • voice.admission.musicWakeLatchSeconds

Canonical behavior:

  • a fresh direct address or bot-name cue during active music arms the latch
  • while music is still paused_wake_word, ordinary follow-ups stay owned by the wake-word speaker who opened that pause
  • for ordinary replies spoken over still-playing music, the passive follow-up window refreshes after assistant speech actually settles, not while buffered reply audio is still draining
  • when wake-word-paused music resumes after assistant playback drains, the latch renews from that real resume moment
  • once a follow-up capture is promoted while the latch is open, that turn keeps its eligibility even if finalization lands just after the latch expires
  • while the latch is open, normal conversational follow-ups can pass reply admission without repeating the wake word
  • when the latch expires, non-command non-wake chatter goes back to being denied during active music

The latch is intentionally simple once music is back to playing:

  • open or closed
  • not conversational ownership for ordinary follow-ups
  • command/disambiguation ownership still applies separately where the command system requires it

6. Music Brain

Music control is model-owned once a turn is allowed past the deterministic safety gates.

Deterministic layer responsibilities:

  • verify that the turn is even eligible while music is active
  • enforce wake-word ownership while music is still paused_wake_word
  • enforce the simple open/closed wake latch while music is back to playing
  • keep acoustic safety and barge-in safety separate from conversational decisions

When agentStack.runtimeConfig.voice.musicBrain.mode is dedicated_model, only compact music-control turns go to a small music brain first. That model sees:

  • the heard transcript
  • a tiny slice of recent spoken context: the last assistant reply and the previous turn from the same speaker when available
  • current playback phase and queue state
  • whether the turn was a fresh direct address
  • whether the wake latch is open
  • whether the paused wake-word conversation is still owned by this speaker
  • only the music tool surface

The dedicated music brain then returns one of two outcomes:

  • consumed: handle the turn with music tools and stop there; no normal spoken reply is required
  • pass: this was not really a music-side command; let the main reply path decide whether to answer

The dedicated music brain does not choose pause or duck handoffs anymore. Those are main-brain floor-control decisions. Pending music-choice followups also stay out of the dedicated music brain. The ordinary reply brain sees the active option list in prompt context and decides the follow-up tool call itself.

The dedicated model binding lives under agentStack.runtimeConfig.voice.musicBrain. It stays separate from the reply admission/classifier model and still applies when reply admission is set to generation_decides. Presets still expose a small fallback model when this mode is turned on, but the default runtime mode is disabled, so the main reply brain owns music handoff unless the user explicitly enables the dedicated music brain.

When agentStack.runtimeConfig.voice.musicBrain.mode is disabled, the deterministic safety gates still decide whether the turn is eligible during active music, but the dedicated music brain is bypassed. The main reply brain then owns the temporary music handoff decision itself:

  • media_reply_handoff(mode=pause) for a one-reply floor-taking pause that auto-resumes
  • media_reply_handoff(mode=duck) for a one-reply duck/unduck handoff
  • no handoff tool call when it wants to speak normally over current playback state
  • [SKIP] when the wake word got attention but the bot decides not to respond

Persistent playback tools stay separate:

  • media_pause means leave playback paused beyond the current reply
  • media_resume means request paused playback now; the phase only returns to playing after transport confirmation
  • media_stop means stop playback

7. Pause Versus Duck

This is the intended nuanced behavior:

  • first fresh wake during active music: pause
  • after music resumes and the latch is open: ordinary follow-up replies can stay conversational without another wake word, and the main reply brain decides whether that reply should pause, duck, do nothing, or stay silent
  • a brand-new explicit wake word during that same latch-open window: pause again
  • if the bot speaks while music is still live and no handoff was claimed, it should usually favor a quick reaction or short answer unless the moment clearly wants more
  • if the brain chooses media_reply_handoff, that only means this reply can temporarily take the floor and playback auto-restores afterward; the bot still decides whether the answer stays brief or goes longer

So “latch-open follow-up” and “fresh wake” are intentionally different:

  • latch-open follow-up means “conversation can continue naturally”
  • fresh wake means “the user explicitly wants the bot to take the floor again”

8. Ducking

Ducking is gain-only:

  • duck/unduck lowers and restores music volume
  • it does not pause the track
  • it is used only when the main reply brain explicitly chooses media_reply_handoff(mode=duck) and assistant speech happens while music remains in the playing phase

This is the steady-state path for post-resume follow-ups that do not reopen a fresh wake-word pause.

9. Output Lock Interaction

Music interacts with reply output in two ways:

  • active music playback contributes an orthogonal output lock (music_playback_active)
  • wake-word-paused music temporarily clears the floor so the assistant can answer cleanly

Important distinction:

  • music_playback_active is not part of the assistant output phase machine
  • it is composed with assistant output state at reply-lock evaluation time

For the full output state machine, see voice-output-and-barge-in.md.

10. Admission Interaction

During active music:

  • no wake latch: non-wake chatter is denied
  • while music is paused_wake_word: only the wake-word speaker's ordinary follow-ups can continue without another wake word
  • wake latch open: follow-ups can continue without repeating the wake word
  • a recent same-speaker follow-up immediately after a successful barge-in also stays eligible even if no wake latch is open; both the music prefilter and the final reply admission layer honor that follow-up so interrupted speech is not reclassified as background chatter
  • fresh wake-word/direct-address turns go straight to the main reply brain
  • exact compact control words like pause, stop, skip, and resume use an immediate fast path when the dedicated music brain is enabled
  • fuzzy control turns use the dedicated music brain only to decide whether they should be consumed as music-side commands
  • pending music disambiguation followups always go straight to the main reply brain for ordinary reply planning, even if music is still active
  • with the dedicated music brain disabled, even those control turns go straight to the main reply brain
  • once music-mode handling returns pass, or when the dedicated music brain is disabled, the turn continues through the normal reply admission/reply-generation path
  • the main reply brain sees the active pending query and option list during ordinary reply planning, even if playback has not started yet and music is still effectively idle
  • requester-only cancellation of an active music choice prompt still clears that pending state locally for commands like never mind
  • explicit text-side disambiguation fallback still uses cheap exact/ordinal/title matching, then may use a bounded resolver over the active option list; it never invents a new option id or starts a fresh search from that followup alone
  • music_play treats selection_id as advisory when a query is also present; if the selection id is stale or malformed, the tool logs the bad id and falls back to query search instead of failing the whole play request
  • video_play uses the same playback/disambiguation machinery but constrains lookup to YouTube and hands the selected URL to the outbound publish pipeline when that runtime path is active
  • video_search is the explicit “show me some YouTube options” capability; when site layout or thumbnails matter more than raw candidates, the model can choose browser_browse instead
  • music_queue_next and music_queue_add can resolve ordinary queue requests directly from query text, or reuse an exact selection_id/track id when one is already known
  • for "play X, then queue Y" turns, the intended tool order is music_play first and music_queue_next second in the same tool turn; this avoids stranding the queue intent behind async playback startup
  • spoken confirmations should not claim a track is queued until music_queue_next or music_queue_add has actually succeeded
  • pending music disambiguation or command followups still use the canonical voiceCommandState ownership rules managed by VoiceSessionManager
  • raw PCM music turns use the same transcription-plan and mini-model fallback policy as ordinary voice turns

The wake latch does not force a reply. It only stops the music prefilter from hard-swallowing a turn before the active music-decision layer can decide what kind of handoff, if any, should happen.

11. Logging And Debugging

When debugging music conversation behavior, start with:

  • voice_music_stop_check
  • voice_music_output_halt_preserved_newer_turn
  • voice_music_paused_for_wake_word
  • voice_music_resumed
  • voice_music_resume_unavailable
  • voice_music_output_halted
  • voice_turn_addressing
  • openai_realtime_response_done

Interpretation rules:

  • decisionReason=swallowed: music prefilter consumed the turn before normal reply admission
  • decisionReason=interrupted_reply_followup: a recent same-speaker follow-up after successful barge-in stayed eligible despite music still playing
  • decisionReason=fast_path_pause|stop|skip|resume: an exact compact control word was consumed immediately without invoking the music brain
  • decisionReason=music_brain_consumed: the music brain handled the turn itself, usually with music tools
  • decisionReason=music_brain_pass: the music brain decided this was not really a music-side command and let the ordinary reply path continue
  • decisionReason=main_brain_decides: the turn was forwarded straight to the main reply brain; inspect gateDecisionReason to see whether it came from direct address, wake-latch followup, interrupted-reply followup, or disabled music-brain routing
  • voice_music_resume_unavailable: the session believed music was paused, but no resumable queue/current-track state remained; JS clears the stale paused phase instead of pretending playback restarted
  • voice_music_output_halt_preserved_newer_turn: a late async music_play start finished after a newer same-session reply turn had already been accepted but had not requested playback yet. Music still starts, but JS preserves that newer pre-playback turn instead of aborting it with a session-wide clearPendingResponse()
  • interrupted assistant speech should clear any queued realtime assistant utterances from the abandoned reply before new playback begins
  • paused_wake_word followed by resume after playback drain and capture clear is the expected clean handoff path

12. Code Anchors

  • src/voice/voiceMusicPlayback.ts
  • src/voice/musicWakeLatch.ts
  • src/voice/replyManager.ts
  • src/voice/voiceReplyDecision.ts
  • src/voice/voiceSessionTypes.ts
  • src/voice/sessionLifecycle.ts

Product language: music should feel like background atmosphere that politely makes room when directly invoked, then slips back under the conversation without forcing the user to keep re-summoning the bot.