Media
This document is the canonical product-level guide to media in the selfbot: what the agent can do, how those capabilities fit together, and which deep-dive docs own the transport internals.
Media is one system here, not a pile of separate modes. The same autonomous brain can decide to:
- play music
- play a YouTube video
- search for options first
- browse the web visually before deciding
- start watching a live Discord screen share
The product contract is capability-first: the brain sees available context and tools, then decides what helps most.
Media Surfaces
Music playback
Core tools:
music_playmusic_searchmusic_queue_nextmusic_queue_addmedia_stopmedia_pausemedia_resumemedia_skipmedia_now_playingmedia_reply_handoff
Use this surface for ordinary audio-first listening and queue control.
Deep dive: ../voice/music.md
Video playback
Core tools:
video_playvideo_search
This surface reuses the same underlying playback/disambiguation machinery as music, but it constrains lookup to YouTube and is intended for “put a video on” requests.
If the request is specific enough, video_play should resolve and start
playback directly. If the request is ambiguous, the brain can use
video_search, ask a follow-up question, or use browser_browse when seeing
the YouTube page/thumbnails would help.
Outbound native self publish is a runtime capability behind this surface. The model does not need a separate “start stream publish” tool for ordinary video playback.
Deep dives:
Video context extraction
Core tool:
video_context
Agent-initiated tool for extracting metadata, transcripts, and keyframe images from video URLs and animated GIFs. Supports YouTube (custom scraper, no external deps), plus TikTok, X/Twitter, Reddit, Twitch, Streamable, Tenor/Giphy GIF pages, and any other yt-dlp-supported source when yt-dlp and ffmpeg are available on the host.
The agent decides when to call this tool. It is preferred over web_scrape or
browser_browse when the goal is video-specific content like transcripts or
metadata. For pages requiring JS rendering or interaction, the agent falls back
to browser_browse.
Current-turn Discord video uploads, animated GIF uploads, and coalesced
video/GIF links are surfaced to the model as VID n references in prompt
context. The model may call video_context with a direct url or with a
videoRef like VID 1.
When a user explicitly asks what is in a GIF/video and a current-turn VID n
reference exists, the reply pipeline pre-inspects the media with
video_context and attaches sampled keyframes before the first reply model
call. If extraction is unavailable or fails, the prompt says that no visual
evidence is available so the model should not bluff from the URL or filename.
Direct video URLs (for example Discord CDN .mp4/.webm) and direct .gif
URLs are processed without yt-dlp. Non-direct hosts can use yt-dlp when
available. Keyframe and ASR fallback extraction rely on ffmpeg.
Keyframe sampling is duration-aware. Before running ffmpeg, the service
probes clip duration with ffprobe; if the clip is shorter than
keyframeIntervalSeconds × maxKeyframesPerVideo the interval is compressed to
duration / maxKeyframesPerVideo (floored at ~1/15s) so a short looping GIF
still yields up to maxKeyframesPerVideo evenly-spaced frames instead of one.
The probed duration also backfills durationSeconds on direct sources whose
summary path doesn't expose it, so logs and prompts no longer report
durationSeconds: null for Tenor/Giphy MP4s once frames are extracted.
Missing ffmpeg or yt-dlp is reported as a local runtime dependency blocker,
not as weak visual evidence. The context payload and logs include
missingDependencies plus keyframeErrorCode / transcriptErrorCode, and the
prompt/tool result says that no GIF/video pixels were inspected until the
operator installs the dependency and restarts the bot.
Settings: media.videoContext controls enablement and extraction parameters
(transcript length, keyframe interval, ASR fallback).
Current tool surface: video_context is available to the text/reply-loop
orchestrator. Voice prompts can still include video links and playback context,
but provider-native realtime does not expose video_context until there is a
voice executor for the extraction flow.
Screen watch
Core tool:
start_screen_watch
This is the inbound visual-context surface. The model does not choose between native Discord watch and the share-link fallback directly. Runtime does.
Deep dives:
Browser visual context
Core tools:
browser_browseshare_browser_sessionstop_video_share
This is not “media playback,” but it is part of the same visual system because the browser can be both:
- a reasoning tool for choosing media
- a live visual source for outbound native stream-publish flows
In text replies, images returned by tools can stay private reasoning context or be attached to the final Discord message if the brain explicitly chooses that output shape. Tool images are not auto-posted just because they exist.
Headless browser sessions still render pixels. The runtime can capture those offscreen frames without requiring a visible window.
Deep dives:
Image history context
Current-message image attachments and current-turn direct image links can be sent directly to a vision-capable model as part of the reply turn. Animated GIFs are treated as motion media instead of still images so the video-context path can sample frames. Recent image references from channel history are captioned in the background and cached so later turns can search or recall them without re-reading the image every time.
For URL-backed images, including Discord CDN attachment URLs, Bun fetches the image bytes first and sends base64 image data to the vision model. Providers do not receive Discord CDN URLs to download server-side. If the runtime can no longer fetch the URL, the caption is skipped and the existing cached caption, if any, remains the available context.
Autonomy Rules
The brain should not follow a rigid ladder like:
if ambiguous -> browser -> if failed -> search
Instead:
- use
music_play/video_playfor ordinary "play this now" requests - use
music_search/video_searchwhen the user explicitly wants options - use
video_contextwhen a shared link is a video and you want transcript/metadata - use
web_scrapefor lightweight page text extraction - use
browser_browsewhen page appearance, thumbnails, or navigation matter - use
share_browser_sessionwhen the user should see the live browser itself - use
start_screen_watchwhen live visual context would materially help
The point is not to encode a flowchart. The point is to give the agent enough capabilities to choose well.
Runtime Ownership
High level:
- Bun owns prompts, tool routing, session lifecycle, discovery/control-plane logic, memory, and dashboard state.
clankvoxowns the Discord media plane: RTP, Opus, DAVE, audio output, and native Go Live stream transport.
That means media behavior is split between product orchestration in Bun and transport/media execution in Rust.
Deep dives:
../architecture/overview.md../voice/voice-provider-abstraction.md../../src/voice/clankvox/README.md
Current Shape
As of this repo snapshot:
- music playback is fully shipped
video_play/video_searchexist as product-layer capabilities on top of the playback stack- native Discord screen watch is validated live end to end
- native outbound self publish exists as runtime transport work, with the product surface intentionally kept narrower than the underlying transport
- browser sessions can already feed native outbound browser share flows, even when headless
Deep-Dive Ownership
Use the following split:
- product behavior and capability boundaries: this document
- music queue/playback/disambiguation semantics:
../voice/music.md - inbound screen-watch pipeline and prompt context:
../voice/screen-share-system.md - Discord-native Go Live watch/publish transport:
../voice/discord-streaming.md - browser runtime/session behavior:
browser.md
Historical planning doc:
../archive/selfbot-stream-watch.mdremains useful as an implementation narrative, but it is not the primary product doc anymore
Product language: media is one agentic capability family. Music, video, screen watch, and browser vision are different entry points into the same shared context-and-action system.
