docs/voice/discord-streaming.md

Native Discord Screen Share

Scope: native Discord Go Live in this repo: inbound watch, outbound self publish, what is built, what is still open, and how the share-link fallback fits. Canonical media hub: ../capabilities/media.md Product surface: screen-share-system.md Transport stack: voice-provider-abstraction.md Direct integration plan: ../archive/selfbot-stream-watch.md clankvox local docs: ../../src/voice/clankvox/README.md Reference implementations: Discord-video-stream, Discord-video-selfbot

This repo supports native Discord video in both directions:

  • Native Go Live screen watch: subscribe to an active Discord Go Live stream through Discord's voice media protocol. Uses a separate stream_watch RTC transport.
  • Webcam video watch: subscribe to a user's webcam video on the main voice connection. No separate transport needed — video arrives alongside audio on the same UDP socket. Activated as a fallback when the target has their webcam on but is not Go Live streaming.
  • Native Discord self publish: create our own Go Live stream and send H264 video through the stream server connection.
  • Share-link fallback: send /share/:token, capture with getDisplayMedia(), and POST JPEG frames back into the bot. This transport can be disabled completely with STREAM_LINK_FALLBACK=false.

The model still only sees start_screen_watch for inbound visual context (both Go Live and webcam). Outbound publish is a runtime capability tied to the music pipeline, not a new conversational tool.

In the full-brain voice path, start_screen_watch is exposed as soon as native screen watch is supported and enabled for the session, even if only discovered Go Live state exists and no active share has been bound yet. That lets the brain initiate OP20 STREAM_WATCH in response to "can you see my stream now?" instead of waiting for a fully ready watch state or frame-backed sharer roster before the tool appears.

Current Status

Status validated March 13, 2026.

The native Discord screen watch pipeline is built end to end in clankvox and Bun:

  • selfbot gateway stream discovery for VOICE_STATE_UPDATE.self_stream, STREAM_CREATE, STREAM_SERVER_UPDATE, STREAM_DELETE, and GUILD_CREATE voice state scan for pre-existing streamers on connect
  • Gateway OP20 STREAM_WATCH request dispatch from the selfbot session
  • clankvox stream_watch IPC + second transport role
  • video receive, DAVE decrypt, H264/VP8 depacketization
  • H264: persistent in-process OpenH264 decoder in clankvox decodes all frames (IDR + P-frames), turbojpeg encodes to JPEG, emitted as DecodedVideoFrame IPC
  • VP8: raw bitstream forwarded to Bun for per-frame ffmpeg keyframe decode
  • explicit-target native watch can start from discovered Go Live state before native video-state frames have populated the sharer list

Validated live on the selfbot runtime (March 13, 2026):

  • the selfbot receives stream discovery events
  • STREAM_WATCH yields stream credentials
  • Bun forwards those credentials into clankvox
  • clankvox opens the second stream transport, completes the modern watcher handshake, reaches DAVE-ready, and forwards encrypted H264 frames back to Bun
  • DAVE MLS E2EE session completes successfully on the stream watch transport. DAVE video decrypt works at near 100% on the main voice connection (after the RTP padding strip fix), but per-frame video decrypt still fails for Go Live streams — the davey crate classifies Go Live video as UnencryptedWhenPassthroughDisabled. Screen watch frames arrive during the post-commit unencrypted window instead.
  • H264 frames are decoded in-process by clankvox's persistent OpenH264 decoder and emitted as pre-encoded JPEG via DecodedVideoFrame IPC
  • VP8 keyframes are decoded by Bun via per-frame ffmpeg
  • stream-watch note/commentary pipeline ingests frames and produces accurate visual commentary
  • DAVE channel ID derivation (BigInt(rtcServerId) - 1) confirmed working
  • two-checkpoint link fallback suppression prevents duplicate transports when native watch is active

The native Discord self-publish path is also built in code:

  • selfbot gateway can send OP18 STREAM_CREATE, OP19 STREAM_DELETE, and OP22 STREAM_SET_PAUSED
  • Bun tracks self stream discovery and routes self-owned credentials into a dedicated stream_publish transport role
  • clankvox opens a second sender-side stream connection, advertises H264, encrypts video with DAVE, and packetizes outbound RTP
  • Bun currently drives publish from two source families:
    • music/video relay: start on music play, pause on music pause, stop on music idle/error
    • browser-session share: explicit share_browser_session start/stop around an active browser session
  • voice.streamWatch.visualizerMode controls how music publish renders:
    • "cqt" is the default and starts one shared ffmpeg pipeline inside clankvox that emits PCM for the main voice connection and H264 visualizer access units for Go Live
    • stream_publish_play_visualizer attaches the sender transport to that already-running visualizer feed instead of starting a second media fetch
    • "off" preserves the legacy URL-backed source-video relay path when a real source video track is preferred
  • music publish is no longer limited to YouTube-backed video tracks when a visualizer is active; any playback URL the music pipeline already resolved for the active track can drive the visualizer

Sender-side live Discord validation is still pending in this repo snapshot. What is validated today is protocol shape, transport wiring, automated coverage for create/resume/switch control-plane behavior, browser-session frame forwarding coverage in Bun, and cargo test coverage inside clankvox.

The share-link fallback remains the recovery transport if native watch does not connect or later becomes unhealthy. When the native stream_watch transport fails or disconnects after start, Bun tears the native watch down and requests the share-link path directly when requester + text-channel context are known.

Set STREAM_LINK_FALLBACK=false to disable that transport entirely. In that mode, runtime stays native-only: capability reporting stops advertising the share-link path, native startup failures do not create fallback sessions, and post-connect transport failures log a skipped recovery instead of issuing a share-link request.

Link fallback suppression

When a native stream watch is active, the link fallback is suppressed at two checkpoints inside tryStartLinkFallback:

  1. pre_create — before creating the link session, check if native watch is already active for the target user with a ready transport or decoded frame
  2. post_compose — after composing the link message but before sending, re-check in case native watch became ready during async link creation

This prevents the race where the voice brain's start_screen_watch tool call resolves before stream credentials arrive, triggers a link fallback, and then native watch connects 200ms later — resulting in both transports running in parallel. shouldSuppressLinkFallbackDueToNativeWatch checks transport status, decoded frame presence, or active sharer state.

If the brain asks for start_screen_watch again for the same target while the native watch is already active, Bun reuses that watch instead of reconnecting the stream_watch transport. Tool results now separate transport attachment from visual readiness: frameReady=false means the watch is up but no decoded or buffered image frame exists yet, so the agent should not claim it can already see the screen.

startVoiceScreenWatch also accepts a preferredTransport parameter. Callers can pass "link" to skip native entirely and go straight to the share-link path — used for recovery when native transport fails after initial connection. If STREAM_LINK_FALLBACK=false, "link" requests are rejected.

Why The Regular Voice Connection Cannot See Go Live Streams

Discord uses a dual-connection architecture for Go Live:

┌─────────────────────────────────────────────────────────────┐
│  Regular Voice Connection                                   │
│  - Audio (Opus) send/receive                                │
│  - Speaking state (OP5)                                     │
│  - DAVE encryption                                          │
│  - No Go Live video SSRCs on the main voice leg             │
│  - No OP12 video state received                             │
│  - Endpoint: from VOICE_SERVER_UPDATE                       │
│  - Server ID: guild_id                                      │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│  Stream Connection (separate voice server)                  │
│  - Video frame send/receive                                 │
│  - Video SSRCs assigned in OP2 Ready streams[] array        │
│  - OP12 video state exchanged                               │
│  - OP15 media sink wants for quality negotiation            │
│  - Endpoint: from STREAM_SERVER_UPDATE                      │
│  - Server ID: rtc_server_id from STREAM_CREATE              │
└─────────────────────────────────────────────────────────────┘

Confirmed via live debugging in the older bot-token runtime on March 12, 2026:

  • OP2 Ready on the main voice connection returned video_ssrc=None
  • No OP12 or OP18 video state was ever sent to the main voice connection
  • Adding video: true and streams to the main voice Identify payload did not change this
  • Sending OP12 on the main voice connection did not change this either

The selfbot fork changes the account/control-plane model, not the fact that Go Live uses a separate stream server connection.

Go Live video state and video frames live on a separate stream server that requires its own connection.

Webcam Video on the Voice Connection

Unlike Go Live, webcam video (when a user enables their camera in a voice channel) lives on the main voice connection:

  • Discord sends OP12/OP18 video state updates with streamType: "video" on the main voice WebSocket
  • The user's webcam SSRC appears alongside their audio SSRC on the same UDP socket
  • No separate transport, stream key, or Go Live discovery is needed
  • clankvox sends OP15 media sink wants on the voice connection to request the webcam SSRC
  • The same H264 depacketization, DAVE E2EE decryption, persistent decoder, and JPEG pipeline applies

When both a Go Live stream_watch transport and webcam video exist simultaneously, clankvox partitions media sink wants: screen-share SSRCs route to stream_watch, webcam SSRCs route to the voice connection. The capture supervisor allows webcam frames from the voice transport to pass through even when a stream_watch connection is active (it only suppresses screen-type frames to avoid duplicates).

On the TS side, enableWatchStreamForUser first tries Go Live discovery. If that fails (no stream key — user isn't Go Live streaming), it checks sharerHasWebcamOnly() for webcam streams and subscribes with preferredStreamType: null. Frames flow through the same DecodedVideoFrame IPC → ingestStreamFrame → vision model path.

Discord Protocol: Go Live Stream Architecture

Gateway events (main Discord gateway, not voice gateway)

These are standard Discord gateway dispatch events, not voice WebSocket opcodes:

Gateway opcodeEvent nameDirectionPurpose
18STREAM_CREATEClient → ServerStart a Go Live stream
19STREAM_DELETEClient → ServerStop a Go Live stream
20STREAM_WATCHClient → ServerSubscribe to watch someone's stream
22STREAM_SET_PAUSEDClient → ServerPause/resume a stream

Gateway dispatch events received from Discord:

EventPayloadPurpose
STREAM_CREATE{ stream_key, rtc_server_id }Stream created, provides stream server ID
STREAM_SERVER_UPDATE{ stream_key, endpoint, token }Stream server ready, provides connection credentials
STREAM_DELETE{ stream_key }Stream ended
VOICE_STATE_UPDATE{ ..., self_stream, self_video }User voice state, includes streaming flag

Stream key format

Stream keys identify a specific Go Live stream:

guild:<guildId>:<channelId>:<userId>    # Guild voice channel
call:<channelId>:<userId>               # DM / group call

Voice gateway opcodes (on the stream connection)

These are the same opcodes as the regular voice connection but exchanged on the stream-specific voice server:

Voice opcodeNamePurpose
0IDENTIFYAuthenticate with stream server
1SELECT_PROTOCOLNegotiate codecs and encryption
2READYServer assigns SSRCs (including video SSRCs in streams[])
4SELECT_PROTOCOL_ACKSession description with encryption keys
5SPEAKINGSignal speaking/streaming status (flag 2 = video)
12VIDEOAnnounce/receive video stream attributes
15MEDIA_SINK_WANTSRequest specific video quality/SSRCs

Protocol Flow: Watching a Go Live Stream

Sending (reference control-plane shape)

1. joinVoice()
   → Gateway OP4 VOICE_STATE_UPDATE { guild_id, channel_id }
   → Receive VOICE_STATE_UPDATE { session_id }
   → Receive VOICE_SERVER_UPDATE { endpoint, token }
   → Connect regular voice WebSocket to endpoint
   → Standard voice handshake (OP0→OP2→OP1→OP4)

2. createStream()
   → Gateway OP18 STREAM_CREATE { type, guild_id, channel_id }
   → Gateway OP22 STREAM_SET_PAUSED { stream_key, paused: false }
   → Receive STREAM_CREATE { stream_key, rtc_server_id }
   → Receive STREAM_SERVER_UPDATE { stream_key, endpoint, token }
   → Connect stream voice WebSocket to stream endpoint
   → Stream handshake: OP0 Identify with { server_id: rtc_server_id, video: true, streams: [...] }
   → OP2 Ready includes video SSRCs in streams[] array
   → OP12 VIDEO to announce stream attributes
   → Send video frames via RTP/UDP

Publishing / sending (runtime flow)

1. A publishable source becomes active
   → music playback resolves a publishable source URL
   → if `voice.streamWatch.visualizerMode != "off"`, `music_play` starts one shared `ffmpeg` pipeline that emits both PCM audio and H264 visualizer video
   → if `voice.streamWatch.visualizerMode == "off"`, Bun keeps the legacy URL-backed source-video relay path
   → browser-session share resolves a live browser session key

2. Create or reuse self stream
   → If no self stream exists, send OP18 STREAM_CREATE
   → Send OP22 STREAM_SET_PAUSED { paused: false }
   → Wait for STREAM_CREATE + STREAM_SERVER_UPDATE discovery

3. Connect sender transport
   → Bun sends stream_publish_connect to clankvox
   → clankvox opens stream voice connection to stream endpoint
   → OP0 Identify uses rtc_server_id + main voice session_id
   → OP1 SelectProtocol advertises sender-side H264
   → OP12 VIDEO announces active screen stream attributes

4. Push media
   → visualizer-mode music publish sends `stream_publish_play_visualizer`
   → clankvox attaches the sender transport to the shared music visualizer H264 queue
   → legacy URL-backed music publish sends `stream_publish_play`
   → browser-session share sends stream_publish_browser_start + stream_publish_browser_frame payloads into clankvox
   → clankvox encodes or relays the active source to H264 Annex-B access units
   → DAVE encrypt + RTP packetize each access unit
   → OP5 speaking uses flag 2 on the stream connection

5. Pause / stop
   → music pause sends OP22 STREAM_SET_PAUSED { paused: true }
   → music resume reuses the same stream when discovery is still live
   → browser-session share stop sends OP19 STREAM_DELETE and tears the sender transport down
   → music stop/error/idle sends OP19 STREAM_DELETE and tears the sender transport down

Receiving / watching (runtime flow)

1. Detect Go Live
   → Listen for VOICE_STATE_UPDATE with self_stream: true
   → Seed provisional per-user session Go Live state from that early signal
   → OR listen for STREAM_CREATE dispatch event
   → OR scan GUILD_CREATE voice_states on connect for users already streaming

   The voice session keeps all discovered Go Live users in `goLiveStreams`.
   `goLiveStream` is only the current primary/legacy view, not the source of truth.

2. Subscribe to stream
   → Gateway OP20 STREAM_WATCH { stream_key }
   → If STREAM_CREATE has not arrived yet, synthesize stream_key from guild/channel/user IDs
   → Receive STREAM_CREATE { stream_key, rtc_server_id }
   → Receive STREAM_SERVER_UPDATE { stream_key, endpoint, token }

3. Connect to stream server
   → Open second voice WebSocket to stream endpoint
   → OP0 Identify with { server_id: rtc_server_id, video: true, streams: [...] }
   → OP2 Ready assigns video SSRCs in streams[] array
   → OP12 VIDEO with video state from the streamer

4. Request video quality
    → OP15 MEDIA_SINK_WANTS with `streams`, `pixelCounts`, and `any` default quality

 5. Receive video frames
    → Encrypted RTP/UDP video packets arrive
    → Transport decrypt (AES-GCM/XChaCha20) → depacketize H264/VP8 → DAVE decrypt attempt
     → DAVE video decrypt works on the main voice connection; Go Live streams still fail (frames forwarded from post-commit unencrypted window)
    → H264: persistent OpenH264 decoder in clankvox → turbojpeg JPEG encode → DecodedVideoFrame IPC → stream-watch pipeline
    → VP8: raw bitstream IPC → Bun ffmpeg decode to JPEG → stream-watch pipeline

OP0 Identify (stream connection)

{
  "op": 0,
  "d": {
    "server_id": "<rtc_server_id from STREAM_CREATE>",
    "user_id": "<selfbot user id>",
    "session_id": "<session_id from VOICE_STATE_UPDATE>",
    "token": "<token from STREAM_SERVER_UPDATE>",
    "video": true,
    "streams": [{ "type": "screen", "rid": "100", "quality": 100 }],
    "max_dave_protocol_version": 1
  }
}

OP2 Ready (stream connection response)

{
  "op": 2,
  "d": {
    "ssrc": 12345,
    "ip": "1.2.3.4",
    "port": 50000,
    "modes": ["aead_aes256_gcm_rtpsize", "aead_xchacha20_poly1305_rtpsize"],
    "streams": [
      {
        "type": "video",
        "rid": "100",
        "ssrc": 12346,
        "rtx_ssrc": 12347,
        "active": false,
        "quality": 100
      }
    ]
  }
}

Video SSRCs are in d.streams[].ssrc, not in a top-level d.video_ssrc field.

OP15 Media Sink Wants

clankvox sends Discord media sink wants with streams and pixelCounts maps plus a top-level any default quality:

{
  "op": 15,
  "d": {
    "any": 100,
    "streams": {
      "12346": 100,
      "12347": 0
    },
    "pixelCounts": {
      "12346": 921600.0,
      "12347": 230400.0
    }
  }
}

The pixelCounts map is required for Discord to send video at the requested resolution. Without it, Discord may send the lowest quality layer where keyframes are very sparse. The any field provides a default quality for SSRCs not explicitly listed.

Stream speaking flag

On the stream connection, speaking flag 2 indicates video (vs flag 1 for audio on the regular connection):

{ "op": 5, "d": { "speaking": 2, "delay": 0, "ssrc": 12345 } }

What Is Built

Video receive pipeline (clankvox, Rust)

All of this is implemented and tested:

  • OP12/OP18 remote video state parsing and tracking (voice_conn.rs)
  • OP14 session/codec update handling (voice_conn.rs)
  • OP15 media sink wants send (voice_conn.rs, capture_supervisor.rs)
  • Protected RTCP receiver-report / PLI / FIR keyframe feedback while waiting for the first renderable stream frame (voice_conn.rs, capture_supervisor.rs)
  • Video codec advertisement in SelectProtocol: H264, VP8 decode (voice_conn.rs)
  • Non-Opus RTP acceptance: H264 (PT 103), VP8 (PT 105) (voice_conn.rs)
  • DAVE video decrypt path (dave.rs)
  • H264 depacketization: single NAL, STAP-A, FU-A (video.rs)
  • VP8 depacketization: payload descriptor stripping, frame reassembly (video.rs)
  • RTP sequence gap detection to prevent cross-packet frame corruption (video.rs)
  • Persistent H264 decode via OpenH264 (openh264 crate v0.9) with per-user decoder state (video_decoder.rs)
  • JPEG encode via turbojpeg (turbojpeg crate v1.4) for decoded H264 frames (video_decoder.rs)
  • DecodedVideoFrame IPC emission with pre-encoded JPEG, dimensions, and scene-cut metrics (video_decoder.rs, ipc.rs)
  • Video frame IPC: UserVideoState, UserVideoFrame, UserVideoEnd, DecodedVideoFrame (ipc.rs)
  • Dedicated bounded outbound video queue with backpressure logging (ipc.rs)
  • Per-user video subscription management (capture_supervisor.rs)
  • Remote video state merge semantics for partial updates (capture_supervisor.rs)
  • Session reconnect teardown of stale video state (capture_supervisor.rs)

Audio receive pipeline (clankvox, Rust)

All of this is implemented and tested:

  • OP5 speaking payload parsing into the audio SSRC map (voice_conn.rs)
  • DAVE audio decrypt on the main voice transport (dave.rs, voice_conn.rs)
  • Audio SSRC remap fallback from successful DAVE decrypt when OP5 is delayed or missing (voice_conn.rs)
  • Opus decode plus PCM downmix/resample into Bun-visible UserAudio IPC (capture_supervisor.rs, audio_pipeline.rs, ipc.rs)

Bun runtime integration

  • clankvoxClient.ts: IPC for subscribe_user_video, unsubscribe_user_video, video events
  • nativeDiscordScreenShare.ts: Active sharer tracking, target resolution
  • sessionLifecycle.ts: Video state/frame/end event handlers, onDecodedVideoFrame for H264 JPEG ingest
  • screenShare.ts: Native-first with share-link fallback, telemetry
  • voiceStreamWatch.ts: stream-watch pipeline, frame ingest from DecodedVideoFrame (H264) and ffmpeg (VP8)

Video send pipeline (clankvox + Bun)

Implemented in code:

  • streamDiscovery.ts: OP18/OP19/OP22 send helpers plus self stream discovery
  • voiceStreamPublish.ts: session state, visualizer-aware source resolution, and self-stream connect orchestration
  • voiceBrowserStreamPublish.ts: browser-session capture loop and frame forwarding into clankvox
  • sessionLifecycle.ts: publish transport lifecycle, recovery, and stream discovery callbacks
  • clankvoxClient.ts: stream_publish_* IPC commands, including stream_publish_play_visualizer, and transport state relay
  • clankvox: sender-side stream_publish transport role, H264 codec negotiation, DAVE video encrypt, RTP packetization, shared music visualizer pipeline, legacy URL relay, and browser-frame encode path

Current rollout boundary:

  • outbound publish is wired to music/video relay and explicit browser-session share, not a general-purpose arbitrary file/share tool
  • music relay defaults to a shared audio visualizer (voice.streamWatch.visualizerMode: "cqt") and only falls back to source-video relay when that setting is "off"
  • sender transport is H264-only

What is still open

  1. RTX retransmission receive — the current UDP receive path still drops RTX packets, so packet-loss recovery remains limited until retransmission support lands
  2. Live sender validation — the sender path is implemented, but this repo snapshot still needs Discord-live validation against a real selfbot session
  3. Broader publish surfaces — outbound publish still centers on music lifecycle and explicit browser-session share; there is no general-purpose arbitrary file/share tool yet

Implementation Plan

Current Receive Flow

  • Selfbot gateway records active Go Live streams and credentials in stream discovery state
  • On connect (or full reconnect), GUILD_CREATE voice states are scanned for users with self_stream: true to detect streams that were already active before the bot connected
  • start_screen_watch requests OP20 STREAM_WATCH and records the expected stream key on the voice session
  • If credentials are already present, Bun immediately sends stream_watch_connect to clankvox
  • If credentials arrive later, discovery callbacks connect the stream_watch transport when STREAM_SERVER_UPDATE lands
  • clankvox emits role-aware transport_state, userVideoState, userVideoFrame, and userVideoEnd
  • For H264, clankvox decodes all frames (IDR + P-frames) via a persistent OpenH264 decoder, encodes to JPEG via turbojpeg, and emits DecodedVideoFrame IPC messages; Bun ingests the pre-encoded JPEG directly
  • For VP8, Bun decodes sampled keyframes to JPEG via per-frame ffmpeg
  • Decoded frames from either codec path feed into the existing stream-watch commentary path
  • STREAM_DELETE prunes the cached sharer and tears native watch down
  • Native transport failure/disconnect tears native watch down and directly requests share-link fallback when requester + text-channel context exist

Current Publish Flow

  • music playback enters playing
  • voiceMusicPlayback.ts resolves the playback URL for the active track and starts music_play
  • if voice.streamWatch.visualizerMode != "off", music_play starts a shared ffmpeg pipeline in clankvox that emits PCM audio plus H264 visualizer access units
  • voiceStreamPublish.ts resolves a publishable source from session music state or an active browser-session share
  • if no self stream exists yet, Bun sends OP18 STREAM_CREATE
  • self stream discovery callbacks hand STREAM_CREATE / STREAM_SERVER_UPDATE credentials back into the owning voice session
  • Bun sends stream_publish_connect, then either stream_publish_play_visualizer for shared visualizer mode or stream_publish_play for legacy URL relay
  • clankvox starts or reuses the sender-side stream transport, marks video active, and sends H264 RTP frames
  • music pause sends stream_publish_pause plus OP22 paused: true
  • music resume reuses the live stream when possible instead of recreating it
  • music idle/error/stop sends stream_publish_stop, stream_publish_disconnect, and OP19 STREAM_DELETE

Product Behavior

The model still only sees start_screen_watch. Runtime resolves transport:

  1. Check if target user has self_stream: true (Go Live active)
  2. If yes → open stream connection, subscribe, receive native frames
  3. If no → fall back to share-link path

The share-link fallback remains the recovery path when:

  • The target user is not Go Live sharing
  • Stream connection fails to establish
  • Selfbot stream subscription fails or the native transport is unavailable
  • Multiple sharers are active and target is ambiguous

Settings

Native subscription tuning and music Go Live visualizer settings live under voice.streamWatch:

  • visualizerMode (default: "cqt")
  • nativeDiscordMaxFramesPerSecond (default: 2)
  • nativeDiscordPreferredQuality (default: 100)
  • nativeDiscordPreferredPixelCount (default: 921600 / 1280x720)
  • nativeDiscordPreferredStreamType (default: "screen")

There is no separate standalone settings block for outbound native publish. Music playback reuses voice.streamWatch.visualizerMode to choose between the shared audio-visualizer path and the legacy source-video relay path.

Screen Watch Note Pipeline

voice.streamWatch now uses one screen-watch pipeline:

  • Every admitted frame updates the latest-frame buffer.
  • A separate note loop keeps rolling screen notes fresh with noteProvider, noteModel, noteIntervalSeconds, noteIdleIntervalSeconds, staticFloor, changeThreshold, and changeMinIntervalSeconds.
  • Proactive commentary uses commentaryIntervalSeconds plus the normal voice quiet-window gates, and can optionally override the voice model with commentaryProvider / commentaryModel.
  • Raw frames are attached only for commentary turns and direct screen questions. Normal conversation turns rely on the rolling notes.

The full product-level behavior and settings contract live in screen-share-system.md.

Observability

Stream discovery layer (implemented)

  • native_discord_go_live_detected — selfbot gateway saw self_stream: true
  • stream_discovery_existing_streamer_detectedGUILD_CREATE voice state scan found a user already streaming on connect
  • stream_discovery_guild_create_scan_completeGUILD_CREATE scan finished with count of existing streamers found
  • stream_discovery_go_live_bootstrap_seeded — Bun seeded provisional session Go Live state from self_stream: true
  • stream_discovery_go_live_bootstrap_cleared — Bun cleared stale provisional session Go Live state after self_stream: false
  • native_discord_stream_watch_requested — Gateway OP20 sent
  • native_discord_stream_server_receivedSTREAM_SERVER_UPDATE received
  • native_discord_stream_connection_started — second clankvox connection opening

Video receive pipeline (implemented)

  • clankvox_discord_video_state_observed — voice connection saw OP12/OP18 video state
  • clankvox_native_video_state_received — capture supervisor applied video state
  • clankvox_native_video_state_emitted — state emitted over IPC to Bun
  • native_discord_screen_share_state_updated — Bun updated active-sharer state
  • clankvox_native_video_subscribe_requested — Rust accepted video subscription
  • clankvox_video_sink_wants_updated — Rust recalculated OP15 media sink wants
  • clankvox_first_video_frame_forwarded — first video frame forwarded to Bun
  • stream_watch_frame_ingested — Bun ingested frame into stream-watch pipeline

Failure diagnostics

  • screen_watch_native_start_failed — native path failed, includes reason, nativeActiveSharerCount, selectionReason
  • screen_watch_link_fallback_started — fell back to share-link, includes nativeFailureReason
  • native_discord_stream_transport_link_fallback_requested — native transport degraded after start and Bun requested link-only recovery
  • native_discord_stream_transport_link_fallback_failed — degraded native transport could not recover into the share-link path

Risk Assessment

  • DAVE video decrypt. DAVE video decrypt works at near 100% on the main voice connection after the RTP padding strip fix (strip_rtp_padding in rtp.rs). Go Live streams are a separate, still-open problem: the davey library classifies Go Live video frames as UnencryptedWhenPassthroughDisabled when they are actually encrypted. Video frames that fail DAVE decrypt are dropped (validated by looks_like_valid_h264). Screen watch relies on the initial unencrypted frames that arrive during the DAVE transition window after session commit. Stream connections use DAVE channel ID BigInt(rtc_server_id) - 1n (confirmed working live).
  • PLI/FIR keyframe requests do not work. clankvox uses protocol: "udp" for stream connections. Discord's media server only processes RTCP PLI/FIR feedback from WebRTC peers. PLI/FIR packets are still sent as a best-effort hint (periodic, after decoder reset, after DAVE ready), but keyframes cannot be requested on demand. The persistent H264 decoder compensates by processing all frames (IDR + P-frames) for reference state accumulation, and auto-resets with PLI after 50 consecutive errors.
  • Transport crypto AAD for RTP. Discord's rtpsize AEAD modes authenticate the RTP fixed header + CSRC + 4-byte extension prefix as AAD. The extension body is part of the ciphertext. decrypt() recomputes AAD from raw packet bytes and must NOT use parse_rtp_header's header_size (which includes the extension body). Inbound RTCP (PT 72-76 per RFC 5761 mux) is filtered before decrypt.
  • VP8 ffmpeg decode EOF. VP8 still uses per-frame ffmpeg decode on the Bun side. H264 decode is handled entirely in-process by clankvox's persistent OpenH264 decoder and does not use ffmpeg.
  • RTX loss recovery is still limited. The receive path currently traces and drops RTX payloads; retransmission support remains future work.
  • Sender compatibility is narrower than receive. Outbound publish is still H264-only and still centered on music lifecycle or browser-session share. visualizerMode: "off" also depends on a URL-backed source relay path being available.

Reference: Discord-video-stream

The Discord-video-stream library implements the sending side of Go Live for selfbot accounts. Key observations:

  • Uses discord.js-selfbot-v13, not regular discord.js — selfbot tokens may have different capabilities than bot tokens
  • VoiceConnection for regular voice, StreamConnection for Go Live — both extend BaseMediaConnection
  • Stream connection uses rtc_server_id from STREAM_CREATE as its serverId
  • OP0 Identify always includes video: true and streams: [{ type: "screen", rid: "100", quality: 100 }]
  • Video SSRCs come from OP2 Ready d.streams[0].ssrc and d.streams[0].rtx_ssrc, not d.video_ssrc
  • Uses protocol: "webrtc" with SDP in SelectProtocol, not protocol: "udp"
  • STREAM_WATCH (OP20) is defined but not implemented (library only sends, doesn't watch)
  • Speaking flag 2 on the stream connection indicates video streaming

Reference: Discord-video-selfbot

The Discord-video-selfbot library is an older sender-side Go Live implementation. Useful observations:

  • Uses a second StreamConnection for Go Live, separate from the main voice connection
  • Fills the stream connection serverId from STREAM_CREATE.rtc_server_id
  • Reuses the main voice session_id for the stream leg
  • Uses protocol: "udp" with standard IP discovery and SELECT_PROTOCOL
  • Does not implement STREAM_WATCH or inbound video receive
  • Uses older xsalsa20_* encryption modes, so it is not a modern DAVE reference

Product Language

Prefer:

  • "start screen watch"
  • "watch your screen"
  • "I can start watching that share"
  • For spoken voice replies, say "open the link I sent" or "open that screen-share link" instead of reading the full URL aloud.

Avoid:

  • "I can already see your screen" unless frame context is actually active
  • Reading full screen-share URLs, hostnames, or token strings aloud in voice
  • "native Discord watch always works" because it depends on stream discovery working for the selfbot session