docs/archive/selfbot-stream-watch.md

Selfbot Stream Watch — Direct clankvox Integration Plan

Status: watch path complete — native stream watch validated end-to-end on March 13, 2026

Current product-level docs:

This file remains useful as the implementation narrative and validation trail for the direct-integration work, but it is no longer the primary product doc.

References:

../voice/discord-streaming.md
../voice/screen-share-system.md
../voice/voice-provider-abstraction.md
../notes/NATIVE_SCREENSHARE_NOTES.md
../operations/logging.md
Discord-video-stream — WebRTC-based Go Live send (reference protocol)
Discord-video-selfbot — older UDP-based Go Live send (useful transport evidence)

Goal

Enable start_screen_watch to watch Discord Go Live streams natively through clankvox without relying on the share-link fallback.

The selfbot runs on a user token. It is already a user-session participant in the voice channel. It can directly receive stream discovery dispatch, send STREAM_WATCH, and open a stream-server media connection — no separate account or gateway needed.

The end state is:

one product capability: start_screen_watch
one runtime tree: Bun app + clankvox
one native receive pipeline: clankvox video receive -> Bun keyframe decode -> existing stream-watch ingest
one fallback path: share-link capture remains the recovery transport

Implemented in the repo now:

selfbot gateway patches and raw dispatch hooks
stream discovery state for VOICE_STATE_UPDATE.self_stream, STREAM_CREATE, STREAM_SERVER_UPDATE, and STREAM_DELETE
Bun-side STREAM_WATCH request wiring for native screen watch
clankvox stream_watch IPC, second transport role, and role-aware transport_state
voiceStreamWatch / session lifecycle integration that requests the stream, opens clankvox when credentials exist, and tears it down on stop/delete
native transport failure now forces the share-link fallback directly instead of re-entering native selection

Validated live March 13, 2026:

selfbot receives STREAM_CREATE / STREAM_SERVER_UPDATE after OP20 STREAM_WATCH
stream server accepts modern UDP + DAVE watcher flow
DAVE MLS E2EE handshake completes, video frames decrypt successfully
H264 depacketization + Annex-B normalization + ffmpeg decode produces valid JPEG
stream-watch brain context pipeline receives decoded frames and produces commentary
two-checkpoint link fallback suppression prevents duplicate transports when native watch is active

Still open:

RTX retransmission receive for packet-loss recovery
outbound native stream send

This plan assumes personal experimentation in a private server.

Non-Goals

teaching the model a second tool or transport-specific capability
removing the share-link fallback
building a full smooth video player UI inside the selfbot runtime

The native path remains a screen-watch context feature, not a full remote desktop product.

Product Contract

The autonomy contract does not change:

the model still only sees start_screen_watch
runtime still decides transport
active sharers remain context, not automatic visibility
the agent can still choose silence after it starts watching

Why Direct Integration

We already have most of the expensive lower-half work in place:

H264/VP8 receive
DAVE video decrypt
remote video state tracking
OP15 media sink wants
keyframe decode to JPEG
stream-watch ingest, scanner notes, and commentary

The biggest missing gap is not visual reasoning. It is stream discovery plus a second media connection for the stream-server leg.

Since the selfbot is already a user-session participant, we skip the entire "how do we get a user session into the channel" problem. Direct integration:

reuses the existing clankvox video plumbing instead of duplicating it
keeps screen-watch state in the existing voice session runtime
keeps logs, tests, and failure handling in the main repo
avoids a temporary architecture we would later have to delete

Reference Takeaways

The two external reference repos are useful, but they prove different things:

Discord-video-stream
- useful for modern Go Live control-plane shape
- shows rtc_server_id used as the stream connection server_id
- shows READY.d.streams[] is the important source of stream SSRC metadata
- does not implement watch / receive
Discord-video-selfbot
- useful as evidence that a stream connection can follow the same broad voice-WS + UDP handshake pattern as regular voice
- shows the stream leg reuses the main voice session_id
- does not implement watch / receive
- predates modern AEAD + DAVE handling and uses older xsalsa20_* modes

Working conclusion:

UDP is the right first transport to try for stream_watch
WebRTC is no longer the leading risk, but is still a fallback contingency
receive-side SSRC and payload-type mapping must stay runtime-derived
watcher-side OP12 announcements must not synthesize missing READY.d.streams[] metadata
modern receive-side stream watch remains unproven until live validation

Key Design Decision

Direct integration does not mean forcing every concern into clankvox.

Planned split:

Bun owns Discord control-plane concerns:
- user-token gateway session (the selfbot's own session)
- stream discovery dispatch handling
- mapping Discord stream events into session intent
clankvox owns media-plane concerns:
- voice transport (audio send/receive, music, TTS, speaking state)
- stream-server video transport (Go Live receive)
- DAVE, RTP, depacketization, OP15, IPC video events

This keeps clankvox focused on transport/media while still making it the native owner of the stream watch path.

Target Architecture

High-level shape

The integrated design has two transport roles:

voice
- the selfbot's voice connection to the channel
- audio send/receive, music, TTS, speaking state
stream_watch
- stream-server connection for Go Live video receive
- receives Go Live video metadata and frames

Both connections use the same user identity (user id, session id) since the selfbot is a single account. The stream-server connection uses stream-specific credentials from STREAM_SERVER_UPDATE.

Why two transport roles instead of one

The voice connection and the stream-server connection are distinct Discord media connections with different endpoints, tokens, and server ids. They must be modeled as separate transport slots in clankvox so that:

stream transport disconnect does not tear down the voice session
each has its own DAVE manager and reconnect lifecycle
audio behavior stays attached only to the voice role
video subscriptions stay attached only to the stream watch role

Runtime Flow

1. Voice session begins

selfbot joins VC via its gateway session
stream discovery dispatch is already available on the selfbot's gateway

2. Go Live discovery

The selfbot's gateway session receives dispatch and records:

VOICE_STATE_UPDATE with self_stream
STREAM_CREATE
STREAM_SERVER_UPDATE
STREAM_DELETE

This updates native sharer metadata in the active voice session.

3. Screen watch request

When runtime resolves start_screen_watch:

choose the target sharer as today
if a native sharer is active, send STREAM_WATCH via gateway
on STREAM_CREATE + STREAM_SERVER_UPDATE, forward stream credentials into clankvox
clankvox opens the stream_watch transport

4. Frame receive

The stream connection:

identifies against the stream server with the selfbot's credentials
receives OP12 / OP18 video state
sends OP15 media sink wants
receives encrypted video RTP
DAVE decrypts and depacketizes video
emits video_state / video_frame / video_end

Node then:

decodes sampled keyframes to JPEG with the existing ffmpeg path
feeds them into ingestStreamFrame
preserves the existing scanner + commentary behavior

5. Teardown

Native watch tears down on:

STREAM_DELETE
target leaving VC
selfbot leaving VC
stream transport disconnect
explicit stop watch

If the native path cannot start or becomes unhealthy, runtime falls back to the existing share-link path.

Direct Integration Plan

Milestone 1: Prepare the runtime model for multiple transport roles

This is the first change because the current Rust state model is still single-connection.

Current state in clankvox:

one pending connection tuple
one voice_conn
one reconnect loop
one DAVE manager
one disconnect path that assumes all media is the primary voice connection

That is incompatible with stream watch.

Refactor target:

introduce TransportRole
- Voice
- StreamWatch
replace singleton connection fields with role-scoped transport slots
make VoiceEvent role-aware so stream transport disconnect does not tear down the voice session
keep audio/mixer behavior attached only to Voice
keep stream-watch video subscriptions and remote video state attached only to StreamWatch

Recommended shape in Rust:

enum TransportRole {
    Voice,
    StreamWatch,
}

struct TransportSlot {
    role: TransportRole,
    pending: PendingConnection,
    conn: Option<VoiceConnection>,
    dave: Arc<Mutex<Option<DaveManager>>>,
    reconnect_deadline: Option<Instant>,
    reconnect_attempt: u32,
}

This avoids a second wave of cleanup later.

Milestone 2: Generalize `VoiceConnection::connect`

The current connection API is regular-voice-specific. Stream watch should not be expressed as boolean branches hanging off guild/channel fields.

Refactor the connection params around explicit transport inputs:

pub struct VoiceConnectionParams<'a> {
    pub endpoint: &'a str,
    pub server_id: &'a str,
    pub user_id: u64,
    pub session_id: &'a str,
    pub token: &'a str,
    pub dave_channel_id: u64,
    pub role: TransportRole,
    pub identify_streams: &'a [IdentifyStream],
    pub speaking_mode: Option<u32>,
}

Important consequences:

server_id becomes explicit instead of assumed from guild id
dave_channel_id becomes explicit instead of assumed from channel id
identify payload construction becomes role-driven
stream watch stops pretending it is a normal guild voice connection

Milestone 3: Wire stream discovery through the selfbot gateway

The selfbot already has a gateway session. Extend it to track stream dispatch:

receive stream-related dispatch:
- VOICE_STATE_UPDATE with self_stream
- STREAM_CREATE
- STREAM_SERVER_UPDATE
- STREAM_DELETE
send gateway opcodes:
- OP20 STREAM_WATCH
- OP22 STREAM_SET_PAUSED when needed

This is not a new module — it extends the existing selfbot gateway handling to capture stream events and expose them to the voice session runtime.

Milestone 4: Extend session state and native sharer tracking

Extend VoiceSessionNativeScreenShareState in:

src/voice/voiceSessionTypes.ts
src/voice/nativeDiscordScreenShare.ts

Add stream-watch runtime state:

active stream-key registry
pending target stream key
latest rtc_server_id
latest stream endpoint/token receipt time
stream transport connection state
failure classification for fallback

This state must be visible to:

target resolution
native watch eligibility
operational messaging
realtime instruction refresh
dashboard/runtime snapshots

Milestone 5: Add new IPC commands for stream-watch transport

Extend:

src/voice/clankvoxClient.ts
src/voice/clankvox/src/ipc_protocol.rs
src/voice/clankvox/src/ipc.rs

Recommended commands:

stream_watch_connect
stream_watch_disconnect

Recommended payload for stream_watch_connect:

{
  "type": "stream_watch_connect",
  "endpoint": "<stream endpoint>",
  "token": "<stream token>",
  "serverId": "<rtc_server_id>",
  "sessionId": "<selfbot session id>",
  "userId": "<selfbot user id>",
  "daveChannelId": "<derived dave channel id>"
}

Recommended output events:

transport_state
- include role
- include status
- include reason
keep legacy connection_state only for the primary voice transport until the Bun session lifecycle fully migrates to transport_state

Do not add stream-watch-only bespoke state messages when a role-aware transport state envelope is enough.

Milestone 6: Implement role-aware disconnect and reconnect handling

This is mandatory. Without it, stream-watch failures will incorrectly collapse the voice session.

Behavior target:

voice disconnect
- current reconnect and session-failure behavior stays
stream_watch disconnect
- stop native watch
- preserve voice session
- classify the failure for possible share-link fallback

All reconnect scheduling and cleanup functions in clankvox should become role-scoped.

Milestone 7: Parse stream-server `READY` correctly

The current implementation already parses remote OP12 / OP18 video state well, but the stream-server handshake still assumes regular-voice shape.

Required changes:

parse READY.d.streams[]
treat streams[].ssrc and streams[].rtx_ssrc as canonical stream-watch SSRCs
keep top-level video_ssrc parsing only as a compatibility path
normalize the chosen stream descriptors into the existing video state model

This change should happen in:

src/voice/clankvox/src/voice_conn.rs

Milestone 8: Make DAVE explicitly role-specific

Do not share one DAVE manager across voice and stream watch.

Required changes:

one DAVE manager per transport slot
explicit DAVE channel id input
explicit user id per slot

Expected DAVE derivation:

voice: current channel-based behavior
stream_watch: derived from rtc_server_id

If live validation confirms BigInt(rtc_server_id) - 1 semantics, encode that derivation in one well-named helper instead of sprinkling it through the call sites.

Milestone 9: Stream transport protocol — UDP-first path strongly supported

Discord-video-selfbot significantly reduces the stream-transport uncertainty. It shows an older sender-side Go Live stream connection using protocol: "udp" with the same broad handshake shape as regular voice:

IP discovery
OP1 SELECT_PROTOCOL
OP4 SESSION_DESCRIPTION

That is strong evidence that clankvox should attempt a direct UDP stream-watch transport first instead of starting with a WebRTC stack.

Reference evidence (Discord-video-selfbot/src/Class/BaseConnection.ts):

selectProtocols() sends { protocol: "udp", data: { address, port, mode } }
StreamConnection extends BaseConnection — inherits the same UDP handshake
STREAM_CREATE fills the stream connection's serverId from rtc_server_id
the stream connection reuses the main voice session_id
encryption is xsalsa20_poly1305_lite and voice version is 7

What this reference does prove:

a Discord stream leg can exist as a second media connection
that leg can use a voice-style UDP handshake instead of requiring WebRTC
rtc_server_id and the session id are the important identity inputs

What this reference does not prove:

that modern STREAM_WATCH transport still behaves the same
that modern v9 + AEAD + DAVE stream servers accept the same mode set
that receive-side video SSRCs can be inferred as ssrc + 1 / ssrc + 2
that inbound frames arrive on the modern stack

This means:

UDP should be the first implementation path. We should try the existing clankvox transport model, generalized for the stream-watch role, before spending time on WebRTC.
WebRTC remains a contingency, not a baseline assumption. The sender-side WebRTC approach in Discord-video-stream is not enough reason to build a full WebRTC stack first.
Receive-side metadata stays negotiated. Treat READY.d.streams[] and advertised codecs as canonical. Do not hardcode sender-side ssrc + 1 conventions.
DAVE and encryption still need live confirmation. The reference predates DAVE entirely, so clankvox must derive channel IDs and select modern modes from what the stream server actually advertises.

Remaining validation:

confirm STREAM_WATCH produces a modern stream-server credential flow
confirm the stream server advertises and accepts clankvox's modern AEAD mode set
confirm DAVE channel ID derivation (BigInt(rtc_server_id) - 1) works
confirm video frames arrive over the UDP path using modern v9 semantics

Milestone 10: WebRTC fallback (contingency only)

Keep this milestone in the plan as insurance, but it is now low-priority. If live testing reveals that modern Discord stream servers have moved to WebRTC-only since the Discord-video-selfbot library was last updated, the fallback plan remains:

keep voice on the existing transport
implement WebRTC only for stream_watch
keep the output of that path normalized into the existing video events

The product should not care whether the transport underneath is UDP or WebRTC.

Milestone 11: Wire native watch start/stop through the existing screen-watch surface

Update:

src/bot/screenShare.ts
src/voice/voiceStreamWatch.ts
src/voice/sessionLifecycle.ts
src/voice/instructionManager.ts

Behavior target:

native watch eligibility depends on stream discovery state and transport readiness
explicit target selection still works as today
ambiguous multi-sharer handling still stays conservative
transport failure can force share-link fallback without retrying native first
fallback to share-link remains intact

Do not add transport-specific model tools.

File-by-File Plan

Bun / TypeScript

src/app.ts
- ensure stream discovery is wired into the selfbot gateway lifecycle
src/config.ts
- add env parsing for stream-watch feature toggles
src/voice/nativeDiscordScreenShare.ts
- extend state normalization and target resolution with stream-key state
src/voice/voiceSessionTypes.ts
- add stream-watch transport-state fields
src/voice/sessionLifecycle.ts
- subscribe to stream discovery events
- map stream creation/update/delete into IPC commands and native state
src/voice/clankvoxClient.ts
- add new IPC commands and role-aware transport-state events
src/bot/screenShare.ts
- native-first eligibility now includes stream discovery readiness
src/voice/voiceStreamWatch.ts
- keep existing watch behavior, but use improved failure/fallback reasons

Rust / clankvox

src/voice/clankvox/src/ipc_protocol.rs
- add stream-watch connection commands
src/voice/clankvox/src/ipc.rs
- parse and emit role-aware transport messages
src/voice/clankvox/src/app_state.rs
- replace singleton connection state with role-scoped transport slots
src/voice/clankvox/src/connection_supervisor.rs
- role-aware connect / reconnect / teardown
src/voice/clankvox/src/voice_conn.rs
- generalize connect params
- role-aware identify and DAVE setup
- parse READY.d.streams[]
- optionally support stream-role WebRTC later
src/voice/clankvox/src/capture_supervisor.rs
- consume role-aware video events cleanly
- preserve existing subscription/media-sink logic
src/voice/clankvox/src/dave.rs
- ensure separate DAVE lifecycle per transport slot

Observability Plan

Add explicit logs for every important control-plane and media-plane edge.

Suggested action names:

stream_discovery_dispatch
stream_watch_requested
stream_create_received
stream_server_received
stream_delete_received
stream_watch_transport_connect_start
stream_watch_transport_connected
stream_watch_transport_failed
stream_watch_transport_disconnected
stream_watch_transport_protocol_rejected
stream_watch_native_fallback_started

Suggested Rust tracing events:

clankvox_transport_connect_start
clankvox_transport_ready
clankvox_transport_reconnect_scheduled
clankvox_transport_disconnected
clankvox_stream_ready_streams_parsed
clankvox_stream_dave_initialized

The logs should always include:

sessionId
guildId
targetUserId
streamKey
transportRole
endpointHost when safe
close code / failure reason

Testing Plan

TypeScript tests

stream discovery dispatch parsing for VOICE_STATE_UPDATE, STREAM_CREATE, STREAM_SERVER_UPDATE, STREAM_DELETE
native eligibility and target selection with stream discovery state
fallback behavior when stream transport is unavailable

Rust tests

role-aware connection slot lifecycle
READY.d.streams[] parsing
role-aware disconnect handling
role-aware DAVE channel selection
IPC parsing for new stream-watch commands

Integration tests

stream discovery -> clankvoxClient -> sessionLifecycle
stream start / stream delete / fallback transitions
ensure stream transport failure does not end the voice session

Live validation checklist

Manual validation should prove, in order:

selfbot receives stream discovery dispatch in VC
STREAM_WATCH yields stream credentials
clankvox opens the stream transport
first video_state arrives
first keyframe decodes to JPEG
ingestStreamFrame accepts it
stream end tears down cleanly without affecting voice session

Risk Register

1. Stream transport may require WebRTC — risk reduced, not removed

~~This was the largest remaining engineering risk.~~

Discord-video-selfbot shows an older sender-side stream connection using protocol: "udp" with the same broad handshake shape as regular voice. That significantly de-risks the plan. The remaining uncertainty is whether modern stream servers on v9 + AEAD + DAVE still accept the same approach. Live testing in Milestone 9 will confirm.

2. Discord ToS and account safety

This plan assumes a selfbot used only for private experimentation in a personal server.

Rollout Sequence

Recommended implementation order:

stream discovery wiring through selfbot gateway (Milestone 3)
- validates whether STREAM_WATCH works at all
- if this fails, the entire plan changes — test the riskiest assumption first
session-state wiring (Milestone 4)
clankvox transport-role refactor (Milestones 1, 2)
IPC commands for stream-watch transport (Milestone 5)
role-aware disconnect, READY parsing, DAVE (Milestones 6, 7, 8)
stream-role connection attempt over UDP (Milestone 9)
end-to-end native watch validation
stream-role WebRTC only if UDP fails (Milestone 10)
wire native watch through screen-watch surface (Milestone 11)
dashboard/runtime snapshot polish
canonical doc cleanup after the implementation stabilizes

Product Language

Prefer:

"native screen watch"
"stream watch transport"
"stream discovery"

Avoid:

exposing transport details to the model
implying native watch is always available

User-facing product language should stay:

"start screen watch"
"watch your screen"
"native screen watch when available"