clankvox Architecture
This document is the transport/media-plane view of the system.
Ownership Boundary
Bun owns:
- the selfbot gateway session
- stream discovery dispatch handling
- session orchestration and product logic
- tools, prompts, settings, and commentary decisions
- decoding native watch frames into JPEG for the higher-level screen-watch pipeline
clankvox owns:
- Discord voice and stream-server sockets
- UDP/RTP send and receive
- codec advertisement and media framing
- DAVE lifecycle and media encryption/decryption
- capture events and media telemetry
- TTS/music playback pacing
- outbound native Go Live video packetization
That split is important. clankvox should stay transport-native and deterministic. Bun should stay agentic and product-facing.
Process Model
The entrypoint in ../src/main.rs creates one long-lived AppState and drives it from a single async event loop.
The loop reacts to five sources:
- inbound IPC messages from Bun
VoiceEventmessages from active transport connectionsMusicEventmessages from the ffmpeg/yt-dlp music pipeline- reconnect deadlines
- a 20ms tick used for audio send cadence and publish-frame draining
That shape keeps transport logic serialized through AppState even though lower-level tasks are running concurrently behind channels.
Core State
../src/app_state.rs is the shared spine. It holds:
- primary voice connection and pending voice connect inputs
stream_watchconnection state and its own DAVE slotstream_publishconnection state and its own DAVE slot- audio send state for outbound voice playback
- per-user capture state and speaking state
- remote video state and active video subscriptions
- music pipeline state
- stream publish runtime state
- reconnect bookkeeping
The important architectural choice is that each transport role has its own connection slot and its own DAVE manager. Go Live is not modeled as “extra fields on the main voice socket.”
Transport Roles
voice
The main voice leg for:
- join/leave
- speaking state
- inbound user audio capture
- outbound TTS and music
stream_watch
A separate stream-server connection used only for inbound Go Live receive:
- connects with
rtc_server_idand stream credentials from Bun - receives remote OP12/OP18 video state
- decrypts video and forwards encoded frames to Bun over IPC
- never owns the main audio session
stream_publish
A separate stream-server connection used only for outbound Go Live send:
- connects with self-owned stream credentials from Bun
- advertises sender-side H264 support
- announces video state to Discord
- packetizes and transmits outbound H264 access units
Supervisor Split
The code is organized around four operational surfaces:
Connection Supervisor
../src/connection_supervisor.rs
Owns:
- join / destroy
- connect and disconnect commands for all roles
- role-specific reconnect handling
- connection teardown when session metadata changes
Capture Supervisor
Owns:
- inbound speaking and audio events
- video subscription state
- remote video state merge/update logic
- transport-ready hooks for
stream_watchandstream_publish
Playback Supervisor
Owns:
- audio playback commands from Bun
- music lifecycle events
- queue draining on the 20ms tick
- buffer depth and TTS playback telemetry
- dispatch of one pending stream-publish frame per tick
Stream Publish Runtime
Owns:
- ffmpeg/yt-dlp sender pipeline
- raw H264 access-unit extraction
- pause/resume/stop handling for the sender subprocess
- sender runtime events and frame queueing into
AppState
IPC Contract
../src/ipc.rs is the Bun contract.
Inbound commands are grouped into four conceptual families:
- connection: join, voice server/state updates, stream-watch connect/disconnect, stream-publish connect/disconnect, destroy
- capture: subscribe/unsubscribe user audio and user video
- playback: TTS audio, stop playback, music play/pause/resume/stop/gain
- stream publish runtime: play, pause, resume, stop
Outbound events are also grouped:
- adapter / connection state
- transport state per role
- speaking and user audio capture
- user video state, frames, and end events
- playback and music lifecycle
- buffer depth and TTS playback telemetry
- structured IPC errors
Module Map
- ../src/voice_conn.rs: Discord transport and protocol heavy lifting
- ../src/dave.rs: DAVE session management and codec-aware encrypt/decrypt helpers
- ../src/audio_pipeline.rs: PCM buffering and playback helpers
- ../src/music.rs: music pipeline subprocess management
- ../src/video.rs: inbound video depacketization and state helpers
- ../src/ipc_protocol.rs: routing inbound IPC into command groups
- ../src/ipc_router.rs: dispatches routed commands into supervisors
Why The Architecture Looks This Way
The big driver is DAVE and Go Live.
Songbird-level abstractions were not sufficient because:
- DAVE control opcodes live on the voice WebSocket
- media encryption/decryption has to be coordinated with codec framing
- Go Live uses a second stream-server connection with different state and lifecycle needs
- sender and receiver roles need different codec and announcement behavior
That is why clankvox is a custom transport layer instead of a thin wrapper around an off-the-shelf Discord voice library.
