# Tests
Voice E2E and golden validation suites: see `e2e.md`.
## Default Test Commands

- `bun run test` runs the default unit and integration suite and excludes `.live.test.ts` files.
- `bun run verify` runs the safe local quality gate: lint, typecheck, default tests, docs link check, backend import check, and dashboard production build.
- `bun run check` aliases `bun run verify`.
- `bun run test:e2e` runs all Discord E2E files under `tests/e2e/`, subject to each suite's env/config guards.
- `bun run test:e2e:voice` runs the physical voice harness convenience target (`tests/e2e/voicePhysicalHarness.test.ts`) and sets `RUN_E2E_VOICE_PHYSICAL=1` for its explicit smoke check.
- `bun run test:e2e:text` runs the text E2E suite and sets `RUN_E2E_TEXT=1`.
## Live LLM Tests
These tests make real model calls or use real model CLIs, so they can cost money or consume quota.
### Shared Voice Coverage

The active voice live suites share a single source of truth: `tests/live/shared/voiceLiveScenarios.ts`.
Current shared voice catalog size:

- 69 scenarios total
- 8 scenario groups
- The same 69 scenarios intentionally exercise two different contracts: voice admission in `voiceAdmission.live.test.ts` and voice generation in the voice section of `replyGeneration.live.test.ts`
- The shared corpus currently encodes 35 spoken-reply expectations, 30 `[SKIP]` expectations, and 4 actionable `voiceIntent` expectations

Current group breakdown:

- name detection fast-paths: 4
- join events: 9
- clear engagement: 11
- contextual engagement: 7
- categorical restraint: 8
- command recognition: 3
- music wake latch handling: 3
- eagerness sweeps: 24
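Because both live suites consume the same catalog, expectation counts can be derived directly from the corpus rather than maintained by hand. The sketch below illustrates that idea with a hypothetical scenario shape and a two-item toy corpus; the real types and fields live in `tests/live/shared/voiceLiveScenarios.ts` and may differ.

```typescript
// Hypothetical scenario shape; illustrative only, not the real catalog types.
type VoiceLiveScenario = {
  label: string;
  group: string;
  transcript: string;           // the user utterance under test
  expectReply: boolean;         // false means the model should emit [SKIP]
  expectedVoiceIntent?: string; // present only for actionable voiceIntent cases
};

// Toy corpus standing in for the shared catalog.
const scenarios: VoiceLiveScenario[] = [
  { label: "direct address", group: "clear engagement", transcript: "clanker, what time is it?", expectReply: true },
  { label: "side chatter", group: "categorical restraint", transcript: "anyway, back to the game", expectReply: false },
];

// Counts such as "30 [SKIP] expectations" fall out of the corpus itself.
const skips = scenarios.filter((s) => !s.expectReply).length;
console.log(`${scenarios.length} scenarios, ${skips} [SKIP] expectations`);
```

Keeping the counts derived this way means the numbers quoted above can be regenerated from the catalog whenever scenarios are added or removed.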
Coverage assessment:
- Good breadth for admission and generation behavior across name detection, joins, direct follow-ups, contextual engagement, restraint cases, command turns, wake-latch handling, and eagerness thresholds
- Good alignment because admission and generation consume the same voice inputs
- The shared corpus keeps admission plus generation/intent expectations coupled in one source of truth
- Still not full-stack realtime coverage: these tests do not validate websocket/session transport, ASR streaming, TTS audio output, Discord timing, or end-to-end voice latency
- Still not a full provider matrix by default: the scenarios are broad, but we do not automatically run every scenario against every provider/model combination
### Structured Reply Live Test (Generation LLM only; no classifier)
This exercises the real structured reply contract for both text and voice generation across ACTIVE and AMBIENT situations.
It tests the generation LLM brain only — the classifier admission pipeline is NOT involved.
Each scenario builds a full generation prompt (`buildSystemPrompt` + `buildVoiceTurnPrompt`) and sends it directly to the LLM via `llm.generate()`, then asserts whether the structured output is a real spoken reply or `[SKIP]`.
This answers: "Given this context, does the generation LLM produce the right reply-vs-skip decision?"
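The per-scenario assertion reduces to a small predicate over the structured output. This is an illustrative sketch only: the real suite builds prompts with `buildSystemPrompt` / `buildVoiceTurnPrompt` and calls `llm.generate()`, while here the model call is elided and the output type is a stand-in.

```typescript
// Stand-in for the parsed structured output; the real shape may differ.
type StructuredOutput = { text: string };

// A [SKIP] token means the model chose not to speak.
function isSkip(output: StructuredOutput): boolean {
  return output.text.trim() === "[SKIP]";
}

// A scenario passes when the output matches its reply-vs-skip expectation:
// expected replies must be non-empty and not [SKIP]; expected skips must be [SKIP].
function scenarioPasses(output: StructuredOutput, expectReply: boolean): boolean {
  return expectReply
    ? !isSkip(output) && output.text.trim().length > 0
    : isSkip(output);
}

console.log(scenarioPasses({ text: "[SKIP]" }, false)); // true
console.log(scenarioPasses({ text: "sure, I can queue that up" }, true)); // true
```

The point is that the suite never judges reply quality here; it only checks the reply-vs-skip decision against the shared expectation.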
The voice section uses the shared voice live scenario catalog that is also consumed by `tests/live/voiceAdmission.live.test.ts`, so both suites cover the same voice situations.
The text section also covers:
- tool selection for `web_search`, `web_scrape`, `conversation_search`, and `memory_write`
- a vision turn with inline image input
- raw structured-output validity for representative reply and skip cases
Defaults:
- Text provider defaults to `claude-oauth`
- Voice provider defaults to `claude-oauth`
- The suite includes a small eagerness sweep for both text and voice so we validate low-vs-high participation behavior, not just one-off direct-address cases
- Tool-selection subtests are skipped automatically for providers without tool-call support
- Vision subtests are skipped automatically for providers without multimodal image support
```sh
bun test tests/live/replyGeneration.live.test.ts
```
You can target different providers/models per path:
```sh
TEXT_LLM_PROVIDER=claude-oauth TEXT_LLM_MODEL=claude-sonnet-4-6 \
VOICE_LLM_PROVIDER=claude-oauth VOICE_LLM_MODEL=claude-sonnet-4-6 \
bun test tests/live/replyGeneration.live.test.ts
```

```sh
TEXT_LLM_PROVIDER=openai TEXT_LLM_MODEL=gpt-5-mini \
VOICE_LLM_PROVIDER=anthropic VOICE_LLM_MODEL=claude-haiku-4-5 \
OPENAI_API_KEY=... ANTHROPIC_API_KEY=... \
bun test tests/live/replyGeneration.live.test.ts
```
You can also filter to only one side:
```sh
LIVE_REPLY_FILTER=voice bun test tests/live/replyGeneration.live.test.ts
LIVE_REPLY_FILTER=text bun test tests/live/replyGeneration.live.test.ts
```
Debug visibility:
- `LIVE_REPLY_DEBUG=1` prints provider/model, system prompt, user prompt, raw model output, parsed structured text, and any returned tool calls
### Voice Admission Live Test (Classifier + fast-paths; no generation)
This is the active end-to-end admission suite for voice reply gating. It tests the classifier pipeline (name detection fast-paths plus the YES/NO LLM classifier) via `evaluateVoiceReplyDecision()`. The generation LLM is NOT involved.
The classifier LLM returns YES or NO. The admission pipeline wraps that into allow/deny
(along with deterministic fast-paths that can allow/deny before the classifier runs).
Each scenario asserts on the final allow/deny outcome.
This answers: "Given this context, does the admission pipeline (fast-paths + classifier) correctly gate the turn?"
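The allow/deny wrapping described above can be sketched as a tiny decision function. This is a hedged sketch under assumed names: the real logic lives inside `evaluateVoiceReplyDecision()`, and `FastPath`, `admitTurn`, and the classifier callback here are illustrative stand-ins.

```typescript
// Deterministic fast-paths can settle a turn before the classifier runs;
// null means "no fast-path fired, ask the classifier".
type FastPath = "allow" | "deny" | null;
type Verdict = "YES" | "NO";

function admitTurn(fastPath: FastPath, classify: () => Verdict): "allow" | "deny" {
  // Fast-path result wins outright; the classifier LLM is never called.
  if (fastPath !== null) return fastPath;
  // Otherwise the YES/NO classifier verdict is mapped onto allow/deny.
  return classify() === "YES" ? "allow" : "deny";
}

console.log(admitTurn("allow", () => "NO")); // fast-path wins: "allow"
console.log(admitTurn(null, () => "YES"));   // classifier decides: "allow"
```

Each live scenario then asserts only on the final allow/deny outcome, regardless of whether a fast-path or the classifier produced it.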
It uses the same shared scenario corpus as the voice section of `replyGeneration.live.test.ts`.
Defaults:
- Classifier provider defaults to `claude-oauth`
- The suite covers the shared name-detection, join-event, engagement, restraint, command, wake-latch, and eagerness scenarios
```sh
bun test tests/live/voiceAdmission.live.test.ts
```

```sh
CLASSIFIER_PROVIDER=claude-oauth CLASSIFIER_MODEL=claude-sonnet-4-6 VOICE_ADMISSION_DEBUG=0 bun test tests/live/voiceAdmission.live.test.ts
```

```sh
CLASSIFIER_PROVIDER=claude-oauth CLASSIFIER_MODEL=claude-sonnet-4-6 LABEL_FILTER="event: another person joins" VOICE_ADMISSION_DEBUG=1 bun test tests/live/voiceAdmission.live.test.ts
```
Debug visibility:
- `VOICE_ADMISSION_DEBUG=1` prints the exact classifier system prompt, classifier user prompt, raw classifier output, and parsed decision for each scenario
- `VOICE_CLASSIFIER_DEBUG=1` still works as a compatibility alias for the same live admission debug path
## Replay Test Harnesses
The project includes two offline behavior-validation harnesses:
- Flooding replay harness: `../../scripts/floodingReplayHarness.ts`
- Voice golden harness: `../../scripts/voiceGoldenHarness.ts` (covered in `e2e.md`)
Harness intent:
- Flooding replay evaluates behavior against real conversation history in `data/clanker.db` without running the full Discord runtime loop.
- Voice golden validates voice reply decisions and outputs using curated utterance cases across runtime modes.
### Replay Framework Layout
Flooding replay uses a shared replay framework:
- Engine/runtime loop: `scripts/replay/core/engine.ts`
- Shared DB loading and history priming: `scripts/replay/core/db.ts`
- Shared LLM, judge, metrics, and output modules: `scripts/replay/core/llm.ts`, `scripts/replay/core/judge.ts`, `scripts/replay/core/metrics.ts`, `scripts/replay/core/output.ts`
- Shared helper types/utilities: `scripts/replay/core/types.ts`, `scripts/replay/core/utils.ts`
- Scenario implementations: `scripts/replay/scenarios/*.ts`
- Thin CLI entrypoints: `scripts/*ReplayHarness.ts`
Current flooding wiring:
- Entrypoint: `scripts/floodingReplayHarness.ts`
- Scenario implementation: `scripts/replay/scenarios/flooding.ts`
### Flooding Replay: How It Works
The flooding harness runs this pipeline:
- Parse CLI args and load runtime settings from the SQLite `settings` row where `key = 'runtime_settings'`.
- Query `messages` for replay context and user turns (`since`/`until`/`channel-id` filters).
- Query `actions` for recorded outcomes (`sent_reply`, `sent_message`, `reply_skipped`, `voice_intent_detected`).
- Rebuild turn-by-turn context and detect whether each user turn is addressed to the bot.
- Run the admission gate (`shouldAttemptReplyDecision`) to decide whether a turn is eligible for reply behavior.
- Resolve the turn outcome:
  - `recorded` mode: reuse recorded action rows from DB.
  - `live` mode: call the actor LLM (`buildSystemPrompt` + `buildReplyPrompt`) for admitted turns.
- Accumulate per-channel-mode stats (`initiative` vs `non_initiative`), timeline events, and turn snapshots.
- Optionally run a judge LLM in `live` mode to score flooding for a specific `window-start`/`window-end`.
- Evaluate assertions; any failure sets the process exit code to `1`.
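The recorded-mode portion of that pipeline amounts to walking the replayed turns and tallying outcomes per channel mode. Here is a minimal sketch under assumed names; `Turn`, `replayRecorded`, and the stats shape are illustrative, not the real `scripts/replay/core/engine.ts` API.

```typescript
// Illustrative stand-ins for replayed user turns and recorded action rows.
type Turn = { channelMode: "initiative" | "non_initiative"; addressed: boolean };
type Action = "sent_reply" | "reply_skipped";

// Recorded mode never calls an LLM: it pairs each replayed turn with the
// action row the runtime actually recorded, and accumulates per-mode stats.
function replayRecorded(turns: Turn[], recorded: Action[]) {
  const stats = {
    initiative: { sent: 0, skipped: 0 },
    non_initiative: { sent: 0, skipped: 0 },
  };
  turns.forEach((turn, i) => {
    const bucket = stats[turn.channelMode];
    if (recorded[i] === "sent_reply") bucket.sent += 1;
    else bucket.skipped += 1;
  });
  return stats;
}

const stats = replayRecorded(
  [
    { channelMode: "initiative", addressed: true },
    { channelMode: "non_initiative", addressed: false },
  ],
  ["sent_reply", "reply_skipped"],
);
console.log(JSON.stringify(stats));
```

Live mode differs only at the outcome-resolution step, where admitted turns go to the actor LLM instead of reusing the recorded rows.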
Current limitation:
- The replay harness reads initiative channels from `permissions.replies.discoveryChannelIds`, matching the live runtime.
### Running the Flooding Replay
Recorded replay (no actor LLM calls):
```sh
bun run replay:flooding
# or
bun scripts/floodingReplayHarness.ts --mode recorded
```
Live replay (actor LLM; optional judge):
```sh
bun run replay:flooding:live
# or
bun scripts/floodingReplayHarness.ts --mode live
```
Common scoped run:
```sh
bun scripts/floodingReplayHarness.ts --mode recorded \
  --since 2026-02-27T16:20:00.000Z \
  --until 2026-02-27T16:40:00.000Z \
  --channel-id 1052402898140667906 \
  --max-turns 80 \
  --snapshots-limit 20 \
  --out-json data/replays/flood-window.json
```
Common assertion gates:
```sh
bun scripts/floodingReplayHarness.ts --mode recorded \
  --assert-max-unaddressed-send-rate 15 \
  --assert-max-unaddressed-sends 3 \
  --assert-min-addressed-send-rate 70 \
  --assert-min-addressed-sends 2 \
  --assert-max-sent-turns 12 \
  --fail-on-llm-error
```
### Replay Key Flags
- `--since`, `--until`, `--channel-id`: replay scope.
- `--history-lookback-hours`: extra context before `since`.
- `--max-turns`: cap the number of replayed user turns.
- `--snapshots-limit`: how many turn snapshots to print.
- `--actor-provider`, `--actor-model`: override the actor LLM in live mode.
- `--judge-provider`, `--judge-model`, `--judge`, `--no-judge`: control judge behavior.
- `--window-start`, `--window-end`: timeline window for judge + snapshot focus.
- `--assert-min-llm-calls`: assert minimum actor LLM volume in the replay window.
- `--out-json`: write a machine-readable report.
### Creating a New Replay Harness

Use `scripts/replay/scenarios/flooding.ts` as the scenario template.
- Create `scripts/replay/scenarios/<scenario>.ts`.
- Export a `run<Scenario>ReplayHarness(argv)` function that:
  - parses scenario args,
  - calls `runReplayEngine(...)`,
  - evaluates scenario assertions,
  - prints/writes scenario report output.
- Keep one decision path per mode (`recorded` vs `live`) and remove unused branches.
- Add scenario assertions that encode the behavior the harness should protect.
- Create `scripts/<scenario>ReplayHarness.ts` as a thin entrypoint that calls the scenario runner.
- Optionally add `package.json` scripts: `replay:<scenario>` and `replay:<scenario>:live`.
- Validate with a known window:
- Run recorded first to establish baseline behavior.
- Run live with the same window to compare current model behavior.
- Keep thresholds in CLI assertions so CI/local runs fail fast on regressions.
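The steps above can be sketched as a minimal scenario runner. Everything here is a hypothetical skeleton: `runExampleReplayHarness`, the `Report` shape, and the inline arg parsing stand in for the real `runReplayEngine(...)` call and the project's actual argument handling.

```typescript
// Hypothetical report shape; the real engine's report will differ.
type Report = { turns: number; failures: string[] };

// Skeleton for a scenario runner exported from scripts/replay/scenarios/<scenario>.ts.
// Returns the intended process exit code so the thin CLI entrypoint can just
// do: process.exit(await runExampleReplayHarness(process.argv.slice(2))).
async function runExampleReplayHarness(argv: string[]): Promise<number> {
  // Parse scenario args (illustrative; the real harness has richer parsing).
  const idx = argv.indexOf("--max-turns");
  const maxTurns = idx >= 0 ? Number(argv[idx + 1]) : 50;

  // In the real harness this is where runReplayEngine(...) would run the
  // shared replay loop and produce the report.
  const report: Report = { turns: Math.min(maxTurns, 50), failures: [] };

  // Scenario assertions encode the behavior the harness protects;
  // any failure maps to a non-zero exit code so CI fails fast.
  if (report.turns === 0) report.failures.push("no turns replayed");
  return report.failures.length > 0 ? 1 : 0;
}

runExampleReplayHarness(["--max-turns", "10"]).then((code) => console.log(code));
```

Keeping the runner's assertions inside the scenario module (rather than the CLI wrapper) is what lets the recorded and live modes share one failure contract.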
