Text Turn Log Dive - Codex Worker Adoption Timeout
Trigger: 1497985713835348038
Time: 2026-04-26 15:40:25 - 15:40:58 UTC
Instance: clanky
Mode: text reply, openai-oauth:gpt-5.5, spawn_code_worker with codex-cli worker
Participants: vuhlp and clanky
LLM calls: 2
Cost: $0.00
Primary files: src/agents/swarmLauncher.ts, src/llm/llmCodexCli.ts, docs/capabilities/code.md
Conversation Timeline
15:40:25 [runtime]: reply_admission_decision allow=true reason=recent_reply_window
15:40:32 [runtime]: reply_pipeline_gate allow=true reason=ready
15:40:32 [runtime]: reply_tool_availability includedToolCount=46 excludedToolCount=0
15:40:37 [llm]: first gpt-5.5 call returned one function_call: spawn_code_worker
15:40:52 [runtime]: swarm_worker_adoption_timeout instance=f6688d88-5a12-4a6a-861f-5390b16a5a55 harness=codex-cli
15:40:52 [runtime]: swarm_worker_exit exitCode=0 cancelled=true cancelReason="adoption timeout"
15:40:57 [llm]: follow-up gpt-5.5 call wrote the timeout reply
15:40:58 [clanky]: I tried to spawn it again and the swarm worker didn't adopt the task within the timeout...
Issues Reported / Observed
- The cwd fix got the request past workspace selection, but the Codex worker never adopted its reserved swarm instance.
- The worker process did not crash with a non-zero exit; Clanky cancelled it after the 15s adoption deadline.
- The tail showed a Codex MCP transport error before any swarm adoption happened.
1. Codex Started, But No Swarm Peer Adopted
Evidence:
| Time | Event | Metadata |
|---|---|---|
| 15:40:37.492 | llm_call | toolCallCount=1, toolNames=spawn_code_worker, functionCallArgumentChars=437 |
| 15:40:52.572 | swarm_worker_adoption_timeout | instanceId=f6688d88-5a12-4a6a-861f-5390b16a5a55, harness=codex-cli, timeoutMs=15000 |
| 15:40:52.584 | swarm_worker_exit | exitCode=0, cancelled=true, cancelReason=adoption timeout |
The adoption timeout tail was:
Reading additional input from stdin...
{"type":"thread.started","thread_id":"019dca73-2daf-7003-b838-363bc69a6e8a"}
{"type":"turn.started"}
2026-04-26T15:40:39.884370Z ERROR rmcp::transport::worker: worker quit with fatal: Transport channel closed, when Deserialize(Error("data did not match any variant of untagged enum JsonRpcMessage", line: 0, column: 0))
Interpretation:
Codex reached turn.started, but at least one MCP worker failed during transport setup. Since the reserved row never flipped to adopted=1, spawnPeer treated this as a launch failure and cancelled the direct child process. The process exit code was 0 because the launcher initiated cancellation after the adoption timeout; the functional failure was the missing swarm adoption.
2. The Failure Was In Codex MCP Launch Isolation
The direct Codex path relied on parent process env plus normal Codex config loading. That had two fragile pieces:
| Fragile piece | Why it mattered |
|---|---|
| User Codex config was loaded | User-level MCP servers were visible to the worker, so unrelated MCP launch/transport failures could poison the session before the swarm server adopted. |
Swarm adoption env was not present in Codex mcp_servers.swarm.env.* overrides | If Codex launched stdio MCP servers with only explicit server env, the swarm child did not receive SWARM_MCP_INSTANCE_ID, SWARM_MCP_DIRECTORY, SWARM_MCP_SCOPE, or SWARM_MCP_FILE_ROOT. |
| Swarm server cwd was implicit | bun run /abs/path/to/src/index.ts works best when the MCP process cwd is the vendored swarm-mcp package root, not the target scratch workspace. |
Code path before the fix:
| File | Behavior |
|---|---|
src/agents/swarmLauncher.ts | built Codex MCP config overrides before reserving the instance row, so dynamic adoption env was unavailable |
src/llm/llmCodexCli.ts | launched codex exec without --ignore-user-config |
src/agents/swarmLauncher.ts | omitted explicit Codex -C <resolved workspace cwd> in the direct child invocation |
Interpretation:
The swarm-mcp server itself was not proven bad. A manual stdio probe of Clanky's vendored mcp-servers/swarm-mcp/src/index.ts responded to MCP initialize. The problem was the launcher contract around Codex: the worker session was not isolated to the intended swarm server config, and the adoption env was not pinned into the server config Codex uses to spawn MCP children.
Fixes Applied
| Issue | Root Cause | Fix | File |
|---|---|---|---|
| Codex workers can load unrelated user MCP servers | codex exec inherited $CODEX_HOME/config.toml | Add --ignore-user-config to Codex code-agent launches while preserving auth through CODEX_HOME | src/llm/llmCodexCli.ts |
| Swarm child may not receive adoption identity | Codex MCP config overrides were generated before Clanky reserved the instance row | Build direct-child Codex overrides after reservation and include mcp_servers.swarm.env.* for SWARM_DB_PATH, SWARM_MCP_INSTANCE_ID, SWARM_MCP_DIRECTORY, SWARM_MCP_SCOPE, SWARM_MCP_FILE_ROOT, and SWARM_MCP_LABEL | src/agents/swarmLauncher.ts |
| Vendored swarm-mcp can launch from the wrong cwd | MCP server cwd was implicit | Add mcp_servers.swarm.cwd=<vendored swarm-mcp root> to Codex overrides | src/agents/swarmLauncher.ts |
| Codex workspace root was only the process cwd | Direct child args omitted Codex's own -C flag | Pass the resolved worker cwd to buildCodexCliCodeAgentArgs | src/agents/swarmLauncher.ts |
Follow-up Resolution: PTY-backed Codex Workers
After the direct-child isolation fixes, Clanky moved the preferred path to swarm-server PTYs so code workers are visible and attachable in swarm-ui. The Codex PTY path had a separate launch race:
| Symptom | Root Cause | Fix | File |
|---|---|---|---|
| Codex PTY adoption timed out with only startup escape sequences in replay | Clanky interpreted the first stale /state snapshot as PTY exit before the server snapshot poller had published the newly-created PTY | Wait for the PTY to appear in /state, or for a short appearance grace period, before treating a missing PTY as exit | src/agents/swarmLauncher.ts |
| Adoption timeout logs had an empty tail for PTY-backed workers | Clanky only had a stdout/stderr ring buffer for direct children | Fetch /pty/:id/replay from swarm-server for PTY-backed diagnostics | src/agents/swarmLauncher.ts, src/agents/swarmServerClient.ts |
| Codex PTY MCP child did not have a stable adoption identity | The interactive Codex process needed the same dynamic swarm env pinned into its MCP server config | Create the worker identity row before launch, pass its instance_id to /pty, and include mcp_servers.swarm.env.* overrides in the Codex args | src/agents/swarmLauncher.ts |
The successful end-to-end verification was a Clanky-spawned Codex worker creating a static app in swarm-test, adopting through the swarm-server PTY path, editing files, and reporting its checks back through the normal worker result flow.
Follow-up Optimization: Installed Skill Discovery
After the PTY adoption path was stable, Clanky reduced worker prompt size by preferring installed swarm-mcp skills over inlining the full vendored skill text. When the skill is reachable through the harness's normal discovery paths, the first-turn prompt now emits a short directive pointing at the on-disk skill. If no reachable install exists, Clanky still inlines the vendored skill body, preserving the old behavior. Setting appendCoordinationPrompt=false disables both delivery modes while leaving Clanky's launcher-specific overlays in place.
Verification
| Check | Result |
|---|---|
codex exec --help | Confirms --ignore-user-config and --skip-git-repo-check are supported for the actual non-interactive worker command |
bun test src/llm/llm.codexCli.test.ts src/agents/swarmLauncher.test.ts | 26 pass, 0 fail |
bun run test | 1540 pass, 0 fail |
bun run test src/agents/swarmLauncher.test.ts | 1549 pass, 0 fail; repo script ran the full non-E2E suite |
| Installed skill discovery optimization | 1551 pass, 0 fail; adds coverage for discovery directive and inline fallback precedence |
cargo test -p swarm-server | 29 pass, 0 fail |
Live spawnPeer Codex PTY probe | Adopted instance 5252ef5c-8198-42b1-9953-8a4a76861546 through PTY f38dffd6-e7be-47e2-8a52-428cc36870a1 |
Open Items / Future Considerations
- Add a cheap preflight that starts only the configured swarm MCP stdio server and verifies MCP
initializebefore launching a paid worker turn. - Log the concrete harness args with sensitive values redacted when adoption times out. The current tail is useful, but it does not say which MCP server produced the transport error.
- Product language: Codex code workers run with a Clanky-owned MCP server config for the spawned swarm peer, not the operator's whole user MCP toolbox. The worker gets the project workspace, the swarm identity, and only the coordination server needed to adopt and do the job.
- Product language: PTY-backed workers are live terminal sessions first. Clanky treats
swarm-serverreplay as the diagnostic source of truth and does not cancel a worker just because the first/statepoll has not caught up yet.
