---
status: draft
last_updated: 2026-04-19
---

# Coordination Operations

## Overview

Coordination operations manage multi-agent workflows: spawning sessions, inter-session messaging, status tracking, and anomaly detection. These are hub operations in the registry, backed by Postgres and Redis.

## Architecture

### State: Postgres Tables

Coordination operations use the following tables in the hub's storage layer (see `storage/coordination.md` for the full schema definitions):

- **`mappings`** — Worktree/session/coordinator relationships. Links each spawned session to its parent coordinator, spoke, git branch, and assigned task. Status: `active`, `completed`, `aborted`, `failed`.
- **`detections`** — Anomaly detection records. Links detection events to sessions with severity and details.
- **`tasks`** + **`task_dependencies`** — SDD task definitions and their dependency edges. The coordinator queries task status to determine next work. See `storage/tasks.md` for the full task storage design.

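As a rough illustration of what a `mappings` row carries, the sketch below models it as a TypeScript type. Only the four status values come from the description above; the field names (`sessionId`, `parentSessionId`, `spokeId`, `branch`, `taskId`) and the notion of a "terminal" status are assumptions made for this example. The authoritative schema is the one in `storage/coordination.md`.

```typescript
// Hypothetical shape of a `mappings` row; field names are illustrative only.
type MappingStatus = "active" | "completed" | "aborted" | "failed";

interface Mapping {
  sessionId: string; // the spawned session
  parentSessionId: string; // the coordinator session that spawned it
  spokeId: string; // spoke hosting the worktree (assumed identifier)
  branch: string; // git branch backing the worktree
  taskId: string; // assigned task
  status: MappingStatus;
}

// Assumption of this sketch: every status other than "active" is terminal.
function isTerminal(status: MappingStatus): boolean {
  return status !== "active";
}

// Select the children of one coordinator that are still running.
function activeChildren(rows: Mapping[], parentSessionId: string): Mapping[] {
  return rows.filter(
    (m) => m.parentSessionId === parentSessionId && !isTerminal(m.status),
  );
}
```
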
### Operations

#### `coord.spawn` — Create Worktree + Session

1. `env.git.worktreeCreate({ name, branch })` — create worktree (via call protocol)
2. `env.opencode.sessionCreate({ directory, title })` — create session (via call protocol)
3. Insert into `mappings` table (with `taskId` referencing the assigned task)
4. `env.opencode.sessionPromptAsync({ sessionId, prompt, agent })` — send initial prompt (via call protocol)
5. Publish `coord.spawned` event to Redis

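The five spawn steps can be sketched as one async function with its collaborators injected. This is a minimal sketch, not the hub's implementation: the `SpawnDeps` interface, the return shapes (`{ directory }`, `{ id }`), the `insertMapping` and `publish` helpers, and the event payload are all assumptions for illustration. Only the operation names, their argument names, and the step order come from the list above.

```typescript
// Hypothetical collaborators; in the hub these would be reached via the call
// protocol, Postgres, and Redis respectively.
interface SpawnDeps {
  worktreeCreate(args: { name: string; branch: string }): Promise<{ directory: string }>;
  sessionCreate(args: { directory: string; title: string }): Promise<{ id: string }>;
  sessionPromptAsync(args: { sessionId: string; prompt: string; agent: string }): Promise<void>;
  insertMapping(row: {
    sessionId: string;
    parentSessionId: string;
    branch: string;
    taskId: string;
    status: "active";
  }): Promise<void>;
  publish(event: string, payload: unknown): Promise<void>;
}

interface SpawnInput {
  name: string;
  branch: string;
  title: string;
  prompt: string;
  agent: string;
  parentSessionId: string;
  taskId: string;
}

async function coordSpawn(deps: SpawnDeps, input: SpawnInput): Promise<string> {
  // 1. Create the worktree.
  const { directory } = await deps.worktreeCreate({ name: input.name, branch: input.branch });
  // 2. Create a session rooted in that worktree.
  const session = await deps.sessionCreate({ directory, title: input.title });
  // 3. Record the relationship, including the assigned task.
  await deps.insertMapping({
    sessionId: session.id,
    parentSessionId: input.parentSessionId,
    branch: input.branch,
    taskId: input.taskId,
    status: "active",
  });
  // 4. Send the initial prompt.
  await deps.sessionPromptAsync({ sessionId: session.id, prompt: input.prompt, agent: input.agent });
  // 5. Announce the spawn.
  await deps.publish("coord.spawned", { sessionId: session.id, taskId: input.taskId });
  return session.id;
}
```

The sketch does not handle partial failure (for example, a worktree that was created before the session call failed); how the hub cleans up in that case is not specified here.
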
#### `coord.status` — Query Spawned Session Status

1. Query `mappings` table for children of parent session
2. For each mapping, `env.opencode.sessionStatus({ sessionId })` (via call protocol)
3. Return aggregated status

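One way to read "aggregated status" is a list with one entry per child session. The sketch below fans out the status calls concurrently and keeps going if a single child cannot be reached. The `StatusDeps` interface, the `{ status }` return shape, and the `"unreachable"` marker are assumptions of this example, not part of the opencode API.

```typescript
// Hypothetical dependencies: `childrenOf` stands in for the `mappings` query,
// `sessionStatus` for the opencode operation reached via the call protocol.
interface StatusDeps {
  childrenOf(parentSessionId: string): Promise<Array<{ sessionId: string; taskId: string }>>;
  sessionStatus(args: { sessionId: string }): Promise<{ status: string }>;
}

interface ChildStatus {
  sessionId: string;
  taskId: string;
  status: string; // "unreachable" is this sketch's own marker, not an opencode status
}

async function coordStatus(deps: StatusDeps, parentSessionId: string): Promise<ChildStatus[]> {
  const children = await deps.childrenOf(parentSessionId);
  // Query all children concurrently; a failing child does not fail the whole call.
  return Promise.all(
    children.map(async (child): Promise<ChildStatus> => {
      try {
        const { status } = await deps.sessionStatus({ sessionId: child.sessionId });
        return { ...child, status };
      } catch {
        return { ...child, status: "unreachable" };
      }
    }),
  );
}
```
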
#### `coord.message` — Send Message to Spawned Session

1. `env.opencode.sessionPromptAsync({ sessionId, message, agent })` (via call protocol)
2. Publish `coord.messaged` event to Redis

#### `coord.notify` — Notify Coordinator

1. Look up mapping to find `parentSessionId`
2. `env.opencode.sessionPromptAsync({ sessionId: parentSessionId, message: formattedNotification })` (via call protocol)
3. Publish `coord.notified` event to Redis with level (info/warning/blocking)

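The notify steps amount to turning a child's report into a prompt for its parent plus an event. A pure function can compute both before any I/O happens. In this sketch the message format, the `NotifyPlan` shape, and the `fromSessionId` event field are invented for illustration; only the three levels, the event name, and the routing to `parentSessionId` come from the steps above.

```typescript
type NotifyLevel = "info" | "warning" | "blocking";

interface NotifyPlan {
  // Arguments for sessionPromptAsync, addressed to the coordinator.
  prompt: { sessionId: string; message: string };
  // Event to publish to Redis afterwards (payload fields are assumed).
  event: { name: "coord.notified"; level: NotifyLevel; fromSessionId: string };
}

function planNotify(
  mapping: { sessionId: string; parentSessionId: string },
  level: NotifyLevel,
  text: string,
): NotifyPlan {
  // Hypothetical notification format: level tag, originating session, free text.
  const message = `[${level.toUpperCase()}] session ${mapping.sessionId}: ${text}`;
  return {
    prompt: { sessionId: mapping.parentSessionId, message },
    event: { name: "coord.notified", level, fromSessionId: mapping.sessionId },
  };
}
```
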
#### `coord.abort` — Abort Spawned Session

1. `env.opencode.sessionAbort({ sessionId })` (via call protocol)
2. Update mapping status to "aborted"
3. Publish `coord.aborted` event to Redis

### opencode REST Operations via FromOpenAPI

Each coordination operation that interacts with an opencode container calls through the operations generated by `FromOpenAPI` from opencode's server spec:

```
opencode.sessionCreate      → POST /session
opencode.sessionPromptAsync → POST /session/{id}/prompt_async
opencode.sessionStatus      → GET  /session/{id}/status
opencode.sessionAbort       → POST /session/{id}/abort
opencode.sessionMessages    → GET  /session/{id}/messages
```

These operations are auto-generated and type-safe. No manual HTTP client code. The SSE fix in `from_openapi.ts` (async generator for SUBSCRIPTION endpoints) makes the streaming endpoints work through our call protocol.

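To make the name-to-route mapping concrete, the sketch below hard-codes the five routes listed above and fills in the `{id}` path parameter. It is an illustration of the mapping only and does not reflect how `FromOpenAPI` works internally; the `routes` table and `resolveRoute` helper are hypothetical names.

```typescript
type HttpMethod = "GET" | "POST";

interface Route {
  method: HttpMethod;
  path: string; // OpenAPI-style template, e.g. "/session/{id}/status"
}

// Hand-written copy of the mapping above, for illustration only.
const routes: Record<string, Route> = {
  "opencode.sessionCreate": { method: "POST", path: "/session" },
  "opencode.sessionPromptAsync": { method: "POST", path: "/session/{id}/prompt_async" },
  "opencode.sessionStatus": { method: "GET", path: "/session/{id}/status" },
  "opencode.sessionAbort": { method: "POST", path: "/session/{id}/abort" },
  "opencode.sessionMessages": { method: "GET", path: "/session/{id}/messages" },
};

// Resolve an operation name and its path parameters to a method and URL path.
function resolveRoute(
  name: string,
  params: Record<string, string> = {},
): { method: HttpMethod; url: string } {
  const route = routes[name];
  if (route === undefined) throw new Error(`unknown operation: ${name}`);
  const url = route.path.replace(/\{(\w+)\}/g, (_match: string, key: string) => {
    const value = params[key];
    if (value === undefined) throw new Error(`missing path parameter: ${key}`);
    return encodeURIComponent(value);
  });
  return { method: route.method, url };
}
```
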
### How Agents Call Coordination Operations

Agents in opencode containers call hub operations via MCP — not through a plugin:

```
Agent in opencode container
│
├── MCP search({ q: "coord" }) → finds coord.*, hub.list, hub.call, etc.
├── MCP call({ tool: "coord.notify" }) → reports task finished, blocked, or messages coordinator
├── MCP call({ tool: "coord.status" }) → checks on sibling sessions
└── MCP call({ tool: "coord.abort" }) → aborts a stuck session
```

The hub's MCP endpoint is configured when the opencode container is set up (in `opencode.json` MCP servers). The agent discovers and calls coordination tools the same way it discovers any other tool — via the MCP `search`/`schema`/`call` pattern. No plugin needed.

## Anomaly Detection

The hub monitors sessions via Redis events and runs detection heuristics. It:

1. Subscribes to the Redis `alk:events:message.part.updated:*` and `alk:events:session.status:*` channels
2. Maintains in-memory metrics per monitored session (tool errors, malformed tools, last activity, status)
3. Checks for stalls periodically (every 30s)
4. When a threshold is exceeded, stores a detection in the `detections` table and publishes a `coord.detection` event

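The per-session metrics in step 2 can be kept in a plain map that is updated as events arrive. The sketch below assumes the Redis messages have already been parsed into a small normalized event type; that `MonitorEvent` shape, the `"unknown"` initial status, and the rule that every event counts as activity are assumptions of this example, since the actual event payloads are not described here.

```typescript
interface SessionMetrics {
  toolErrors: number;
  malformedTools: number;
  lastActivity: number; // ms since epoch of the most recent event
  status: string;
}

// Hypothetical normalized events derived from the two Redis channel families.
type MonitorEvent =
  | { kind: "tool_error"; sessionId: string; at: number }
  | { kind: "malformed_tool"; sessionId: string; at: number }
  | { kind: "activity"; sessionId: string; at: number }
  | { kind: "status"; sessionId: string; at: number; status: string };

const metricsBySession = new Map<string, SessionMetrics>();

function recordEvent(event: MonitorEvent): SessionMetrics {
  let m = metricsBySession.get(event.sessionId);
  if (m === undefined) {
    m = { toolErrors: 0, malformedTools: 0, lastActivity: event.at, status: "unknown" };
    metricsBySession.set(event.sessionId, m);
  }
  m.lastActivity = event.at; // assumption: any event counts as activity
  switch (event.kind) {
    case "tool_error":
      m.toolErrors += 1;
      break;
    case "malformed_tool":
      m.malformedTools += 1;
      break;
    case "status":
      m.status = event.status;
      break;
    case "activity":
      break;
  }
  return m;
}
```
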
Detections are queryable via `coord.detect`:

```
coord.detect({ sessionIDs?: string[] }) → Array<{ sessionId, issues, severity }>
```

### Detection Heuristics

These heuristics are validated patterns for catching common agent session failures:

| Anomaly Type | Trigger | Default Threshold | Severity |
|-------------|---------|-------------------|----------|
| MODEL_DEGRADATION | Malformed tool calls detected | ≥1 malformed tool | High |
| HIGH_ERROR_COUNT | Tool errors accumulating | ≥5 tool errors | Medium |
| SESSION_STALL | No activity while busy | >60s no activity | Medium |

Each heuristic needs only simple per-session counters and timers, maintained from the Redis event stream. Detection follows a pull model: the coordinator calls `coord.detect` on demand rather than being interrupted by push notifications.

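The three heuristics reduce to threshold checks over a per-session snapshot. The following is a minimal sketch using the default thresholds from the table; the `Snapshot` shape, the literal `"busy"` status string, and the rule that the overall severity is the highest severity among the issues are assumptions for illustration.

```typescript
type Severity = "High" | "Medium";
type AnomalyType = "MODEL_DEGRADATION" | "HIGH_ERROR_COUNT" | "SESSION_STALL";

interface Snapshot {
  sessionId: string;
  toolErrors: number;
  malformedTools: number;
  lastActivity: number; // ms since epoch
  status: string;
}

interface Issue {
  type: AnomalyType;
  severity: Severity;
}

// Default thresholds from the table above.
const THRESHOLDS = { malformedTools: 1, toolErrors: 5, stallMs: 60000 };

function detectIssues(s: Snapshot, now: number): Issue[] {
  const issues: Issue[] = [];
  if (s.malformedTools >= THRESHOLDS.malformedTools) {
    issues.push({ type: "MODEL_DEGRADATION", severity: "High" });
  }
  if (s.toolErrors >= THRESHOLDS.toolErrors) {
    issues.push({ type: "HIGH_ERROR_COUNT", severity: "Medium" });
  }
  // A stall only counts while the session claims to be busy.
  if (s.status === "busy" && now - s.lastActivity > THRESHOLDS.stallMs) {
    issues.push({ type: "SESSION_STALL", severity: "Medium" });
  }
  return issues;
}

// Shape one result entry like coord.detect's return type; returns null when clean.
function summarize(
  s: Snapshot,
  now: number,
): { sessionId: string; issues: Issue[]; severity: Severity } | null {
  const issues = detectIssues(s, now);
  if (issues.length === 0) return null;
  const severity: Severity = issues.some((i) => i.severity === "High") ? "High" : "Medium";
  return { sessionId: s.sessionId, issues, severity };
}
```

Because the stall threshold is a strict "more than 60s", a busy session whose last event was exactly 60 seconds ago is not yet flagged in this sketch.
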
## Provenance

The coordination operations design (spawn/message/notify/abort/detect) and detection heuristics (model degradation, high error count, session stall) are validated patterns from prior work. The alkhub_ts implementation uses the call protocol and Postgres persistence rather than single-process file-based state.