add flowgraph architecture docs (Phase 1 SDD)

Draft architecture specification for @alkdev/flowgraph — a workflow graph library providing DAG-based orchestration over operations. Covers two graph types (operation graph, call graph), ujsx workflow templates, GraphologyHost and ReactiveHost configs, signal-driven execution, type-compatibility analysis, error hierarchy, and build/distribution. Includes 3 ADRs: ujsx as template IR, DAG-only enforcement, decoupled storage.
2026-05-19 09:36:22 +00:00
parent 333dcd5ac1
commit d2253099ee
13 changed files with 2863 additions and 0 deletions
--- a/docs/architecture/call-graph.md
+++ b/docs/architecture/call-graph.md
@@ -0,0 +1,255 @@
+---
+status: draft
+last_updated: 2026-05-19
+---
+
+# Call Graph (Dynamic Runtime)
+
+The dynamic call graph populated at runtime from call events. Nodes are call invocations with status and timestamps; edges are parent-child and dependency relationships.
+
+## Overview
+
+The call graph is the runtime counterpart to the operation graph. Where the operation graph captures what *can* happen (type compatibility), the call graph captures what *is* happening or *has happened* (running calls, completed calls, failures, aborts).
+
+The call graph is populated automatically by the call protocol — every `call.requested` adds a node, every `call.responded`/`call.error`/`call.aborted` updates its status. This means the call graph is always in sync with the actual state of in-flight calls.
+
+Key capabilities:
+- **Abort cascading** — abort a call → all children are automatically aborted via `parentRequestId` chains
+- **Observability** — query what's running, what failed, what's blocked
+- **DAG operations** — topological sort of running calls, cycle detection (shouldn't happen but verified), reachability queries
+- **Serialization** — `export()`/`fromJSON()` for Postgres persistence
+
+## Construction
+
+### fromCallEvents()
+
+```typescript
+static fromCallEvents(events: CallEventMapValue[]): FlowGraph<CallNodeAttrs, CallEdgeAttrs>
+```
+
+Builds a call graph from an array of call protocol events. Events are processed in order:
+
+1. **`call.requested`** → add a `CallNodeAttrs` node with `status: "pending"`. If `parentRequestId` is set, add a `triggered` edge from parent to child.
+2. **`call.responded`** → update node status to `completed`, set `output` and `completedAt`
+3. **`call.error`** → update node status to `failed`, set `error` and `completedAt`
+4. **`call.aborted`** → update node status to `aborted`, set `completedAt`
+5. **`call.completed`** → update node status to `completed`, set `completedAt` (if not already set by `call.responded`)
+
+Processing is idempotent — processing the same event twice has no effect (the node already has the updated status).
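+
+The event-processing steps above can be sketched as a small fold over the event list. This is a minimal illustration, not the library's implementation: `CallEvent`, `CallNode`, the `at` timestamp field, and `replay` are simplified, hypothetical stand-ins for the real types in `schema.md` and `@alkdev/operations`, and the sketch assumes that any event arriving for an already-terminal node is a no-op.
+
+```typescript
+// Hypothetical, simplified shapes (assumptions, not the real schema).
+type CallStatus = "pending" | "running" | "completed" | "failed" | "aborted";
+
+interface CallNode {
+  requestId: string;
+  operationId: string;
+  status: CallStatus;
+  parentRequestId?: string;
+  completedAt?: string;
+}
+
+type CallEvent =
+  | { type: "call.requested"; requestId: string; operationId: string; parentRequestId?: string }
+  | { type: "call.responded" | "call.error" | "call.aborted" | "call.completed"; requestId: string; at: string };
+
+// Status that each non-request event moves the node to.
+const TERMINAL_STATUS: Record<Exclude<CallEvent["type"], "call.requested">, CallStatus> = {
+  "call.responded": "completed",
+  "call.error": "failed",
+  "call.aborted": "aborted",
+  "call.completed": "completed",
+};
+
+function isTerminal(status: CallStatus): boolean {
+  return status === "completed" || status === "failed" || status === "aborted";
+}
+
+// Folds events, in order, into a node map plus a list of [parent, child] triggered edges.
+function replay(events: CallEvent[]) {
+  const nodes = new Map<string, CallNode>();
+  const triggered: Array<[string, string]> = [];
+  for (const event of events) {
+    if (event.type === "call.requested") {
+      if (nodes.has(event.requestId)) continue; // duplicate request: no effect
+      nodes.set(event.requestId, {
+        requestId: event.requestId,
+        operationId: event.operationId,
+        status: "pending",
+        parentRequestId: event.parentRequestId,
+      });
+      if (event.parentRequestId !== undefined) {
+        triggered.push([event.parentRequestId, event.requestId]);
+      }
+      continue;
+    }
+    const node = nodes.get(event.requestId);
+    if (!node || isTerminal(node.status)) continue; // unknown or already terminal: no effect
+    node.status = TERMINAL_STATUS[event.type];
+    node.completedAt = event.at;
+  }
+  return { nodes, triggered };
+}
+```
+
+Because terminal nodes are skipped, replaying the same `call.responded` twice leaves `completedAt` at its first value, which is one way to get the idempotence described above.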
+
+### Incremental: updateFromEvent()
+
+```typescript
+updateFromEvent(event: CallEventMapValue): void
+```
+
+Updates an existing call graph with a single call event. This is the primary interface for real-time graph population:
+
+```typescript
+const callGraph = new FlowGraph();
+// Subscribe to call protocol events
+pubsub.subscribe("call.requested", (event) => callGraph.updateFromEvent(event));
+pubsub.subscribe("call.responded", (event) => callGraph.updateFromEvent(event));
+pubsub.subscribe("call.error", (event) => callGraph.updateFromEvent(event));
+pubsub.subscribe("call.aborted", (event) => callGraph.updateFromEvent(event));
+pubsub.subscribe("call.completed", (event) => callGraph.updateFromEvent(event));
+```
+
+### fromJSON()
+
+```typescript
+static fromJSON(data: CallGraphSerialized): FlowGraph
+```
+
+Deserializes a call graph from graphology's native JSON format. Used for loading persisted call graphs from Postgres.
+
+## Node Attributes
+
+See [schema.md](schema.md#CallNodeAttrs) for the full schema definition.
+
+| Field | Type | Set by |
+|-------|------|--------|
+| `requestId` | `string` | `call.requested` |
+| `operationId` | `string` | `call.requested` |
+| `status` | `CallStatus` | Updated by each call event |
+| `parentRequestId` | `string?` | `call.requested` |
+| `input` | `unknown` | `call.requested` |
+| `output` | `unknown?` | `call.responded` |
+| `error` | `{ code, message, details? }?` | `call.error` |
+| `identity` | `Identity?` | `call.requested` |
+| `startedAt` | `string?` | `call.requested` (when handler starts) |
+| `completedAt` | `string?` | Terminal event (`responded`, `error`, `aborted`, `completed`) |
+
+The node key is `requestId`.
+
+## Edges
+
+Call graph edges carry an `edgeType` attribute:
+
+| `edgeType` | Meaning | Added by |
+|-----------|---------|----------|
+| `triggered` | Parent call caused child call to execute | `call.requested` with `parentRequestId` |
+| `depends_on` | Data dependency — source needs target's result | Explicit declaration (not auto-populated) |
+
+`depends_on` edges are not auto-populated by the call protocol. They represent data dependencies that aren't captured by the parent-child hierarchy. They may be added by:
+- Workflow template instantiation (the template knows which steps depend on which)
+- Explicit `addDependency(source, target)` calls by the hub coordinator
+
+### Edge Key Convention
+
+`triggered` edges use `${parentRequestId}->${childRequestId}` as the edge key. `depends_on` edges use `${sourceRequestId}->${targetRequestId}:depends_on` to distinguish from `triggered` edges between the same pair.
+
+Since `multi: false`, there can be at most one `triggered` and one `depends_on` edge between the same pair. The edge key convention ensures deterministic keys.
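+
+As an illustration of the key convention, a helper along these lines would produce the keys described above. The function name `edgeKey` is hypothetical and not part of the documented API.
+
+```typescript
+type EdgeType = "triggered" | "depends_on";
+
+// Builds a deterministic edge key; only depends_on edges carry a suffix.
+function edgeKey(source: string, target: string, edgeType: EdgeType): string {
+  const base = `${source}->${target}`;
+  return edgeType === "depends_on" ? `${base}:depends_on` : base;
+}
+```
+
+For a `triggered` edge, `source` is the parent's `requestId` and `target` is the child's.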
+
+## Status Lifecycle
+
+Call node status transitions follow a strict state machine:
+
+```
+                call.requested
+                       │
+                       ▼
+                  ┌─────────┐
+                  │ pending │
+                  └────┬────┘
+                       │
+                handler starts
+                       │
+                       ▼
+                  ┌─────────┐
+                  │ running │
+                  └────┬────┘
+                       │
+        ┌──────────────┼──────────────┐
+        │              │              │
+  call.aborted  call.responded   call.error
+        │      or call.completed      │
+        ▼              ▼              ▼
+   ┌─────────┐   ┌───────────┐   ┌────────┐
+   │ aborted │   │ completed │   │ failed │
+   └─────────┘   └───────────┘   └────────┘
+```
+
+Invalid transitions (e.g., `completed` → `running`) throw `InvalidTransitionError`. The `updateStatus()` method validates the transition before applying it.
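+
+One way to express this validation is a lookup table of allowed target statuses. The sketch below is an assumption-laden illustration: it reads the table literally from the diagram (so `pending` may only move to `running`), treats re-applying the current status as a no-op to stay consistent with idempotent event processing, and uses a hypothetical free function `assertTransition` rather than the real `updateStatus()` method.
+
+```typescript
+type CallStatus = "pending" | "running" | "completed" | "failed" | "aborted";
+
+// Allowed target statuses per source status, read off the diagram above.
+const ALLOWED: Record<CallStatus, ReadonlySet<CallStatus>> = {
+  pending: new Set<CallStatus>(["running"]),
+  running: new Set<CallStatus>(["completed", "failed", "aborted"]),
+  completed: new Set<CallStatus>(),
+  failed: new Set<CallStatus>(),
+  aborted: new Set<CallStatus>(),
+};
+
+class InvalidTransitionError extends Error {
+  constructor(from: CallStatus, to: CallStatus) {
+    super(`Invalid status transition: ${from} -> ${to}`);
+    this.name = "InvalidTransitionError";
+  }
+}
+
+// Throws unless `from -> to` is allowed; same-status updates are accepted (assumption).
+function assertTransition(from: CallStatus, to: CallStatus): void {
+  if (from === to) return;
+  if (!ALLOWED[from].has(to)) {
+    throw new InvalidTransitionError(from, to);
+  }
+}
+```
+
+Keeping the rules in a single table makes it cheap to revisit decisions such as whether `pending` may go straight to a terminal status.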
+
+## Abort Cascading
+
+When a call is aborted, all of its descendants should also be aborted. The call protocol handles this via `call.aborted` events propagating through `parentRequestId` chains.
+
+The call graph supports this with a traversal query:
+
+```typescript
+// Abort cascade: get all descendants of a call
+const descendants = callGraph.descendants(requestId);
+// → all calls that would be affected by aborting this call
+```
+
+The hub coordinator can:
+1. Receive `call.aborted` for a parent call
+2. Query `callGraph.descendants(requestId)` for all descendants
+3. Abort each descendant call via `PendingRequestMap.abort()`
+
+This is a structural operation — the graph provides the "who is affected" information, the protocol provides the "abort them" mechanism.
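+
+The split between "who is affected" and "abort them" can be sketched with a breadth-first walk over `triggered` edges. This is a standalone illustration: `TriggeredEdges` is a hypothetical adjacency map rather than the graphology-backed graph, and the `abort` callback stands in for the protocol-level mechanism.
+
+```typescript
+// Hypothetical adjacency: parent requestId -> child requestIds (the `triggered` edges).
+type TriggeredEdges = Map<string, string[]>;
+
+// Breadth-first collection of every call reachable from `requestId`, excluding itself.
+function descendants(edges: TriggeredEdges, requestId: string): string[] {
+  const result: string[] = [];
+  const queue: string[] = [...(edges.get(requestId) ?? [])];
+  const seen = new Set<string>(queue);
+  while (queue.length > 0) {
+    const current = queue.shift() as string;
+    result.push(current);
+    for (const child of edges.get(current) ?? []) {
+      if (!seen.has(child)) {
+        seen.add(child);
+        queue.push(child);
+      }
+    }
+  }
+  return result;
+}
+
+// The graph answers "who is affected"; the injected callback performs the abort.
+function cascadeAbort(
+  edges: TriggeredEdges,
+  requestId: string,
+  abort: (id: string) => void,
+): string[] {
+  const affected = descendants(edges, requestId);
+  for (const id of affected) {
+    abort(id);
+  }
+  return affected;
+}
+```
+
+Injecting `abort` keeps the traversal free of protocol concerns, mirroring the separation described above.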
+
+## Observability Queries
+
+The call graph supports queries for observability without traversing the entire graph:
+
+| Query | Method | Returns |
+|-------|--------|---------|
+| Get running calls | `filterByStatus("running")` | Node IDs with running status |
+| Get failed calls | `filterByStatus("failed")` | Node IDs with failed status |
+| Get top-level calls | `getRoots()` | Nodes with no `parentRequestId` |
+| Get children of call | `children(requestId)` | Direct children via `triggered` edges |
+| Get call duration | `duration(requestId)` | `completedAt - startedAt` (throws if not completed) |
+| Get call lineage | `lineage(requestId)` | Ancestor chain from root to this call |
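+
+As a rough sketch of two of these queries, `lineage` and `duration` could be written over a plain node map as follows. The `CallNode` shape is a simplified stand-in, and the sketch assumes that timestamps are ISO 8601 strings, that `lineage` includes the call itself, and that `duration` returns milliseconds; none of these details are confirmed by the schema.
+
+```typescript
+// Simplified, hypothetical node shape for illustration.
+interface CallNode {
+  requestId: string;
+  parentRequestId?: string;
+  startedAt?: string;
+  completedAt?: string;
+}
+
+// Walks parentRequestId pointers upward, then reverses to get root-first order.
+function lineage(nodes: Map<string, CallNode>, requestId: string): string[] {
+  const chain: string[] = [];
+  let current = nodes.get(requestId);
+  while (current) {
+    chain.push(current.requestId);
+    current =
+      current.parentRequestId !== undefined ? nodes.get(current.parentRequestId) : undefined;
+  }
+  return chain.reverse();
+}
+
+// Duration in milliseconds; throws if the call has not both started and completed.
+function duration(nodes: Map<string, CallNode>, requestId: string): number {
+  const node = nodes.get(requestId);
+  if (!node || node.startedAt === undefined || node.completedAt === undefined) {
+    throw new Error(`Call ${requestId} has not completed`);
+  }
+  return Date.parse(node.completedAt) - Date.parse(node.startedAt);
+}
+```
+
+The upward walk in `lineage` terminates because the call graph is a DAG, so no call can be its own ancestor.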
+
+### filterByStatus
+
+```typescript
+filterByStatus(status: CallStatus): string[]
+```
+
+Returns all node keys with the given status. Implemented as a filter over `graph.forEachNode()`, so each query is O(n) in the number of nodes. For typical graphs (tens to hundreds of nodes) this is fast. For very large graphs, a status index could be added as an optimization.
+
+### getRoots
+
+```typescript
+getRoots(): string[]
+```
+
+Returns all nodes with `parentRequestId === undefined` (top-level calls). These are the entry points of call chains.
+
+## Serialization and Persistence
+
+```typescript
+const data = callGraph.export();          // graphology native JSON
+callGraph.toJSON();                       // alias for export()
+const restored = FlowGraph.fromJSON(data); // round-trip
+```
+
+The call graph's `export()`/`fromJSON()` boundary is designed for Postgres persistence via the hub's storage layer. Flowgraph does not handle database operations — it provides the serialized format, and the hub handles storage.
+
+Payload fields (`input`, `output`, `error`) are stored as-is in the graph. The hub's storage layer is responsible for truncation and redaction (see `@alkdev/alkhub_ts/docs/architecture/storage/call-graph.md` for the payload handling strategy).
+
+## Mutations
+
+```typescript
+// Add a call node (from call.requested event)
+addCall(attrs: CallNodeAttrs): void
+
+// Update call status (from call.responded/error/aborted/completed event)
+updateStatus(requestId: string, status: CallStatus, extra?: Partial<CallNodeAttrs>): void
+
+// Add a dependency edge (explicit, not auto-populated)
+addDependency(source: string, target: string): void
+
+// Remove a call node and its edges
+removeCall(requestId: string): void
+
+// Update call attributes (partial merge)
+updateCall(requestId: string, attrs: Partial<CallNodeAttrs>): void
+```
+
+`updateStatus` validates the transition. `addDependency` validates that both endpoints exist. `removeCall` removes the node and all attached edges (graphology cascade).
+
+## Constraints
+
+- **DAG-only** — call graphs cannot have cycles. A call cannot be its own ancestor. `addCall` with a `parentRequestId` that would create a cycle throws `CycleError`.
+- **Status transitions are validated** — invalid transitions throw `InvalidTransitionError`.
+- **Node keys are `requestId`** — not `operationId`. Multiple calls to the same operation have different `requestId`s but the same `operationId`.
+- **`parentRequestId` is both node attribute and edge** — denormalized for fast point lookups (node attribute) and traversal queries (edge), following the storage schema pattern.
+- **`depends_on` edges are not auto-populated** — they represent data dependencies that the call protocol doesn't capture. They must be added explicitly by the hub coordinator or workflow template instantiation.
+- **Payload fields are stored as-is** — flowgraph doesn't truncate or redact `input`, `output`, or `error`. That's the hub's responsibility at the persistence boundary.
+- **Small graph sizes** — call graphs at hub level are typically tens of nodes. Performance is a non-issue; O(n) traversals are fine.
+
+## Open Questions
+
+1. **Should the call graph support `call.requested` events with unknown `operationId`?** If a `call.requested` event references an operation not in the registry, should the node be created with `operationId` set to the unknown value? Proposed answer: yes, because the call graph records what happened, not what should have happened. The node starts with `status: "pending"` and may later transition to `"failed"` with an `OPERATION_NOT_FOUND` error code.
+
+2. **Should `depends_on` edges be auto-populated from workflow templates?** When a call graph is instantiated from a workflow template, the template's sequential/parallel structure implies data dependencies. Should the template instantiation automatically create `depends_on` edges? This would couple the call graph to the template system, which may not always be desirable.
+
+3. **Should the call graph support multiple graphs simultaneously (one per workflow execution)?** Currently the design assumes one call graph per `FlowGraph` instance. If the hub needs to track multiple concurrent workflows, it would use multiple instances. An alternative is a single graph with workflow-scoped subgraphs.
+
+4. **Should `filterByStatus` use an index?** For small graphs (tens of nodes), a simple filter is fast. For very large graphs, maintaining a `Map<CallStatus, Set<string>>` index would make finding the set for a status an O(1) lookup, so a query costs only the size of its result. The index would need to be updated on every `addCall()`, `updateStatus()`, and `removeCall()` call.
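+
+To make question 4 concrete, the index could look roughly like the sketch below. `StatusIndex` and its `move` method are hypothetical names used only for illustration; the design does not commit to this shape.
+
+```typescript
+type CallStatus = "pending" | "running" | "completed" | "failed" | "aborted";
+
+// Hypothetical side index: status -> set of requestIds currently in that status.
+class StatusIndex {
+  private byStatus = new Map<CallStatus, Set<string>>();
+
+  // Call when a node is added (no previous status) and on every status update.
+  move(requestId: string, next: CallStatus, previous?: CallStatus): void {
+    if (previous !== undefined) {
+      this.byStatus.get(previous)?.delete(requestId);
+    }
+    let bucket = this.byStatus.get(next);
+    if (!bucket) {
+      bucket = new Set<string>();
+      this.byStatus.set(next, bucket);
+    }
+    bucket.add(requestId);
+  }
+
+  // O(1) bucket lookup; copying the result is linear in the bucket size.
+  filterByStatus(status: CallStatus): string[] {
+    return Array.from(this.byStatus.get(status) ?? []);
+  }
+}
+```
+
+The cost of this approach is that every mutation path has to keep the index in step with the node attributes.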
+
+## References
+
+- Schema: [schema.md](schema.md) — `CallNodeAttrs`, `CallEdgeAttrs`, `CallStatus`, `EdgeType`
+- Call protocol: `@alkdev/alkhub_ts/docs/architecture/call-graph.md`
+- Call graph storage: `@alkdev/alkhub_ts/docs/architecture/storage/call-graph.md`
+- Call event types: `@alkdev/operations/src/call.ts`
+- Taskgraph pattern: `@alkdev/taskgraph_ts/src/graph/construction.ts`