flowgraph/docs/architecture/call-graph.md
glm-5.1 907c33650f fix: architecture review - address 5 critical issues, 6 warnings, 3 suggestions
Critical fixes:
- C1: Create standalone ADR-006 file (edge type consistency),
  extract from open-questions.md inline content
- C2: Convert CallResult from plain interface to TypeBox schema,
  aligning with 'TypeBox as single source of truth' constraint
- C3: Add fromJSON() cycle detection specification - enforce
  ADR-002 DAG invariant even on deserialized input
- C4: Rewrite consumer-integration.md Phase 4 to use ADR-005
  event-append pattern instead of direct signal mutation
- C5: Fix operator precedence bug in consumer-integration.md
  (missing parentheses around OR condition)

Warnings addressed:
- W1: Fix immutability claim - operation graph is 'conventionally
  immutable', not prevented by API
- W2: Add EventLogProjection to reactive exports map
- W3: Add CallResult/CallResultSchema to schema exports map
- W4: Fix reactive-execution.md Level 1 error handling to use
  event-append pattern instead of direct signal mutation
- W5: Remove duplicate dataFlow inference description in schema.md
- W6: Clarify ADR-006 project context (flowgraph vs taskgraph)

Suggestions implemented:
- S1: Add 'reviewed' document lifecycle status between draft/stable,
  update all docs to reviewed status
- S2: Add carve-out note for analysis result types in schema.md
  constraints (they are ephemeral, not serialized)
- S3: Add isComplete() and getAggregateStatus() convenience methods
  to WorkflowReactiveRoot specification
2026-05-21 19:40:45 +00:00


---
status: reviewed
last_updated: 2026-05-22
---
# Call Graph (Dynamic Runtime)
The dynamic call graph populated at runtime from call events. Nodes are call invocations with status and timestamps; edges are parent-child and dependency relationships.
## Overview
The call graph is the runtime counterpart to the operation graph. Where the operation graph captures what *can* happen (type compatibility), the call graph captures what *is* happening or *has happened* (running calls, completed calls, failures, aborts).
The call graph is populated automatically by the call protocol — every `call.requested` adds a node, every `call.responded`/`call.error`/`call.aborted` updates its status. This means the call graph is always in sync with the actual state of in-flight calls.
Key capabilities:
- **Abort cascading** — abort a call → all children are automatically aborted via `parentRequestId` chains
- **Observability** — query what's running, what failed, what's blocked
- **DAG operations** — topological sort of running calls, cycle detection (shouldn't happen but verified), reachability queries
- **Serialization** — `export()`/`fromJSON()` for Postgres persistence
## Construction
### fromCallEvents()
```typescript
static fromCallEvents(events: CallEventMapValue[]): FlowGraph<CallNodeAttrs, CallEdgeAttrs>
```
Builds a call graph from an array of call protocol events. Events are processed in order:
1. **`call.requested`** → add a `CallNodeAttrs` node with `status: "pending"`. If `parentRequestId` is set, add a `triggered` edge from parent to child.
2. **`call.responded`** → update node status to `completed`, set `output` and `completedAt`
3. **`call.error`** → update node status to `failed`, set `error` and `completedAt`
4. **`call.aborted`** → update node status to `aborted`, set `completedAt`
5. **`call.completed`** → update node status to `completed`, set `completedAt` (if not already set by `call.responded`)
Processing is idempotent — processing the same event twice has no effect (the node already has the updated status).
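The event-processing rules above can be sketched as a fold over an event array. This is a minimal illustration, not the real implementation: `foldCallEvents`, the simplified `CallNode` and `CallEvent` shapes, and the `timestamp` field are all assumptions made for the example, and a plain `Map` stands in for the graph. Idempotency is modeled here by ignoring events for nodes that already reached a terminal status.

```typescript
// Hypothetical, simplified stand-ins for the real schema types.
type CallStatus = "pending" | "running" | "completed" | "failed" | "aborted";

interface CallNode {
  requestId: string;
  operationId: string;
  status: CallStatus;
  parentRequestId?: string;
  output?: unknown;
  error?: { code: string; message: string };
  completedAt?: string;
}

// Assumed event shape; the real CallEventMapValue may differ.
type CallEvent =
  | { type: "call.requested"; requestId: string; operationId: string; parentRequestId?: string }
  | { type: "call.responded"; requestId: string; output: unknown; timestamp: string }
  | { type: "call.error"; requestId: string; error: { code: string; message: string }; timestamp: string }
  | { type: "call.aborted"; requestId: string; timestamp: string }
  | { type: "call.completed"; requestId: string; timestamp: string };

const TERMINAL: ReadonlySet<CallStatus> = new Set<CallStatus>(["completed", "failed", "aborted"]);

function foldCallEvents(events: CallEvent[]): Map<string, CallNode> {
  const nodes = new Map<string, CallNode>();
  for (const event of events) {
    if (event.type === "call.requested") {
      // Replaying call.requested must not reset an existing node.
      if (!nodes.has(event.requestId)) {
        const node: CallNode = {
          requestId: event.requestId,
          operationId: event.operationId,
          status: "pending",
        };
        if (event.parentRequestId !== undefined) node.parentRequestId = event.parentRequestId;
        nodes.set(event.requestId, node);
      }
      continue;
    }
    const node = nodes.get(event.requestId);
    // Skip unknown nodes and nodes already in a terminal state (idempotent replay).
    if (!node || TERMINAL.has(node.status)) continue;
    switch (event.type) {
      case "call.responded":
        node.status = "completed";
        node.output = event.output;
        break;
      case "call.error":
        node.status = "failed";
        node.error = event.error;
        break;
      case "call.aborted":
        node.status = "aborted";
        break;
      case "call.completed":
        node.status = "completed";
        break;
    }
    node.completedAt = event.timestamp;
  }
  return nodes;
}
```

Replaying the same event list twice yields the same node map as replaying it once, which is the property the idempotency claim relies on.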
### Incremental: updateFromEvent()
```typescript
updateFromEvent(event: CallEventMapValue): void
```
Updates an existing call graph with a single call event. This is the primary interface for real-time graph population:
```typescript
const callGraph = new FlowGraph();
// Subscribe to call protocol events
pubsub.subscribe("call.requested", (event) => callGraph.updateFromEvent(event));
pubsub.subscribe("call.responded", (event) => callGraph.updateFromEvent(event));
pubsub.subscribe("call.error", (event) => callGraph.updateFromEvent(event));
pubsub.subscribe("call.aborted", (event) => callGraph.updateFromEvent(event));
pubsub.subscribe("call.completed", (event) => callGraph.updateFromEvent(event));
```
### fromJSON()
```typescript
static fromJSON(data: CallGraphSerialized): FlowGraph
```
Deserializes a call graph from graphology's native JSON format. Used for loading persisted call graphs from Postgres.
## Node Attributes
See [schema.md](schema.md#CallNodeAttrs) for the full schema definition.
| Field | Type | Set by |
|-------|------|--------|
| `requestId` | `string` | `call.requested` |
| `operationId` | `string` | `call.requested` |
| `status` | `CallStatus` | Updated by each call event |
| `parentRequestId` | `string?` | `call.requested` |
| `input` | `unknown` | `call.requested` |
| `output` | `unknown?` | `call.responded` |
| `error` | `{ code, message, details? }?` | `call.error` |
| `identity` | `Identity?` | `call.requested` |
| `startedAt` | `string?` | `call.requested` (when handler starts) |
| `completedAt` | `string?` | Terminal event (`responded`, `error`, `aborted`; `completed` if not already set) |
The node key is `requestId`.
## Edges
Call graph edges carry an `edgeType` attribute:
| `edgeType` | Meaning | Added by |
|-----------|---------|----------|
| `triggered` | Parent call caused child call to execute | `call.requested` with `parentRequestId` |
| `depends_on` | Data dependency — source needs target's result | Explicit declaration (not auto-populated) |
`depends_on` edges represent data dependencies between calls. Per [ADR-005](decisions/005-event-log-as-source-of-truth.md), the reactive engine does NOT use `depends_on` edges for data flow — data flows through the result projection (`getResult()`). However, `depends_on` edges remain in the API for **observability and visualization**: they annotate which calls depended on which other calls' results, providing a data-flow overlay on top of the call hierarchy. Hub coordinators or external tools may add `depends_on` edges to annotate observed data flow for debugging, monitoring, or call-graph visualization. They do NOT affect execution.
### Edge Key Convention
`triggered` edges use `${parentRequestId}->${childRequestId}` as the edge key. `depends_on` edges use `${sourceRequestId}->${targetRequestId}:depends_on` to distinguish from `triggered` edges between the same pair.
This composite key format is necessary because edge keys must be unique within a graphology graph. Since a call graph can have both a `triggered` edge (parent→child) and a `depends_on` edge (data dependency) between the same pair of calls, the edge type suffix in the key keeps the two keys distinct. Note that with `multi: false`, graphology allows at most one edge per direction between a given (source, target) pair, so the two edges can coexist between the same pair only when they point in opposite directions. See [schema.md#edge-key-convention](schema.md) for the general key convention and the discussion of multi-edge support.
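A small helper makes the key convention concrete. `edgeKey` is a hypothetical name used only for this sketch; the string formats are the ones given in this section.

```typescript
type EdgeType = "triggered" | "depends_on";

// Builds the edge key for a (source, target) pair.
// `triggered` keys are the bare pair; `depends_on` keys carry a type suffix.
function edgeKey(edgeType: EdgeType, source: string, target: string): string {
  const base = `${source}->${target}`;
  return edgeType === "depends_on" ? `${base}:depends_on` : base;
}
```

For the same pair of request IDs, the two edge types therefore never produce the same key.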
## Status Lifecycle
Call node status transitions follow a strict state machine:
```
           call.requested
             ┌─────────┐
             │ pending │
             └────┬────┘
           handler starts
             ┌─────────┐
        ┌────│ running │────┐
        │    └────┬────┘    │
  call.aborted    │   call.aborted
        │         │         │
        ▼         │         ▼
   ┌─────────┐    │    ┌─────────┐
   │ aborted │    │    │ aborted │
   └─────────┘    │    └─────────┘
        ┌─────────┼─────────┐
        │         │         │
 call.responded   │    call.error
        │         │         │
        ▼         │         ▼
  ┌───────────┐   │    ┌────────┐
  │ completed │   │    │ failed │
  └───────────┘   │    └────────┘
           call.completed
            ┌───────────┐
            │ completed │
            └───────────┘
```
Invalid transitions (e.g., `completed` → `running`) throw `InvalidTransitionError`. The `updateStatus()` method validates the transition before applying it.
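Transition validation can be sketched as a lookup table. The table below is an assumption read off the diagram (`pending` → `running`, then `running` → `completed`, `failed`, or `aborted`); the real table may allow additional transitions, for example directly out of `pending`. `assertTransition` and `ALLOWED` are hypothetical names.

```typescript
type CallStatus = "pending" | "running" | "completed" | "failed" | "aborted";

class InvalidTransitionError extends Error {
  constructor(from: CallStatus, to: CallStatus) {
    super(`Invalid status transition: ${from} -> ${to}`);
    this.name = "InvalidTransitionError";
  }
}

// Assumed transition table; terminal statuses have no outgoing transitions.
const ALLOWED: Record<CallStatus, readonly CallStatus[]> = {
  pending: ["running"],
  running: ["completed", "failed", "aborted"],
  completed: [],
  failed: [],
  aborted: [],
};

// Throws InvalidTransitionError unless `from -> to` is listed in ALLOWED.
function assertTransition(from: CallStatus, to: CallStatus): void {
  if (ALLOWED[from].indexOf(to) === -1) {
    throw new InvalidTransitionError(from, to);
  }
}
```

An `updateStatus()` implementation would call a check like this before writing the new status to the node.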
## Abort Cascading
When a call is aborted, all of its children should also be aborted. The call protocol handles this via `call.aborted` events propagating through `parentRequestId` chains.
The call graph supports this with a traversal query:
```typescript
// Abort cascade: get all descendants of a call
const descendants = callGraph.descendants(requestId);
// → all calls that would be affected by aborting this call
```
The hub coordinator can:
1. Receive `call.aborted` for a parent call
2. Query `callGraph.descendants(requestId)` for all descendants (children, grandchildren, and so on)
3. Abort each child call via `PendingRequestMap.abort()`
This is a structural operation — the graph provides the "who is affected" information, the protocol provides the "abort them" mechanism.
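The "who is affected" half of the cascade is a plain graph traversal. The sketch below computes descendants breadth-first from `parentRequestId` links alone; the `CallRef` shape and the standalone `descendants` function are assumptions for illustration, not the `FlowGraph` method itself.

```typescript
// Hypothetical minimal node shape: only the fields the traversal needs.
interface CallRef {
  requestId: string;
  parentRequestId?: string;
}

// Returns all descendants of `rootId` in breadth-first order (root excluded).
function descendants(nodes: CallRef[], rootId: string): string[] {
  // Index children by parent so each lookup is O(1).
  const childrenOf = new Map<string, string[]>();
  for (const node of nodes) {
    if (node.parentRequestId === undefined) continue;
    const siblings = childrenOf.get(node.parentRequestId) ?? [];
    siblings.push(node.requestId);
    childrenOf.set(node.parentRequestId, siblings);
  }

  const result: string[] = [];
  const seen = new Set<string>([rootId]); // guards against malformed parent chains
  const queue: string[] = [rootId];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const child of childrenOf.get(current) ?? []) {
      if (seen.has(child)) continue;
      seen.add(child);
      result.push(child);
      queue.push(child);
    }
  }
  return result;
}
```

A coordinator would then iterate over the returned IDs and abort each one through the protocol layer.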
## Observability Queries
The call graph supports queries for observability without traversing the entire graph:
| Query | Method | Returns |
|-------|--------|---------|
| Get running calls | `filterByStatus("running")` | Node IDs with running status |
| Get failed calls | `filterByStatus("failed")` | Node IDs with failed status |
| Get top-level calls | `getRoots()` | Nodes with no `parentRequestId` |
| Get children of call | `children(requestId)` | Direct children via `triggered` edges |
| Get call duration | `duration(requestId)` | `completedAt - startedAt` (throws if not completed) |
| Get call lineage | `lineage(requestId)` | Ancestor chain from root to this call |
### filterByStatus
```typescript
filterByStatus(status: CallStatus): string[]
```
Returns all node keys with the given status. Implemented as a filter over `graph.forEachNode()`. For small graphs (tens to hundreds of nodes), this is O(n) and fast. For very large graphs, a status index could be added as an optimization.
### getRoots
```typescript
getRoots(): string[]
```
Returns all nodes with `parentRequestId === undefined` (top-level calls). These are the entry points of call chains.
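The queries in the table above are all simple scans or parent-chain walks. The sketch below implements four of them over a plain `Map` keyed by `requestId`. The function signatures take the node map explicitly, which differs from the method form on `FlowGraph`; the sketch also assumes timestamps are ISO 8601 strings and that `lineage` includes the call itself as the last element.

```typescript
type CallStatus = "pending" | "running" | "completed" | "failed" | "aborted";

// Hypothetical subset of CallNodeAttrs used by the queries.
interface CallNode {
  requestId: string;
  status: CallStatus;
  parentRequestId?: string;
  startedAt?: string;
  completedAt?: string;
}

type NodeMap = Map<string, CallNode>;

// O(n) scan over all nodes.
function filterByStatus(nodes: NodeMap, status: CallStatus): string[] {
  const keys: string[] = [];
  nodes.forEach((node, key) => {
    if (node.status === status) keys.push(key);
  });
  return keys;
}

// Top-level calls have no parentRequestId.
function getRoots(nodes: NodeMap): string[] {
  const keys: string[] = [];
  nodes.forEach((node, key) => {
    if (node.parentRequestId === undefined) keys.push(key);
  });
  return keys;
}

// Walks parent links upward; relies on the DAG invariant to terminate.
function lineage(nodes: NodeMap, requestId: string): string[] {
  const chain: string[] = [];
  let current = nodes.get(requestId);
  while (current) {
    chain.unshift(current.requestId);
    current = current.parentRequestId === undefined ? undefined : nodes.get(current.parentRequestId);
  }
  return chain;
}

// Duration in milliseconds; throws when either timestamp is missing.
function duration(nodes: NodeMap, requestId: string): number {
  const node = nodes.get(requestId);
  if (!node || node.startedAt === undefined || node.completedAt === undefined) {
    throw new Error(`Call ${requestId} has not completed`);
  }
  return Date.parse(node.completedAt) - Date.parse(node.startedAt);
}
```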
## Serialization and Persistence
```typescript
const data = callGraph.export(); // graphology native JSON
callGraph.toJSON(); // alias for export()
const restored = FlowGraph.fromJSON(data); // round-trip
```
The call graph's `export()`/`fromJSON()` boundary is designed for Postgres persistence via the hub's storage layer. Flowgraph does not handle database operations — it provides the serialized format, and the hub handles storage.
Payload fields (`input`, `output`, `error`) are stored as-is in the graph. The hub's storage layer is responsible for truncation and redaction (see `@alkdev/alkhub_ts/docs/architecture/storage/call-graph.md` for the payload handling strategy).
## Mutations
```typescript
// Add a call node (from call.requested event)
// If attrs.parentRequestId is set, also creates a triggered edge from parent to child
addCall(attrs: CallNodeAttrs): void
// Update call status (from call.responded/error/aborted/completed event)
updateStatus(requestId: string, status: CallStatus, extra?: Partial<CallNodeAttrs>): void
// Add a dependency edge (explicit, not auto-populated by call protocol)
// Creates an edge with edgeType: "depends_on"
addDependency(source: string, target: string): void
// Remove a call node and its edges
removeCall(requestId: string): void
// Update call attributes (partial merge)
updateCall(requestId: string, attrs: Partial<CallNodeAttrs>): void
```
`addCall` is the primary entry point for populating the call graph from call events. When `attrs.parentRequestId` is present, it automatically creates a `triggered` edge from the parent to the new node. `updateStatus` validates the transition before applying it. `addDependency` creates explicit `depends_on` edges, which represent data dependencies not captured by the parent-child hierarchy; it validates that both endpoints exist and that the edge would not create a cycle. `removeCall` removes the node and all attached edges (graphology cascade).
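The two checks that `addDependency` performs (endpoints exist, no cycle) can be illustrated with a small adjacency-set structure. `TinyDag` is a hypothetical stand-in for the graphology-backed graph, and the reachability test shown is one common way to detect a would-be cycle, not necessarily the one flowgraph uses: adding `source → target` closes a cycle exactly when `target` can already reach `source`.

```typescript
class CycleError extends Error {
  constructor(source: string, target: string) {
    super(`Edge ${source} -> ${target} would create a cycle`);
    this.name = "CycleError";
  }
}

// Hypothetical minimal DAG: node key -> set of successor keys.
class TinyDag {
  private readonly out = new Map<string, Set<string>>();

  addNode(key: string): void {
    if (!this.out.has(key)) this.out.set(key, new Set<string>());
  }

  hasEdge(source: string, target: string): boolean {
    const successors = this.out.get(source);
    return successors !== undefined && successors.has(target);
  }

  // Iterative depth-first search: can `from` reach `to`?
  private reaches(from: string, to: string): boolean {
    const stack: string[] = [from];
    const seen = new Set<string>();
    while (stack.length > 0) {
      const current = stack.pop() as string;
      if (current === to) return true;
      if (seen.has(current)) continue;
      seen.add(current);
      const successors = this.out.get(current);
      if (successors) successors.forEach((next) => stack.push(next));
    }
    return false;
  }

  addDependency(source: string, target: string): void {
    const successors = this.out.get(source);
    if (successors === undefined || !this.out.has(target)) {
      throw new Error(`Unknown endpoint in ${source} -> ${target}`);
    }
    // Also rejects self-loops, since a node trivially reaches itself.
    if (this.reaches(target, source)) throw new CycleError(source, target);
    successors.add(target);
  }
}
```

Because the check runs before the edge is added, a rejected call leaves the graph unchanged.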
## Constraints
- **DAG-only** — call graphs cannot have cycles. A call cannot be its own ancestor. `addCall` with a `parentRequestId` that would create a cycle throws `CycleError`.
- **Status transitions are validated** — invalid transitions throw `InvalidTransitionError`.
- **Node keys are `requestId`** — not `operationId`. Multiple calls to the same operation have different `requestId`s but the same `operationId`.
- **`parentRequestId` is both node attribute and edge** — denormalized for fast point lookups (node attribute) and traversal queries (edge), following the storage schema pattern.
- **`depends_on` edges are for observability, not execution** — per ADR-005, data dependencies flow through the result projection. `depends_on` edges annotate observed data flow for visualization and debugging. The reactive engine does NOT use them for scheduling or precondition computation. They may be added by hub coordinators or external tools to document which calls depended on which other calls' results.
- **Payload fields are stored as-is** — flowgraph doesn't truncate or redact `input`, `output`, or `error`. That's the hub's responsibility at the persistence boundary.
- **Small graph sizes** — call graphs at hub level are typically tens of nodes. Performance is a non-issue; O(n) traversals are fine.
## Open Questions
1. ~~**Should the call graph support `call.requested` events with unknown `operationId`?**~~ **Resolved (OQ-014)**: Yes — the call graph records what happened, not what should have happened. Nodes with unknown `operationId` get `status: "pending"` and may later transition to `"failed"` with an `OPERATION_NOT_FOUND` error code. This is consistent with the error-handling doc's existing statement about unknown `operationId`. The behavior should be documented explicitly in the `fromCallEvents()` specification: when a `call.requested` event references an `operationId` not in the registry, the node is still created with `status: "pending"` and the given `operationId`. This enables the call graph to serve as a complete audit trail regardless of registry state.
2. ~~**Should `depends_on` edges be auto-populated from workflow templates?**~~ **Resolved (OQ-008/ADR-005)**: No. `depends_on` edges are not needed for execution. Data dependencies are expressed through the result projection: if node B needs node A's output, B reads `getResult("A")` from the result projection. The temporal ordering (A before B) is already expressed by template edges, and the event log is the data transport, so no separate edge type is needed to carry data flow. `depends_on` edges remain in the API only as an observability annotation (see [Edges](#edges)).
3. ~~**Should the call graph support multiple graphs simultaneously?**~~ **Resolved (OQ-015)**: No — one `FlowGraph` instance per graph. Multiple concurrent workflows use multiple instances. This design is simpler and matches graphology's model. Subgraphs would require a scoping mechanism and cross-scope queries that add complexity without benefit at current scale. The hub coordinator creates one `WorkflowReactiveRoot` per workflow, so one `FlowGraph` per workflow is consistent. This is a deliberate "no," not a deferral — if future scale demands require multi-workflow queries, a specialized query layer can aggregate across instances.
4. ~~**Should `filterByStatus` use an index?**~~ **Resolved (OQ-016)**: No — O(n) filter is sufficient for expected graph sizes (tens to hundreds of nodes). A status index would add implementation complexity (maintain on every `updateStatus()`) for no measurable benefit at current scale. If performance becomes an issue with very large graphs, a `Map<CallStatus, Set<string>>` index can be added as an optimization later without changing the public API.
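For reference, the deferred `Map<CallStatus, Set<string>>` index mentioned in item 4 could look roughly like the sketch below. `StatusIndex` is a hypothetical class, not part of the current API; it shows the bookkeeping that every status update would have to perform to keep the index consistent.

```typescript
type CallStatus = "pending" | "running" | "completed" | "failed" | "aborted";

class StatusIndex {
  private readonly byStatus = new Map<CallStatus, Set<string>>();
  private readonly current = new Map<string, CallStatus>();

  // Records a status change, moving the request ID between buckets.
  set(requestId: string, status: CallStatus): void {
    const previous = this.current.get(requestId);
    if (previous !== undefined) {
      const oldBucket = this.byStatus.get(previous);
      if (oldBucket) oldBucket.delete(requestId);
    }
    let bucket = this.byStatus.get(status);
    if (!bucket) {
      bucket = new Set<string>();
      this.byStatus.set(status, bucket);
    }
    bucket.add(requestId);
    this.current.set(requestId, status);
  }

  // Lookup proportional to the bucket size instead of the whole graph.
  filterByStatus(status: CallStatus): string[] {
    const bucket = this.byStatus.get(status);
    return bucket ? Array.from(bucket) : [];
  }
}
```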
## References
- Schema: [schema.md](schema.md) — `CallNodeAttrs`, `CallEdgeAttrs`, `CallStatus`, `EdgeType`
- Call protocol: `@alkdev/alkhub_ts/docs/architecture/call-graph.md`
- Call graph storage: `@alkdev/alkhub_ts/docs/architecture/storage/call-graph.md`
- Call event types: `@alkdev/operations/src/call.ts`
- Taskgraph pattern: `@alkdev/taskgraph_ts/src/graph/construction.ts`