add flowgraph architecture docs (Phase 1 SDD)
Draft architecture specification for @alkdev/flowgraph — a workflow graph library providing DAG-based orchestration over operations. Covers two graph types (operation graph, call graph), ujsx workflow templates, GraphologyHost and ReactiveHost configs, signal-driven execution, type-compatibility analysis, error hierarchy, and build/distribution. Includes 3 ADRs: ujsx as template IR, DAG-only enforcement, decoupled storage.
255
docs/architecture/call-graph.md
Normal file
@@ -0,0 +1,255 @@
---
status: draft
last_updated: 2026-05-19
---

# Call Graph (Dynamic Runtime)

The dynamic call graph is populated at runtime from call events. Nodes are call invocations with status and timestamps; edges are parent-child and dependency relationships.

## Overview

The call graph is the runtime counterpart to the operation graph. Where the operation graph captures what *can* happen (type compatibility), the call graph captures what *is* happening or *has happened* (running calls, completed calls, failures, aborts).

The call graph is populated automatically by the call protocol: every `call.requested` adds a node, and every `call.responded`/`call.error`/`call.aborted`/`call.completed` updates its status. This keeps the call graph in sync with the actual state of in-flight calls.

Key capabilities:

- **Abort cascading**: aborting a call aborts all of its descendants, found via `parentRequestId` chains
- **Observability**: query what's running, what failed, what's blocked
- **DAG operations**: topological sort of running calls, cycle detection (cycles should not occur, but this is verified), reachability queries
- **Serialization**: `export()`/`fromJSON()` for Postgres persistence

## Construction

### fromCallEvents()

```typescript
static fromCallEvents(events: CallEventMapValue[]): FlowGraph<CallNodeAttrs, CallEdgeAttrs>
```

Builds a call graph from an array of call protocol events. Events are processed in order:

1. **`call.requested`** → add a `CallNodeAttrs` node with `status: "pending"`. If `parentRequestId` is set, add a `triggered` edge from parent to child.
2. **`call.responded`** → update node status to `completed`, set `output` and `completedAt`
3. **`call.error`** → update node status to `failed`, set `error` and `completedAt`
4. **`call.aborted`** → update node status to `aborted`, set `completedAt`
5. **`call.completed`** → update node status to `completed`, set `completedAt` (if not already set by `call.responded`)

Processing is idempotent: applying the same event twice has no further effect, because the node already has the updated status.
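
The processing rules above can be sketched as a small reducer. This is a minimal illustration, not the library's implementation: the `CallEvent` and `CallNode` shapes, the `at` timestamp field, and the plain `Map`-based `SketchGraph` are simplified stand-ins assumed for the example (the real types are `CallEventMapValue` and `CallNodeAttrs`). The sketch also assumes that any event arriving for a node already in a terminal status is ignored, which is one way to obtain the idempotence described above.

```typescript
// Hypothetical, simplified shapes; the real CallEventMapValue / CallNodeAttrs differ.
type CallStatus = "pending" | "running" | "completed" | "failed" | "aborted";
type CallError = { code: string; message: string };

type CallEvent =
  | { type: "call.requested"; requestId: string; operationId: string; parentRequestId?: string; input: unknown }
  | { type: "call.responded"; requestId: string; output: unknown; at: string }
  | { type: "call.error"; requestId: string; error: CallError; at: string }
  | { type: "call.aborted"; requestId: string; at: string }
  | { type: "call.completed"; requestId: string; at: string };

interface CallNode {
  requestId: string;
  operationId: string;
  status: CallStatus;
  parentRequestId?: string;
  input: unknown;
  output?: unknown;
  error?: CallError;
  completedAt?: string;
}

interface TriggeredEdge { source: string; target: string; edgeType: "triggered" }

interface SketchGraph {
  nodes: Map<string, CallNode>;
  edges: Map<string, TriggeredEdge>;
}

const TERMINAL: CallStatus[] = ["completed", "failed", "aborted"];

// Apply one event. Replayed requests and events for terminal nodes are no-ops.
function applyEvent(graph: SketchGraph, event: CallEvent): void {
  if (event.type === "call.requested") {
    if (graph.nodes.has(event.requestId)) return; // replayed event
    const created: CallNode = {
      requestId: event.requestId,
      operationId: event.operationId,
      status: "pending",
      input: event.input,
    };
    if (event.parentRequestId !== undefined) {
      created.parentRequestId = event.parentRequestId;
      // `triggered` edge from parent to child
      graph.edges.set(`${event.parentRequestId}->${event.requestId}`, {
        source: event.parentRequestId,
        target: event.requestId,
        edgeType: "triggered",
      });
    }
    graph.nodes.set(event.requestId, created);
    return;
  }

  const node = graph.nodes.get(event.requestId);
  if (node === undefined || TERMINAL.indexOf(node.status) !== -1) return;

  node.completedAt = event.at;
  switch (event.type) {
    case "call.responded":
      node.status = "completed";
      node.output = event.output;
      break;
    case "call.error":
      node.status = "failed";
      node.error = event.error;
      break;
    case "call.aborted":
      node.status = "aborted";
      break;
    case "call.completed":
      node.status = "completed";
      break;
  }
}

// Build a graph by applying the events in order.
function fromCallEvents(events: CallEvent[]): SketchGraph {
  const graph: SketchGraph = { nodes: new Map(), edges: new Map() };
  for (const event of events) applyEvent(graph, event);
  return graph;
}
```

Because the first terminal event wins in this sketch, a duplicate `call.responded` or a later `call.completed` leaves `output` and `completedAt` untouched.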

### Incremental: updateFromEvent()

```typescript
updateFromEvent(event: CallEventMapValue): void
```

Updates an existing call graph with a single call event. This is the primary interface for real-time graph population:

```typescript
const callGraph = new FlowGraph();

// Subscribe to call protocol events
pubsub.subscribe("call.requested", (event) => callGraph.updateFromEvent(event));
pubsub.subscribe("call.responded", (event) => callGraph.updateFromEvent(event));
pubsub.subscribe("call.error", (event) => callGraph.updateFromEvent(event));
pubsub.subscribe("call.aborted", (event) => callGraph.updateFromEvent(event));
pubsub.subscribe("call.completed", (event) => callGraph.updateFromEvent(event));
```

### fromJSON()

```typescript
static fromJSON(data: CallGraphSerialized): FlowGraph
```

Deserializes a call graph from graphology's native JSON format. Used for loading persisted call graphs from Postgres.

## Node Attributes

See [schema.md](schema.md#CallNodeAttrs) for the full schema definition.

| Field | Type | Set by |
|-------|------|--------|
| `requestId` | `string` | `call.requested` |
| `operationId` | `string` | `call.requested` |
| `status` | `CallStatus` | Updated by each call event |
| `parentRequestId` | `string?` | `call.requested` |
| `input` | `unknown` | `call.requested` |
| `output` | `unknown?` | `call.responded` |
| `error` | `{ code, message, details? }?` | `call.error` |
| `identity` | `Identity?` | `call.requested` |
| `startedAt` | `string?` | `call.requested` (when handler starts) |
| `completedAt` | `string?` | Terminal event (`responded`, `error`, `aborted`, `completed`) |

The node key is `requestId`.

## Edges

Call graph edges carry an `edgeType` attribute:

| `edgeType` | Meaning | Added by |
|-----------|---------|----------|
| `triggered` | Parent call caused child call to execute | `call.requested` with `parentRequestId` |
| `depends_on` | Data dependency: source needs target's result | Explicit declaration (not auto-populated) |

`depends_on` edges are not auto-populated by the call protocol. They represent data dependencies that aren't captured by the parent-child hierarchy. They may be added by:

- Workflow template instantiation (the template knows which steps depend on which)
- Explicit `addDependency(source, target)` calls by the hub coordinator

### Edge Key Convention

`triggered` edges use `${parentRequestId}->${childRequestId}` as the edge key. `depends_on` edges use `${sourceRequestId}->${targetRequestId}:depends_on` to distinguish them from `triggered` edges between the same pair.

Because the graph uses `multi: false`, there can be at most one `triggered` and one `depends_on` edge between the same pair. The edge key convention ensures deterministic keys.
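
The key convention can be captured in one helper. The function name `edgeKey` is hypothetical and used only for illustration; the two key formats are the ones given above.

```typescript
type EdgeType = "triggered" | "depends_on";

// Derive the deterministic edge key for a (source, target, edgeType) triple.
// `triggered` keys have no suffix; `depends_on` keys carry a ":depends_on" suffix.
function edgeKey(source: string, target: string, edgeType: EdgeType): string {
  const base = `${source}->${target}`;
  return edgeType === "depends_on" ? `${base}:depends_on` : base;
}
```

Deriving the key from its endpoints means a caller can check for or remove an edge without first searching for it.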

## Status Lifecycle

Call node status transitions follow a strict state machine:

```
                 call.requested
                        │
                        ▼
                   ┌─────────┐
                   │ pending │
                   └────┬────┘
                        │
                 handler starts
                        │
                        ▼
                   ┌─────────┐
         ┌─────────│ running │─────────┐
         │         └────┬────┘         │
   call.aborted         │        call.aborted
         │              │              │
         ▼              │              ▼
    ┌─────────┐         │         ┌─────────┐
    │ aborted │         │         │ aborted │
    └─────────┘         │         └─────────┘
                        │
         ┌──────────────┼──────────────┐
         │              │              │
  call.responded        │         call.error
         │              │              │
         ▼              │              ▼
   ┌───────────┐        │          ┌────────┐
   │ completed │        │          │ failed │
   └───────────┘        │          └────────┘
                        │
                 call.completed
                        │
                        ▼
                  ┌───────────┐
                  │ completed │
                  └───────────┘
```

Invalid transitions (e.g., `completed` → `running`) throw `InvalidTransitionError`. The `updateStatus()` method validates the transition before applying it.
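
Transition validation of this kind is commonly implemented as a lookup table. The sketch below encodes only the arrows drawn in the diagram (`pending` → `running`, and `running` → `completed`/`failed`/`aborted`). Whether `updateStatus()` also accepts other transitions, such as `pending` directly to a terminal status, is not stated here, so the table is an assumption. The `assertTransition` helper is a hypothetical name.

```typescript
type CallStatus = "pending" | "running" | "completed" | "failed" | "aborted";

class InvalidTransitionError extends Error {
  constructor(from: CallStatus, to: CallStatus) {
    super(`Invalid status transition: ${from} -> ${to}`);
    this.name = "InvalidTransitionError";
  }
}

// Allowed targets per source status, taken from the diagram only.
// Terminal statuses have no outgoing transitions.
const ALLOWED: Record<CallStatus, CallStatus[]> = {
  pending: ["running"],
  running: ["completed", "failed", "aborted"],
  completed: [],
  failed: [],
  aborted: [],
};

// Throw if moving from `from` to `to` is not in the table.
function assertTransition(from: CallStatus, to: CallStatus): void {
  if (ALLOWED[from].indexOf(to) === -1) {
    throw new InvalidTransitionError(from, to);
  }
}
```

Keeping the rules in a single table makes it straightforward to adjust them if the lifecycle changes, without touching the validation logic.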

## Abort Cascading

When a call is aborted, all of its descendants should also be aborted. The call protocol handles this via `call.aborted` events propagating through `parentRequestId` chains.

The call graph supports this with a traversal query:

```typescript
// Abort cascade: get all descendants of a call
const descendants = callGraph.descendants(requestId);
// → all calls that would be affected by aborting this call
```

The hub coordinator can:

1. Receive `call.aborted` for a parent call
2. Query `callGraph.descendants(requestId)` for all descendants
3. Abort each descendant call via `PendingRequestMap.abort()`

This is a structural operation: the graph provides the "who is affected" information, and the protocol provides the "abort them" mechanism.
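
A descendants query of this kind can be sketched as a breadth-first traversal over `triggered` edges. The adjacency `Map` and the `abort` callback are assumptions made to keep the example self-contained; in the design above the traversal would run on the graph itself, and the callback stands in for `PendingRequestMap.abort()`.

```typescript
// children: parent requestId -> direct child requestIds (the `triggered` edges).
function descendants(children: Map<string, string[]>, requestId: string): string[] {
  const result: string[] = [];
  const seen = new Set<string>([requestId]);
  const queue: string[] = [requestId];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const child of children.get(current) ?? []) {
      if (seen.has(child)) continue; // defensive; a DAG of parent links never revisits
      seen.add(child);
      result.push(child);
      queue.push(child);
    }
  }
  return result;
}

// Abort every descendant of `requestId`; `abort` is a hypothetical callback.
function abortCascade(
  children: Map<string, string[]>,
  requestId: string,
  abort: (id: string) => void,
): string[] {
  const affected = descendants(children, requestId);
  for (const id of affected) abort(id);
  return affected;
}
```

Breadth-first order aborts calls closest to the aborted parent first; the set of affected calls is the same for any traversal order.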

## Observability Queries

The call graph exposes a small set of observability queries:

| Query | Method | Returns |
|-------|--------|---------|
| Get running calls | `filterByStatus("running")` | Node IDs with running status |
| Get failed calls | `filterByStatus("failed")` | Node IDs with failed status |
| Get top-level calls | `getRoots()` | Nodes with no `parentRequestId` |
| Get children of call | `children(requestId)` | Direct children via `triggered` edges |
| Get call duration | `duration(requestId)` | `completedAt - startedAt` (throws if not completed) |
| Get call lineage | `lineage(requestId)` | Ancestor chain from root to this call |
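
As an illustration of the last two rows, here is a minimal sketch of `duration` and `lineage` over a plain node map. It assumes that timestamps are ISO 8601 strings, that duration is reported in milliseconds, and that the lineage includes the call itself as its last element. None of these details are fixed by the table, and `CallNodeLite` is a hypothetical reduced node shape.

```typescript
// Hypothetical reduced node shape with only the fields these queries need.
interface CallNodeLite {
  requestId: string;
  parentRequestId?: string;
  startedAt?: string;
  completedAt?: string;
}

// completedAt - startedAt in milliseconds; throws if either timestamp is missing.
function duration(node: CallNodeLite): number {
  if (node.startedAt === undefined || node.completedAt === undefined) {
    throw new Error(`Call ${node.requestId} has not completed`);
  }
  return Date.parse(node.completedAt) - Date.parse(node.startedAt);
}

// Follow parentRequestId links upward and return the chain ordered root-first.
function lineage(nodes: Map<string, CallNodeLite>, requestId: string): string[] {
  const chain: string[] = [];
  let current: string | undefined = requestId;
  while (current !== undefined) {
    const node = nodes.get(current);
    if (node === undefined) throw new Error(`Unknown call: ${current}`);
    chain.unshift(current);
    current = node.parentRequestId;
  }
  return chain;
}
```

Both queries touch only the nodes on one ancestor chain, so their cost depends on the depth of the call rather than on the size of the graph.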

### filterByStatus

```typescript
filterByStatus(status: CallStatus): string[]
```

Returns all node keys with the given status. Implemented as a filter over `graph.forEachNode()`, which is O(n) in the number of nodes. For small graphs (tens to hundreds of nodes) this is fast. For very large graphs, a status index could be added as an optimization.
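
If such an index were added, it could look roughly like the following. This is a sketch of the general technique rather than a planned API: the `StatusIndex` class and its method names are hypothetical, and it would have to be updated on every status change and node removal to stay consistent with the graph.

```typescript
type CallStatus = "pending" | "running" | "completed" | "failed" | "aborted";

// Secondary index: status -> set of requestIds currently in that status.
class StatusIndex {
  private readonly byStatus = new Map<CallStatus, Set<string>>();
  private readonly current = new Map<string, CallStatus>();

  // Record a status change, moving the id out of its previous bucket.
  set(requestId: string, status: CallStatus): void {
    this.remove(requestId);
    let bucket = this.byStatus.get(status);
    if (bucket === undefined) {
      bucket = new Set<string>();
      this.byStatus.set(status, bucket);
    }
    bucket.add(requestId);
    this.current.set(requestId, status);
  }

  // Forget a call entirely (for example when the node is removed).
  remove(requestId: string): void {
    const previous = this.current.get(requestId);
    if (previous === undefined) return;
    const bucket = this.byStatus.get(previous);
    if (bucket !== undefined) bucket.delete(requestId);
    this.current.delete(requestId);
  }

  // Bucket lookup is constant time; copying costs O(k) for k matching calls.
  filterByStatus(status: CallStatus): string[] {
    const bucket = this.byStatus.get(status);
    return bucket === undefined ? [] : Array.from(bucket);
  }
}
```

The trade-off is extra bookkeeping on every mutation in exchange for status queries that no longer scan all nodes.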

### getRoots

```typescript
getRoots(): string[]
```

Returns the keys of all nodes with `parentRequestId === undefined` (top-level calls). These are the entry points of call chains.

## Serialization and Persistence

```typescript
const data = callGraph.export();            // graphology native JSON
callGraph.toJSON();                         // alias for export()
const restored = FlowGraph.fromJSON(data);  // round-trip
```

The call graph's `export()`/`fromJSON()` boundary is designed for Postgres persistence via the hub's storage layer. Flowgraph does not handle database operations; it provides the serialized format, and the hub handles storage.

Payload fields (`input`, `output`, `error`) are stored as-is in the graph. The hub's storage layer is responsible for truncation and redaction (see `@alkdev/alkhub_ts/docs/architecture/storage/call-graph.md` for the payload handling strategy).

## Mutations

```typescript
// Add a call node (from call.requested event)
addCall(attrs: CallNodeAttrs): void

// Update call status (from call.responded/error/aborted/completed event)
updateStatus(requestId: string, status: CallStatus, extra?: Partial<CallNodeAttrs>): void

// Add a dependency edge (explicit, not auto-populated)
addDependency(source: string, target: string): void

// Remove a call node and its edges
removeCall(requestId: string): void

// Update call attributes (partial merge)
updateCall(requestId: string, attrs: Partial<CallNodeAttrs>): void
```

`updateStatus` validates the transition. `addDependency` validates that both endpoints exist. `removeCall` removes the node and all attached edges (graphology cascade).

## Constraints

- **DAG-only**: call graphs cannot have cycles. A call cannot be its own ancestor. `addCall` with a `parentRequestId` that would create a cycle throws `CycleError`.
- **Status transitions are validated**: invalid transitions throw `InvalidTransitionError`.
- **Node keys are `requestId`**, not `operationId`. Multiple calls to the same operation have different `requestId`s but the same `operationId`.
- **`parentRequestId` is both node attribute and edge**: denormalized for fast point lookups (node attribute) and traversal queries (edge), following the storage schema pattern.
- **`depends_on` edges are not auto-populated**: they represent data dependencies that the call protocol doesn't capture. They must be added explicitly by the hub coordinator or workflow template instantiation.
- **Payload fields are stored as-is**: flowgraph doesn't truncate or redact `input`, `output`, or `error`. That's the hub's responsibility at the persistence boundary.
- **Small graph sizes**: call graphs at hub level are typically tens of nodes. Performance is a non-issue; O(n) traversals are fine.
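
The DAG-only constraint can be checked before an edge is inserted: adding `source → target` closes a cycle exactly when `source` is already reachable from `target`. The sketch below shows that reachability test over a plain successor map. The function name `wouldCreateCycle` and the `Map` representation are assumptions for illustration, not the library's internals.

```typescript
// successors: node key -> keys reachable by one outgoing edge.
// Returns true if adding the edge source -> target would create a cycle.
function wouldCreateCycle(
  successors: Map<string, string[]>,
  source: string,
  target: string,
): boolean {
  if (source === target) return true; // self-loop
  const stack: string[] = [target];
  const seen = new Set<string>();
  while (stack.length > 0) {
    const current = stack.pop() as string;
    if (current === source) return true; // target already reaches source
    if (seen.has(current)) continue;
    seen.add(current);
    for (const next of successors.get(current) ?? []) stack.push(next);
  }
  return false;
}
```

A mutation such as `addCall` or `addDependency` could run a check like this first and throw `CycleError` when it returns `true`. At the graph sizes described above, the O(n) traversal per insertion is negligible.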

## Open Questions

1. **Should the call graph support `call.requested` events with unknown `operationId`?** If a `call.requested` event references an operation not in the registry, should the node be created with `operationId` set to the unknown value? Yes: the call graph records what happened, not what should have happened. The node gets `status: "pending"` and may later transition to `"failed"` with an `OPERATION_NOT_FOUND` error code.

2. **Should `depends_on` edges be auto-populated from workflow templates?** When a call graph is instantiated from a workflow template, the template's sequential/parallel structure implies data dependencies. Should the template instantiation automatically create `depends_on` edges? This would couple the call graph to the template system, which may not always be desirable.

3. **Should the call graph support multiple graphs simultaneously (one per workflow execution)?** Currently the design assumes one call graph per `FlowGraph` instance. If the hub needs to track multiple concurrent workflows, it would use multiple instances. An alternative is a single graph with workflow-scoped subgraphs.

4. **Should `filterByStatus` use an index?** For small graphs (tens of nodes), a simple filter is fast. For very large graphs, maintaining a `Map<CallStatus, Set<string>>` index would reduce a status query to an O(1) bucket lookup. The index would need to be updated on every `updateStatus()` call.

## References

- Schema: [schema.md](schema.md) (`CallNodeAttrs`, `CallEdgeAttrs`, `CallStatus`, `EdgeType`)
- Call protocol: `@alkdev/alkhub_ts/docs/architecture/call-graph.md`
- Call graph storage: `@alkdev/alkhub_ts/docs/architecture/storage/call-graph.md`
- Call event types: `@alkdev/operations/src/call.ts`
- Taskgraph pattern: `@alkdev/taskgraph_ts/src/graph/construction.ts`