ADR-005: event log as single source of truth

Proposed architecture decision: use an append-only execution event log (call protocol events) as ground truth, with status, result, and call graph as projections. Resolves OQ-06, OQ-07, OQ-08, OQ-09; reframes OQ-01, OQ-02, OQ-10. Inspired by event-sourcing discipline (notification vs. state transfer) and the compute_graph ExecutionContext pattern.
2026-05-20 09:33:15 +00:00
parent 27ebbd491e
commit 2c1b2d1a15
3 changed files with 204 additions and 25 deletions
--- a/docs/architecture/open-questions.md
+++ b/docs/architecture/open-questions.md
@@ -14,28 +14,43 @@ Cross-cutting compilation of all unresolved questions across the flowgraph archi
 - When a question is resolved, update its status to `resolved` and add a resolution note
 - Once all questions in a theme are resolved, the theme section can be removed

+## ADR-005 Impact
+
+[ADR-005: Event Log as Single Source of Truth](decisions/005-event-log-as-source-of-truth.md) proposes an Execution Event Log pattern that resolves or reframes several open questions. Affected questions have their status set to `resolved by ADR-005` or `reframed by ADR-005`. Summary:
+
+| Question | ADR-005 Impact |
+|----------|-----------------|
+| OQ-01 | Reframed: an edge can only be incompatible where data flows across it (state-transfer edges). Temporal-only edges don't need type checking. |
+| OQ-02 | Reframed: type compatibility depth only applies to state-transfer edges, not notification edges. |
+| OQ-06 | Resolved: the reactive layer bridges to the call protocol through the event log, not direct signal mutation. |
+| OQ-07 | Resolved: call graph and reactive engine are both projections of the event log. Neither owns the other. |
+| OQ-08 | Resolved: `depends_on` edges unnecessary; data dependencies expressed through result projection. |
+| OQ-09 | Resolved: retries are natural append events, not state mutations. |
+| OQ-10 | Reframed: policy question (abort running nodes?) becomes a projection configuration, not a hardcoded state machine rule. |
+
 ## Theme 1: Edge Semantics and Type Compatibility

 ### OQ-01: Should `fromSpecs()` add ALL edges or only compatible ones?

 - **Origin**: [operation-graph.md](operation-graph.md) Q1
- **Status**: open
+- **Status**: reframed by ADR-005
 - **Priority**: high — affects storage size, API surface, and diagnostic value
 - **Options**:
  - (a) Add both compatible and incompatible edges (current design). Pro: diagnostic information visible. Con: graph is larger.
  - (b) Only add compatible edges, with a `potentialEdges()` query computing incompatible connections on demand. Pro: smaller graph. Con: loses diagnostic information.
 - **Notes**: This decision affects `buildTypeEdges()` in [analysis.md](analysis.md) and `OperationEdgeAttrs` in [schema.md](schema.md). The `compatible: false` attribute on edges only makes sense if option (a) is chosen.
+- **ADR-005 reframing**: An edge can only be incompatible if it is a **state-transfer** edge (where data flows from A's output to B's input). **Temporal-only** edges (where B starts after A completes but doesn't use A's output) don't need type checking at all. This means option (b) may be correct for temporal-only edges, while option (a) is correct for state-transfer edges. The operation graph could distinguish the two kinds with an edge attribute.
 - **Cross-references**: OQ-04

 ### OQ-02: How granular should type compatibility results be?

 - **Origin**: [operation-graph.md](operation-graph.md) Q4, [analysis.md](analysis.md) Q1
- **Status**: open
+- **Status**: reframed by ADR-005
 - **Priority**: high — directly shapes the `typeCompat()` return type and `OperationEdgeAttrs`
 - **Question (merged)**: How deep should `typeCompat` check? Should it be fully recursive? And should the result be `{ compatible, detail? }` or `{ compatible, mismatches: TypeMismatch[] }`?
 - **Current design**: The schema already defines `TypeMismatch` with `{ path, expected, actual }` and `OperationEdgeAttrs` has an optional `mismatches` field. The analysis doc describes deep recursive structural comparison. But there's a tension: full recursive checking is more thorough but may produce false negatives for schemas with dynamic structures.
 - **Notes**: The schema doc already has `mismatches?: TypeMismatch[]` in `OperationEdgeAttrs`. The analysis doc already defines `TypeCompatResult` with `mismatches`. This suggests the design has already converged toward structured mismatch reporting. What remains is confirming: (a) recursive depth limits, (b) handling of `Type.Unknown()` and complex types (unions, intersections), (c) whether the `detail` string field is still needed alongside `mismatches`.
- **Cross-references**: OQ-01 (incompatible edges need mismatch detail)
+- **ADR-005 reframing**: Type compatibility checking only applies to **state-transfer** edges (where A's output flows into B's input). **Temporal-only** edges (where B starts after A but doesn't use A's output) don't need type checking — their "compatibility" is trivially true. This means the operation graph should distinguish between edges that carry data and edges that only express ordering. `typeCompat()` only needs to run on state-transfer edges.

 ### OQ-03: Should subscription operations be treated differently in type compatibility?

@@ -80,18 +95,20 @@ Cross-cutting compilation of all unresolved questions across the flowgraph archi
 ### OQ-06: How does template instantiation interact with the call protocol?

 - **Origin**: [workflow-templates.md](workflow-templates.md) Q4, [host-configs.md](host-configs.md) Q3
- **Status**: open
+- **Status**: resolved by ADR-005
 - **Priority**: high — this is a fundamental integration point between flowgraph and the call protocol
 - **Question (merged)**: When a template is instantiated as a call graph, each `<Operation>` becomes a call. But the call protocol's `call.requested` events include `parentRequestId` — who is the parent? Is it the template instance? The hub coordinator? And how does the `ReactiveHostConfig` bridge to `registry.execute()` or `PendingRequestMap.call()`?
- **Notes**: The consumer-integration doc shows the coordinator calling `registry.execute()` inside an `effect()`, but doesn't specify the `parentRequestId` semantics. This is a consumer-side decision, but flowgraph needs to document: (a) whether the template has its own `requestId`, (b) how the reactive engine signals the coordinator to start a call, (c) whether `ReactiveHostConfig` has a callback prop for this.
+- **ADR-005 resolution**: The reactive layer bridges to the call protocol through the event log. Call protocol events (`call.requested`, `call.responded`, etc.) are appended to the event log. The reactive status projection derives `NodeStatus` from the log. The result projection derives `CallResult` from the log. The hub coordinator appends events; the reactive layer projects them. No callback, no boomerang, no direct signal mutation by the coordinator.
 - **Cross-references**: OQ-07, OQ-08

 ### OQ-07: Should the reactive engine own the call graph?

 - **Origin**: [host-configs.md](host-configs.md) Q4
- **Status**: open
+- **Status**: resolved by ADR-005
 - **Priority**: high — affects the separation between flowgraph and the call protocol
 - **Question**: Currently the call graph (from call-graph.md) and the reactive engine (from reactive-execution.md) are separate concepts. But at runtime, every `<Operation>` in a template becomes a call graph node. Should the reactive engine populate the call graph as a side effect?
+- **ADR-005 resolution**: Neither owns the other. Both the call graph and the reactive status/result projections derive from the same event log. They are independent projections of the same source of truth. The call graph projects the structural view (who triggered whom). The reactive engine projects the behavioral view (what's running, what's blocked). You can have one without the other, or both simultaneously.
 - **Options**:
  - (a) Separate: Call graph is populated by call protocol events. Reactive engine uses signals only. Coordinator bridges them.
  - (b) Unified: Reactive engine creates call graph nodes when nodes transition to `running`, updates them on completion. Call graph is derived from reactive state.
@@ -100,11 +117,10 @@ Cross-cutting compilation of all unresolved questions across the flowgraph archi
 ### OQ-08: Should `depends_on` edges be auto-populated from workflow templates?

 - **Origin**: [call-graph.md](call-graph.md) Q2
- **Status**: open
+- **Status**: resolved by ADR-005
 - **Priority**: medium — affects how the call graph and template system relate
 - **Question**: When a call graph is instantiated from a workflow template, the template's sequential/parallel structure implies data dependencies. Should the template instantiation automatically create `depends_on` edges in the call graph?
- **Notes**: Currently `depends_on` edges must be added explicitly. Auto-population would couple the call graph to the template system. The alternative is for the coordinator to add `depends_on` edges when it instantiates a template.
- **Cross-references**: OQ-06, workflow-templates Q3 (explicit `depends_on` in templates)
+- **ADR-005 resolution**: `depends_on` edges are unnecessary as a separate concept. Data dependencies are expressed through the result projection of the event log. If node B needs node A's output, B reads `getResult("A")` from the result projection. The temporal ordering (A before B) is already expressed by template edges. There's no need for a separate edge type to represent data flow — the event log IS the data transport.

 ---

@@ -113,22 +129,28 @@ Cross-cutting compilation of all unresolved questions across the flowgraph archi
 ### OQ-09: How are retries handled at the signal level?

 - **Origin**: [reactive-execution.md](reactive-execution.md) Q2
- **Status**: open
+- **Status**: resolved by ADR-005
 - **Priority**: high — affects the core status state machine
 - **Question**: If an operation fails and should be retried, the status would need to go `running → failed → ready → running`. But the current state machine marks `failed` as terminal with no exit transitions. How should this work?
 - **Options**:
  - (a) A `retried` status that allows re-entering `ready`. Con: adds another state to `NodeStatus`.
  - (b) A separate `retryCount` attribute. A node can reset its status from `failed` to `ready` if `retryCount < maxRetries`. Con: breaks the terminal-state invariant.
  - (c) Retry creates a new node (new `requestId`). The old node stays `failed`. Con: increases graph size but preserves state machine integrity.
- **Notes**: Option (c) aligns with the call protocol, where each retry is a new call with a new `requestId`. This is likely the right answer but needs confirmation.
+- **ADR-005 resolution**: Option (c) is correct, and the event log makes it natural. A retry is not a state mutation — it's a new sequence of events appended to the log. When `call.requested` arrives for the same operation with a new `requestId`, it's a new fact. The old `call.error` event remains in the log as history. The status projection derives the current state by scanning for the most recent event per node. No `retried` status needed; no state machine mutation; the log preserves full history.
 - **Cross-references**: OQ-10

 ### OQ-10: What happens to running nodes when a predecessor fails?

 - **Origin**: [reactive-execution.md](reactive-execution.md) Q6
- **Status**: open
+- **Status**: reframed by ADR-005
 - **Priority**: high — affects failure propagation correctness
 - **Question**: The current spec transitions `idle` and `waiting` nodes to `aborted` when `blockedByFailure` becomes true. But what about a node that's already `running`? Should it be cancelled?
+- **ADR-005 reframing**: This becomes a policy configuration of the status projection, not a hardcoded state machine rule. The event log records the failure fact. The projection decides: do we abort running nodes that depend on the failed node? The answer depends on the workflow's failure strategy. Option (a) is the default (running nodes continue), but a policy could specify otherwise. The event log makes both strategies expressible without changing the underlying mechanism — only the projection logic changes.
+- **Cross-references**: OQ-09 (retries need to know if a running node can be restarted)
 - **Options**:
  - (a) Running nodes are NOT affected. A predecessor's failure blocks dependents that haven't started, but running nodes continue. The coordinator can cancel them via `prm.abort()` if desired.
  - (b) Running nodes automatically transition to `aborted`. This requires the `effect()` to check for running nodes.
@@ -330,16 +352,16 @@ Cross-cutting compilation of all unresolved questions across the flowgraph archi

 | ID | Question | Origin | Priority | Status |
 |----|----------|--------|----------|--------|
-| OQ-01 | All edges or only compatible edges? | operation-graph | high | open |
-| OQ-02 | Type compatibility depth and granularity | operation-graph, analysis | high | open |
+| OQ-01 | All edges or only compatible edges? | operation-graph | high | reframed by ADR-005 |
+| OQ-02 | Type compatibility depth and granularity | operation-graph, analysis | high | reframed by ADR-005 |
 | OQ-03 | Subscription operations in type compat | operation-graph | medium | open |
 | OQ-04 | `edgeType` on all edges? | schema | medium | open |
 | OQ-05 | Structural container transparency | workflow-templates, host-configs | high | open |
-| OQ-06 | Template ↔ call protocol interaction | workflow-templates, host-configs | high | open |
-| OQ-07 | Should reactive engine own call graph? | host-configs | high | open |
-| OQ-08 | Auto-populate `depends_on` from templates? | call-graph | medium | open |
-| OQ-09 | Retries at signal level | reactive-execution | high | open |
-| OQ-10 | Running nodes when predecessor fails | reactive-execution | high | open |
+| OQ-06 | Template ↔ call protocol interaction | workflow-templates, host-configs | high | resolved by ADR-005 |
+| OQ-07 | Should reactive engine own call graph? | host-configs | high | resolved by ADR-005 |
+| OQ-08 | Auto-populate `depends_on` from templates? | call-graph | medium | resolved by ADR-005 |
+| OQ-09 | Retries at signal level | reactive-execution | high | resolved by ADR-005 |
+| OQ-10 | Running nodes when predecessor fails | reactive-execution | high | reframed by ADR-005 |
 | OQ-11 | OR logic for preconditions | reactive-execution | medium | open |
 | OQ-12 | `maxConcurrency` interaction with preconditions | reactive-execution | medium | open |
 | OQ-13 | `blockedByFailure` vs single computed | reactive-execution | low | open |
@@ -362,13 +384,13 @@ Cross-cutting compilation of all unresolved questions across the flowgraph archi
 ### Priority Assessment

 **High priority** (should resolve before implementation):
- OQ-01: All edges or only compatible — shapes the entire operation graph API
- OQ-02: Type compatibility depth — shapes `typeCompat()` return type
+- ~~OQ-01: All edges or only compatible~~ — reframed by ADR-005: incompatible edges only exist on state-transfer edges
+- ~~OQ-02: Type compatibility depth~~ — reframed by ADR-005: type checking only for state-transfer edges
 - OQ-05: Structural container transparency — fundamental to DAG and reactive engine
- OQ-06: Template ↔ call protocol — fundamental integration point
- OQ-07: Reactive engine owns call graph? — affects architecture boundaries
- OQ-09: Retries — shapes the state machine
- OQ-10: Running node failure handling — shapes failure propagation
+- ~~OQ-06: Template ↔ call protocol~~ — resolved by ADR-005
+- ~~OQ-07: Reactive engine owns call graph?~~ — resolved by ADR-005
+- ~~OQ-09: Retries~~ — resolved by ADR-005
+- ~~OQ-10: Running node failure handling~~ — reframed by ADR-005: policy configuration, not hardcoded

 **Medium priority** (should resolve before v1 release):
 - OQ-03, OQ-04, OQ-08, OQ-11, OQ-12, OQ-14, OQ-17, OQ-20, OQ-21, OQ-22, OQ-29
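
The ADR-005 resolutions for OQ-06 and OQ-09 in this diff describe status and results as projections over an append-only log, with a retry appearing as new events rather than a state mutation. A minimal TypeScript sketch of that idea follows. Only the event names `call.requested`, `call.responded`, and `call.error` come from the document. The field names (`operationId`, `output`, `message`), the reduced `NodeStatus` set, and the helpers `EventLog`, `projectStatus`, and `projectResults` are hypothetical, and keying the projections by operation is an assumption about what "per node" means.

```typescript
// Sketch only. Event names come from the document; field names, the
// NodeStatus subset, and every helper below are illustrative assumptions.
type CallEvent =
  | { type: "call.requested"; requestId: string; operationId: string }
  | { type: "call.responded"; requestId: string; operationId: string; output: unknown }
  | { type: "call.error"; requestId: string; operationId: string; message: string };

// Hypothetical subset of statuses, enough to show the projection.
type NodeStatus = "running" | "completed" | "failed";

// Append-only log: events are added, never edited or removed.
class EventLog {
  private readonly events: CallEvent[] = [];

  append(event: CallEvent): void {
    this.events.push(event);
  }

  all(): readonly CallEvent[] {
    return this.events;
  }
}

// Status projection: the most recent event per operation decides its status.
function projectStatus(log: EventLog): Map<string, NodeStatus> {
  const status = new Map<string, NodeStatus>();
  for (const e of log.all()) {
    switch (e.type) {
      case "call.requested":
        status.set(e.operationId, "running");
        break;
      case "call.responded":
        status.set(e.operationId, "completed");
        break;
      case "call.error":
        status.set(e.operationId, "failed");
        break;
    }
  }
  return status;
}

// Result projection: the output of the latest successful call per operation.
// A new request clears the previous result until a response arrives.
function projectResults(log: EventLog): Map<string, unknown> {
  const results = new Map<string, unknown>();
  for (const e of log.all()) {
    if (e.type === "call.requested") {
      results.delete(e.operationId);
    } else if (e.type === "call.responded") {
      results.set(e.operationId, e.output);
    }
  }
  return results;
}

// Retry scenario: the first attempt fails, the second succeeds under a new requestId.
const log = new EventLog();
log.append({ type: "call.requested", requestId: "r1", operationId: "A" });
log.append({ type: "call.error", requestId: "r1", operationId: "A", message: "timeout" });
const statusAfterFailure = projectStatus(log).get("A"); // "failed"

log.append({ type: "call.requested", requestId: "r2", operationId: "A" });
log.append({ type: "call.responded", requestId: "r2", operationId: "A", output: 42 });
const statusAfterRetry = projectStatus(log).get("A"); // "completed"
const resultAfterRetry = projectResults(log).get("A"); // 42
```

The failed attempt stays in the log as history while both projections reflect the retry. This sketch lets the latest event win regardless of `requestId`; a real projection would probably need to ignore late events from a superseded request.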