resolve all remaining open questions (OQ-03–OQ-29), add ADR-006

Resolve all 19 remaining open questions across the architecture. Every question now has a documented resolution with rationale:

- OQ-004/OQ-029: edgeType is a universal required attribute on all edges, single graph per FlowGraph instance (ADR-006)
- OQ-011: No OR preconditions for v1; preconditionMode as v2 extension
- OQ-012: maxConcurrency enforced via reactive counting semaphore
- OQ-014: Unknown operationId creates node with pending status
- OQ-017: Expose common graphology traversal methods on FlowGraph (80/20)
- OQ-020: condition as Type.Unknown() with string/function documentation
- OQ-022: Identity imported from @alkdev/operations peer dep
- All other questions resolved with documented rationale

Fix three critical issues found by architecture review:

1. edgeType serialization/validation gap: document two-step validation
2. CallEdgeAttrs runtime discrimination: edgeType as runtime discriminant, depends_on edges clarified as observability-only (not execution)
3. ADR-005 signal mutation inconsistency: explicitly distinguish call-level statuses (event-log-driven) from workflow-derived statuses (signal-mutation)

Additional clarifications:

- dataFlow inference uses conservative strategy (defaults false)
- Conditional.test string resolution: operationName → status === completed
- Add negated field to TemplateEdgeAttrs for else-branch conditions
- Document edge key priority convention for composite keys
- Add maxConcurrency semaphore design to reactive-execution.md
2026-05-21 09:25:55 +00:00
parent c76be7f689
commit f3e084d02f
9 changed files with 239 additions and 268 deletions
--- a/docs/architecture/reactive-execution.md
+++ b/docs/architecture/reactive-execution.md
@@ -1,6 +1,6 @@
 ---
 status: draft
-last_updated: 2026-05-21
+last_updated: 2026-05-22
 ---

 # Reactive Execution
@@ -376,67 +376,7 @@ Both B and C become `ready` at the same time, and the hub starts them in paralle

 ### Join preconditions

-When a node depends on multiple predecessors (e.g., D depends on both B and C completing):
-
-- D's preconditions: `B.status === "completed" && C.status === "completed"`
-
-D only becomes `ready` when all predecessors complete. This is the "join" in fork-join parallelism.
-
-## Failure Propagation
-
-Failure propagation is the mechanism by which a failed or aborted node causes its downstream dependents to abort. The key design principle: **failure follows dependency edges, not structural scope**.
-
-This means:
-- In a `Sequential` group, failure propagates forward through the chain (B depends on A, so if A fails, B aborts)
-- In a `Parallel` group, sibling branches are independent — a failure in branch A does NOT affect branch B, because there are no dependency edges between them
-- A node that depends on multiple predecessors (a join) aborts only when it's impossible for its preconditions to ever be met
-
-### The preconditions-failure duality
-
-Each node has two complementary reactive computations:
-
-1. **`preconditions`** (`computed<boolean>`) — true when all predecessors are `completed` or `skipped`. Node can start.
-2. **`blockedByFailure`** (`computed<boolean>`) — true when any predecessor is `failed` or `aborted` and the failure is uncaught (not handled by a `Conditional`).
-
-```typescript
-const preconditions = computed(() => {
-  const predecessors = graph.inNeighbors(node);
-  return predecessors.every(pred => {
-    const status = statusMap.get(pred)!.value;
-    return status === "completed" || status === "skipped";
-  });
-});
-
-const blockedByFailure = computed(() => {
-  const predecessors = graph.inNeighbors(node);
-  return predecessors.some(pred => {
-    const status = statusMap.get(pred)!.value;
-    return status === "failed" || status === "aborted";
-  });
-});
-```
-
-When `blockedByFailure` becomes `true` and the node hasn't started (`idle` or `waiting`), the node transitions to `aborted`. This happens via an `effect()`:
-
-```typescript
-effect(() => {
-  if (blockedByFailure.value && (status.value === "idle" || status.value === "waiting")) {
-    status.value = "aborted";
-  }
-});
-```
-
-This cascade is automatic and reactive — when a predecessor fails, all downstream `blockedByFailure` computations re-evaluate, and their effects fire, aborting any waiting dependents.
-
-### Sequential failure propagation
-
-```
-A (failed) → B (aborted) → C (aborted)
-```
-
-When A fails, B's `blockedByFailure` becomes true. B transitions from `waiting` to `aborted`. C's `blockedByFailure` then becomes true (B is now `aborted`). C transitions to `aborted`. The entire downstream chain aborts.
-
-### Parallel independence
+When a node depends on multiple predecessors (fork-join):

 ```
        ┌── B (completed) ──┐
@@ -444,36 +384,33 @@ A (completed)                ├── D (ready)
        └── C (failed) ─────┘
 ```

-When C fails:
-- C's downstream dependents see `blockedByFailure = true`
-- B is unaffected — it's on an independent branch
-- D depends on both B and C. D's `preconditions` will never be met (C is `failed`, not `completed`). D's `blockedByFailure` is true (C is `failed`). D transitions to `aborted`.
-
-But crucially, this is because D *depends on* C, not because they share a structural scope:
-
-```
-        ┌── B (completed) ──┐
-A (completed)                │   (no edge from C to E)
-        └── C (failed) ─────┘
-                                    └── E (completed)
-```
-
-E has no dependency on C. E continues running regardless of C's failure. **Failure follows dependency edges, not structural boundaries.**
-
-### Join semantics
-
-When a node depends on multiple predecessors (fork-join):
-
-```
-        ┌── B (completed) ──┐
-A (completed)                ├── D (aborted)
-        └── C (failed) ─────┘
-```
-
 D's `preconditions` requires both B and C to be completed/skipped. Since C is `failed`, D's preconditions can never be met. D transitions to `aborted`.

 The alternative would be "partial success" — D starts with B's output even though C failed. This is NOT supported by the precondition model. If partial execution is needed, the template author should use a `Conditional` to handle the failure case explicitly.

+### `maxConcurrency` for Parallel groups
+
+A `Parallel` group with `maxConcurrency: 3` should only start 3 nodes at a time, even though all preconditions are met. This is a scheduling constraint, not a structural one — the DAG doesn't encode it.
+
+The `WorkflowReactiveRoot` enforces `maxConcurrency` via a reactive counting semaphore:
+
+```typescript
+// For each node in a Parallel group with maxConcurrency:
+const groupKey = getParallelGroup(nodeId);  // from parentMap/siblingMap
+const maxConc = getMaxConcurrency(groupKey); // from template props
+
+const canStart = computed(() => {
+  const siblingRunningCount = siblings.filter(
+    sib => statusMap.get(sib)!.value === "running"
+  ).length;
+  return preconditions.value && siblingRunningCount < maxConc;
+});
+```
+
+A node becomes `ready` only when both its `preconditions` are met AND the number of currently running siblings is below `maxConcurrency`. When a sibling completes and a slot opens, the next ready node starts.
+
+For `Parallel` groups without `maxConcurrency` (the default), all siblings start immediately when their preconditions are met — no semaphore is needed.
+
 ### Conditional as error boundary

 A `Conditional` can catch a failure and redirect to a fallback path:
@@ -771,7 +708,8 @@ The `WorkflowErrorBoundary` catches errors that escape the signal graph (e.g., a

 ## Constraints

-- **Events are the source of truth** (ADR-005) — the hub coordinator appends call protocol events. Status, results, and call graph state are derived from the event log. The coordinator does NOT directly set signal values.
+- **Events are the source of truth for call-level statuses** (ADR-005) — the hub coordinator appends call protocol events. Call-level statuses (`running`, `completed`, `failed`, `aborted` from `call.aborted`) are derived from the event log by the status projection. The coordinator does NOT directly set signal values for these statuses.
+- **Workflow-derived statuses use signal mutation** — statuses that have no call protocol equivalent (`idle`, `waiting`, `ready`, `skipped`, and `aborted` from `blockedByFailure`) are set directly on signals by the reactive engine. This is not a violation of ADR-005's event-log principle — these statuses represent workflow-level concerns (scheduling, failure propagation) that exist outside the call protocol's scope. ADR-005's principle applies to *call protocol events*; it does not forbid the reactive layer from managing its own workflow-level state. See the "Hybrid Status Model" section for the full categorization.
 - **Event processing is idempotent** — processing the same event twice produces the same projected state. The status projection scans for the most recent event per node.
 - **Signals are in-memory** — `WorkflowReactiveRoot` state is not persisted. If the hub restarts, the reactive state is reconstructed from call protocol events + template re-render. The event log itself can be reconstructed from the call protocol event stream.
 - **Failure policy is configurable** — the `FailurePolicy` determines what happens to running nodes when a predecessor fails. Default is `continue-running` (only idle/waiting nodes abort). Alternative is `abort-dependents` (running dependents also abort).
@@ -780,7 +718,8 @@ The `WorkflowErrorBoundary` catches errors that escape the signal graph (e.g., a
 - **Abort is immediate in signals, delayed in protocol** — transitioning a signal to `aborted` is instant, but `prm.abort(requestId)` takes time to propagate through the call protocol. The hub should invoke both.
 - **`skipped` satisfies preconditions** — a `skipped` predecessor is treated as "completed for the purpose of preconditions." It means the branch was deliberately bypassed, not broken.
 - **`failed` and `aborted` block preconditions** — a `failed` or `aborted` predecessor means the dependent's preconditions can never be met. The `blockedByFailure` effect transitions the dependent to `aborted`.
-- **`NodeStatus` and `CallStatus` share terminal states** — `running`, `completed`, `failed`, `aborted` map directly. `idle`, `waiting`, `ready`, `skipped` are workflow-specific additions.
+- **`NodeStatus` and `CallStatus` share terminal states** — `running`, `completed`, `failed`, `aborted` map directly. `idle`, `waiting`, `ready`, `skipped` are workflow-specific additions with no call protocol equivalent.
+- **Edge key format uses composite keys for call graph** — `triggered` edges use `${source}->${target}`, `depends_on` edges use `${source}->${target}:depends_on`. See [schema.md](schema.md) for the full key convention.

 ## Lifecycle and Ownership

@@ -872,15 +811,15 @@ The `ReactiveContext` passed to `ReactiveHostConfig` includes a reference to `wo

 ## Open Questions

-1. **Should preconditions support OR logic?** Currently all predecessors must complete (AND logic). An `anyOf` predicate would allow "start this node as soon as any predecessor completes." This would require an edge attribute or node-level configuration.
+1. ~~**Should preconditions support OR logic?**~~ **Resolved (OQ-011)**: No for v1. All preconditions use AND logic — a node becomes `ready` only when ALL predecessors have reached a satisfying terminal state (`completed` or `skipped`). OR logic (`anyOf`) would introduce significant complexity (what happens when one predecessor completes but another fails? Is the node ready or blocked?) and is already partially addressed by `Conditional` (which provides branch-level either/or semantics). For v2, if OR logic becomes necessary, it should be added as a `preconditionMode: "allOf" | "anyOf"` attribute on `Operation` (node-level, not edge-level), defaulting to `"allOf"`. This is a clean extension point that doesn't change the current precondition model.

 2. ~~**How are retries handled at the signal level?**~~ **Resolved by ADR-005**: Retries are natural append events. A retry creates a new `call.requested` with a new `requestId`. The status projection derives the current state by scanning for the most recent event per node. No `retried` status needed. See the Retry semantics section above.

-3. **Should the reactive graph support partial re-rendering?** If a template changes mid-execution (e.g., a step is added), the ujsx reconciler could diff the old and new trees. But the ReactiveHost only supports mount rendering. Re-rendering would require reconciler support.
+3. ~~**Should the reactive graph support partial re-rendering?**~~ **Resolved (OQ-025)**: Blocked on ujsx reconciler. Currently mount-only. When the reconciler is implemented, flowgraph gains re-rendering through the standard `prepareUpdate`/`commitUpdate` HostConfig methods. The event log persists across re-renders (ADR-005), so re-rendered nodes pick up where they left off. No special reactive-graph re-rendering logic is needed — the reconciler handles tree diffing, and the HostConfig applies mutations.

-4. **How does `maxConcurrency` interact with preconditions?** A `Parallel` group with `maxConcurrency: 3` should only start 3 nodes at a time, even though all preconditions are met. This is a scheduling concern, not a structural one. The reactive layer could implement this as a semaphore signal, or it could be the coordinator's responsibility.
+4. ~~**How does `maxConcurrency` interact with preconditions?**~~ **Resolved (OQ-012)**: `maxConcurrency` is a `Parallel` prop enforced by the `WorkflowReactiveRoot` via a counting semaphore in the reactive layer. When the root initializes signals for nodes in a `Parallel` group with `maxConcurrency: N`, it wraps the precondition logic: a node's effective `ready` transition requires both `preconditions.value === true` AND `runningCount < maxConcurrency`. The `runningCount` is a reactive computed derived from counting sibling nodes currently in the `running` state. This is entirely a reactive-engine concern — the DAG doesn't encode `maxConcurrency` (it's not structural), and the call graph doesn't need to know about it. The `Parallel` component's `maxConcurrency` prop is already part of the template definition; the reactive engine just needs to honor it.

-5. **Should `blockedByFailure` be a separate `computed` or derived from `preconditions`?** Currently the design has two separate computeds — `preconditions` (all predecessors completed/skipped) and `blockedByFailure` (any predecessor failed/aborted). An alternative is a single `computed<NodeReadiness>` that returns `"ready" | "blocked" | "failed"` or similar. This reduces the number of effects but makes the readiness check less composable.
+5. ~~**Should `blockedByFailure` be a separate `computed` or derived from `preconditions`?**~~ **Resolved (OQ-013)**: Keep two separate `computed` values (current design). Two separate computeds are more composable — you can check preconditions independently of failure status, and you can compose different effects for each. A single `computed<NodeReadiness>` would require every consumer to destructure the result, losing the clean `if (preconditions.value) { ... }` pattern. The implementation cost of two effects per node is negligible. The current design is the right one.

 6. ~~**What happens to running nodes when a predecessor fails?**~~ **Resolved by ADR-005/OQ-010**: This is a `FailurePolicy` configuration of the projection. The default policy (`continue-running`) means running nodes continue. An alternative policy (`abort-dependents`) would abort running dependents. The event log makes both strategies expressible — only the projection logic changes.
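
A minimal sketch of the `maxConcurrency` gating from the OQ-012 resolution, written as plain functions rather than signals so that it runs on its own. `preconditionsMet`, `nextToStart`, and the map-based inputs are hypothetical names for illustration, not part of the flowgraph API. One assumption goes beyond the source: eligible siblings are admitted as a batch capped by the number of free slots, which is one way to keep several nodes that become eligible at the same moment from all observing the same running count.

```typescript
type NodeStatus =
  | "idle" | "waiting" | "ready" | "running"
  | "completed" | "failed" | "aborted" | "skipped";

// AND preconditions: every predecessor must be `completed` or `skipped`.
function preconditionsMet(
  preds: readonly string[],
  status: ReadonlyMap<string, NodeStatus>,
): boolean {
  return preds.every((p) => {
    const s = status.get(p);
    return s === "completed" || s === "skipped";
  });
}

// Hypothetical helper: returns the siblings of one Parallel group that may
// start now. Only `running` siblings occupy a slot, as in the diff's snippet.
function nextToStart(
  siblings: readonly string[],
  predecessors: ReadonlyMap<string, readonly string[]>,
  status: ReadonlyMap<string, NodeStatus>,
  maxConcurrency: number = Infinity, // assumption: no limit when the prop is omitted
): string[] {
  const running = siblings.filter((s) => status.get(s) === "running").length;
  const freeSlots = Math.max(0, maxConcurrency - running);

  // Eligible: not yet started and all preconditions satisfied.
  const eligible = siblings.filter((s) => {
    const st = status.get(s);
    return (
      (st === "idle" || st === "waiting") &&
      preconditionsMet(predecessors.get(s) ?? [], status)
    );
  });

  // Admit in sibling order, never more than the free slots.
  return eligible.slice(0, freeSlots);
}
```

With four siblings whose single predecessor has completed and a `maxConcurrency` of 2, the first call returns the first two siblings in declaration order; once one of them completes, the next call returns one more. Omitting the limit admits every eligible sibling, matching the default behavior described in the diff.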