Add ADR-026 (vault key model: HD derivation) recording the foundational HD-derivation decision, 74' coin type reservation, SLIP-0010/Ed25519 default, secp256k1 feature-gating, and AES-256-GCM cipher choice. These were previously inline rationale with no ADR (W9).

Extend ADR-018 with an explicit EncryptedData wire format lock: fields, encoding, and semantics are frozen; no removal without a format-version migration (W10).

Resolve the remaining guard clauses and spec decisions:

- W2: Capabilities must be immutable after construction (no interior mutability). Makes the Arc vs deep-copy clone semantics genuinely two-way.
- W5: Published to_* specs are compatibility contracts. Best-effort mappings are two-way before first publication, one-way after. Version generated specs.
- W6: Salt field clarification. The v2 salt is permanently unused; a future KDF is a different derivation family, not a version-indexed path; the field saves a wire-format change only.
- W7: unlock_new returns Zeroizing<String>. The mnemonic is the root of trust and must not linger in freed memory.
- W17: OQ-09 WASM. The server-side dispatch door is honestly closed (Connection is concrete, tokio-bound), not implicitly preserved.
- W18: OQ-10 git. The composability fork (raw smart protocol vs call-protocol projection) is a separate decision from ERC721 scope.
- W20: from_openapi must prefix imported error codes (HTTP_404) to avoid collision with protocol-level codes (NOT_FOUND). Normative rule, not naming convention.
- W21: ScopedOperationEnv field is private. Construction via new()/empty(), query via allows(). Makes the future subgraph refactor non-breaking.
- C13: Connection::set_identity. The endpoint does not read identity() after handle() returns (Connection is moved into the spawned task). Observability is handler-side logging. Simplest honest answer.
- W1: OperationAdapter trait is async, returns Vec<HandlerRegistration>. from_call requires async discovery; ADR-022 changed the return type.
- W11: CompositionAuthority::as_identity() defined. It constructs a synthetic Identity (label as id, scopes, resources) not resolvable via IdentityProvider. Second Identity construction path, acknowledged.
- W14: SecretKey is iroh::SecretKey (Ed25519), consistent with the endpoint's iroh dependency.
- W19: Grandchild abort propagation is inherit-by-default (option a). invoke() with no explicit policy inherits parent's policy. ContinueRunning auto-propagates to grandchildren unless explicitly overridden.

257 lines
13 KiB
Markdown
# ADR-016: Abort Cascade for Nested Calls

## Status

Accepted

## Context

The call protocol allows handlers to compose other operations through
`OperationEnv::invoke()`. This creates a call tree: a parent request spawns
children (via `parent_request_id`), which may spawn their own children. The
tree is the agency chain (ADR-015): principal delegates to agent, agent may
delegate to sub-agent.

When `call.aborted` arrives for a parent request, the current `PendingRequestMap`
removes only that single entry. The children are unaware: they continue running,
consuming resources, and potentially producing side effects. This is the nested
abort problem:

```
Client calls /agent/chat (r1)
  agent handler calls /fs/readFile via env.invoke (r1-a)
    fs handler calls /db/query via env.invoke (r1-a-1)
  agent handler calls /bash/exec via env.invoke (r1-b)

Client aborts r1 (call.aborted { id: "r1" })
  → r1 removed from PendingRequestMap
  → r1-a, r1-a-1, r1-b continue running (ghost work)
  → bash/exec keeps executing (unwanted side effect)
  → db/query keeps running (wasted resources)
  → results produced that nobody consumes
```

The `@alkdev/flowgraph` TypeScript package solved this with a directed graph
that tracks the call tree and a `FailurePolicy` enum:

- `"abort-dependents"`: aborting a node cascades to all non-terminal descendants.
  This is the "whole tree should abort" behavior.
- `"continue-running"`: only idle/waiting dependents are aborted; started ones
  keep going. New ones don't start because their predecessors failed/aborted.

The agent use case makes this concrete and urgent: an LLM composes deep, dynamic
call trees (parallel tools, sequential tools, sub-agents calling sub-tools).
Aborting a chat should tear down the entire tree (the LLM HTTP stream, all tool
calls, all sub-calls). But this is a protocol-level concern, not an agent feature:
every consumer (NAPI adapter, Python adapter, any service speaking EventEnvelope)
inherits whatever abort model the protocol defines. The call protocol is a
general-purpose cross-boundary RPC mechanism; nested composition is a core
protocol feature, and abort semantics for that composition are protocol semantics.

## Decision

### 1. `call.aborted` cascades to descendants

When `call.aborted` arrives for a request, the protocol cascades the abort to
all non-terminal descendants in the call tree (identified via `parent_request_id`).
Each descendant receives a `call.aborted` event. The `PendingRequestMap` removes
all affected entries.

The cascade is protocol-level: the event schema carries cascade semantics. A
`call.aborted` for a parent implies abort of all descendants. This is not a
client-side convention: the server (CallAdapter) is responsible for discovering
descendants and propagating the abort.

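The cascade described in Decision 1 can be sketched as a breadth-first walk over a simplified pending map. This is an illustration under stated assumptions, not the CallAdapter's actual code: `Status`, `Entry`, and `cascade_abort` are hypothetical names, and children are found here by a plain scan rather than an index.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical lifecycle states; the real protocol may track more.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Status {
    Pending,
    Running,
    Completed, // terminal: no abort event is needed
}

/// Hypothetical entry in a simplified pending-request map.
struct Entry {
    parent_request_id: Option<String>,
    status: Status,
}

/// Removes `root` and all of its descendants from `pending`.
/// Returns the ids of the non-terminal entries, i.e. the requests that
/// should each receive a `call.aborted` event (sorted for determinism).
fn cascade_abort(pending: &mut HashMap<String, Entry>, root: &str) -> Vec<String> {
    let mut aborted = Vec::new();
    let mut queue = VecDeque::from([root.to_string()]);

    while let Some(id) = queue.pop_front() {
        // Discover direct children by scanning for a matching parent id.
        let children: Vec<String> = pending
            .iter()
            .filter(|(_, e)| e.parent_request_id.as_deref() == Some(id.as_str()))
            .map(|(child_id, _)| child_id.clone())
            .collect();
        queue.extend(children);

        // Remove the entry; only non-terminal entries get an abort event.
        if let Some(entry) = pending.remove(&id) {
            if entry.status != Status::Completed {
                aborted.push(id);
            }
        }
    }

    aborted.sort();
    aborted
}
```

Applied to the tree from the Context section, `cascade_abort(&mut pending, "r1")` would return `r1`, `r1-a`, `r1-a-1`, and `r1-b`, leaving unrelated requests in the map.
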
### 2. Default policy: `abort-dependents`

The default policy is `abort-dependents`: aborting a request aborts everything
downstream, regardless of branch. This is the correct default because aborted
parent work has no consumer waiting for results, so continuing is wasted work at
best and unwanted side effects at worst (e.g., a `bash/exec` that keeps running
after the caller stopped caring, a DB mutation that completes after the
transaction was aborted).

### 3. Opt-in policy: `continue-running`

An opt-in `continue-running` policy is available for cases where long-running
work should survive a parent's abort. Under `continue-running`:

- Descendants that have already started (status: running) continue to completion.
- Descendants that haven't started yet (status: pending/waiting) are aborted
  (their predecessors failed, so they can't proceed).
- No new descendants start (the parent is gone).

Use cases for `continue-running`: a long-running subscription that should keep
streaming after its parent's sibling failed; a background task that was spawned
by a handler and should survive the handler's abort.

The composing handler, not the wire caller, specifies the policy at call time.
The policy is set on the `OperationContext` and propagated to children via
`OperationEnv::invoke()` (see Decision 6 below). The default remains
`abort-dependents`.

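The two policies reduce to a small per-descendant decision rule. The sketch below assumes a hypothetical `Status` enum limited to non-terminal states and a hypothetical `should_abort` helper; only the `AbortPolicy` variant names come from this ADR.

```rust
/// Policy variants as named in this ADR.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum AbortPolicy {
    AbortDependents,
    ContinueRunning,
}

/// Hypothetical non-terminal states of a descendant call.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Status {
    Pending,
    Waiting,
    Running,
}

/// Decides whether a descendant should be aborted when an ancestor is aborted.
/// `policy` is the policy the descendant was invoked under.
fn should_abort(policy: AbortPolicy, status: Status) -> bool {
    match (policy, status) {
        // Default: everything downstream is aborted.
        (AbortPolicy::AbortDependents, _) => true,
        // Opt-in: work that has already started runs to completion...
        (AbortPolicy::ContinueRunning, Status::Running) => false,
        // ...but work that has not started yet is still aborted.
        (AbortPolicy::ContinueRunning, Status::Pending | Status::Waiting) => true,
    }
}
```
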
### 4. Cleanup hooks

When a call is aborted, handlers need a mechanism to clean up resources: cancel
an HTTP stream, cancel a honker queue job, close a file handle, release a lock.
The protocol provides this through the call lifecycle: when a call is aborted,
the handler's task is cancelled (in Rust, the future is dropped). Cleanup is
handled by `Drop` implementations on resource guards, or by explicit
cancellation callbacks if the handler registers them.

This is a handler-level concern, not a protocol-level one. The protocol's job is
to cascade the abort; the handler's job is to clean up when cancelled. The
mechanism (tokio `CancellationToken`, `Drop` guards, explicit callbacks) is a
two-way door for implementation.

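To illustrate the `Drop`-guard option, here is a std-only sketch showing that a guard held across an `.await` is released when the handler's future is dropped mid-flight. `ResourceGuard`, `handler`, and `abort_mid_flight` are hypothetical names; a real handler would run on an async runtime rather than being polled by hand.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical guard standing in for a lock, file handle, or stream.
struct ResourceGuard {
    released: Arc<AtomicBool>,
}

impl Drop for ResourceGuard {
    fn drop(&mut self) {
        // Cleanup runs here, whether the handler finished or was cancelled.
        self.released.store(true, Ordering::SeqCst);
    }
}

/// Waker that does nothing; enough to poll a future once by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical handler: acquires a resource, then waits forever.
async fn handler(released: Arc<AtomicBool>) {
    let _guard = ResourceGuard { released };
    std::future::pending::<()>().await;
}

/// Starts the handler, then drops its future to simulate an abort.
/// Returns whether the guard's cleanup ran.
fn abort_mid_flight() -> bool {
    let released = Arc::new(AtomicBool::new(false));
    let mut fut: Pin<Box<dyn Future<Output = ()>>> = Box::pin(handler(released.clone()));

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    // First poll: the guard is created and the handler suspends.
    assert!(matches!(fut.as_mut().poll(&mut cx), Poll::Pending));
    assert!(!released.load(Ordering::SeqCst));

    // Abort: dropping the future drops everything it holds, including the guard.
    drop(fut);
    released.load(Ordering::SeqCst)
}
```

Cleanup that must itself await (for example, notifying an external system) cannot run inside `Drop`, which is where explicit cancellation callbacks would come in.
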
### 5. The call tree is tracked via `parent_request_id`

The call tree is already recorded: `OperationContext.parent_request_id` links
each call to its parent. The cascade mechanism walks this tree to find
descendants. No separate graph structure is required at the protocol level:
the `PendingRequestMap` can index entries by `parent_request_id` to enable
efficient descendant lookup.

The `@alkdev/flowgraph` package (directed graph with `descendants()`,
reactive status propagation, `FailurePolicy`) is prior art and may be adapted
as a separate Rust crate for consumers that need richer call-tree visualization
or reactive status tracking. It is not required for the protocol-level cascade;
a parent-indexed map suffices.

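A parent-indexed lookup of the kind described above might look like the following sketch. `ParentIndex`, `insert`, `children_of`, and `descendants` are hypothetical names; how the real `PendingRequestMap` embeds such an index is left open by this ADR.

```rust
use std::collections::HashMap;

/// Hypothetical index from `parent_request_id` to child request ids.
#[derive(Default)]
struct ParentIndex {
    children: HashMap<String, Vec<String>>,
}

impl ParentIndex {
    /// Records a request under its parent, if it has one.
    fn insert(&mut self, request_id: &str, parent_request_id: Option<&str>) {
        if let Some(parent) = parent_request_id {
            self.children
                .entry(parent.to_string())
                .or_default()
                .push(request_id.to_string());
        }
    }

    /// Direct children of `id` (empty if there are none).
    fn children_of(&self, id: &str) -> &[String] {
        self.children.get(id).map(Vec::as_slice).unwrap_or(&[])
    }

    /// All transitive descendants of `root`, in depth-first pre-order.
    fn descendants(&self, root: &str) -> Vec<String> {
        let mut out = Vec::new();
        // Push children in reverse so they are popped in insertion order.
        let mut stack: Vec<&str> = self
            .children_of(root)
            .iter()
            .rev()
            .map(String::as_str)
            .collect();
        while let Some(id) = stack.pop() {
            out.push(id.to_string());
            stack.extend(self.children_of(id).iter().rev().map(String::as_str));
        }
        out
    }
}
```

A complete implementation would also remove index entries when a request reaches a terminal state, so that the index does not outlive the pending map.
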
### 6. The abort policy is set on `OperationContext`, not on the wire payload

The abort policy (`abort-dependents` vs `continue-running`) is set on
`OperationContext` and propagated to children via `OperationEnv::invoke()`.
It is NOT a field in the `call.requested` wire payload, and it is NOT a
per-operation declaration on `OperationSpec`.

**Why not the wire payload**: the wire caller doesn't know the composition
tree. The caller of `/agent/chat` cannot meaningfully decide whether
`/fs/readFile` (composed internally by the agent handler) should survive an
abort: the handler that composes the child knows that, not the wire caller.
Putting the policy on the wire payload would give the wire caller control
over internal composition behavior it can't see.

**Why not per-operation declaration**: Assumption 5 says the policy
is per-call, not per-operation. The same operation may need
`abort-dependents` in one composition context and `continue-running` in
another. A static property on `OperationSpec` can't express that.

**How it works on `OperationContext`**: the root context
(`build_root_context` in the CallAdapter) gets the default policy
(`abort-dependents`). When a handler composes a child via
`env.invoke()`, it can specify the policy for that child:

```rust
// No explicit policy: the child inherits the caller's policy
// (abort-dependents unless an ancestor opted into something else)
context.env.invoke("fs", "readFile", input, &context).await

// Opt-in: continue-running (child survives parent's abort)
context.env.invoke_with_policy(
    "fs", "readFile", input, &context, AbortPolicy::ContinueRunning,
).await
```

The child's `OperationContext` carries the policy. If the child itself
composes grandchildren, the policy **propagates by inheritance**: the
grandchild inherits the child's policy (which was the parent's policy,
unless the parent overrode it for the child via `invoke_with_policy`).
`ContinueRunning` does auto-propagate to grandchildren: if a parent opts
its child into `ContinueRunning`, and the child composes grandchildren
without explicitly overriding, the grandchildren also get
`ContinueRunning`. This is consistent with the composition authority and
scoped env propagation in ADR-022: the parent handler decides the
child's runtime context, including abort policy, and that decision
propagates through the composition tree by default.

**Review #002 W19 resolution**: `invoke()` with no explicit policy
argument inherits the parent's current policy (option a). It does **not**
reset to `AbortDependents`. A handler that wants a child to reset to the
default must explicitly call `invoke_with_policy(...,
AbortPolicy::AbortDependents)`. This makes the propagation predictable:
the policy I set for my child applies to my child's children unless they
re-decide. The `invoke()` default in operation-registry.md
(`abort_policy: parent.abort_policy.clone()`) is correct.

The `OperationEnv` trait gains an optional policy parameter. The specific
API shape (a separate `invoke_with_policy` method, a policy field on an
`InvokeOptions` struct, or a builder pattern) is a two-way door for
implementation, but the policy enters through `OperationEnv::invoke()`,
not through the wire and not through `OperationSpec`.

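The inherit-by-default rule can be sketched with a stripped-down context type. Only the field names `parent_request_id` and `abort_policy` and the `AbortPolicy` variants come from this ADR; the `root` and `child` constructors and the reduced field set are assumptions made for illustration.

```rust
/// Policy variants as named in this ADR; `AbortDependents` is the default.
#[derive(Clone, Copy, PartialEq, Eq, Debug, Default)]
enum AbortPolicy {
    #[default]
    AbortDependents,
    ContinueRunning,
}

/// Stripped-down stand-in for the real `OperationContext`.
#[derive(Clone, Debug)]
struct OperationContext {
    request_id: String,
    parent_request_id: Option<String>,
    abort_policy: AbortPolicy,
}

impl OperationContext {
    /// Hypothetical root constructor: starts with the default policy.
    fn root(request_id: &str) -> Self {
        Self {
            request_id: request_id.to_string(),
            parent_request_id: None,
            abort_policy: AbortPolicy::default(),
        }
    }

    /// Hypothetical child constructor.
    /// `None` models `invoke()`: inherit the parent's current policy.
    /// `Some(p)` models `invoke_with_policy(..., p)`: explicit override.
    fn child(&self, request_id: &str, policy: Option<AbortPolicy>) -> Self {
        Self {
            request_id: request_id.to_string(),
            parent_request_id: Some(self.request_id.clone()),
            abort_policy: policy.unwrap_or(self.abort_policy),
        }
    }
}
```

With this shape, a grandchild created with `None` under a `ContinueRunning` child also gets `ContinueRunning`, and resetting to the default requires passing `Some(AbortPolicy::AbortDependents)` explicitly.
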
## Consequences

**Positive:**
- No ghost work. Under the default policy, aborting a parent call tears down
  the entire tree. Resources are released, side effects are halted, no results
  are produced for absent consumers.
- The default (`abort-dependents`) matches the intuitive expectation: if I
  stop caring about the parent, I stop caring about everything it spawned.
- The opt-in (`continue-running`) covers the legitimate exception (long-running
  work that should survive) without making it the default.
- The protocol carries cascade semantics, so every consumer inherits the
  correct behavior; no consumer needs to implement its own abort propagation.
- The `parent_request_id` chain already exists; the cascade mechanism is an
  index on it, not a new data structure.
- Cleanup hooks are handled by Rust's drop semantics for futures: dropping the
  handler's future cancels it, and `Drop` guards release resources. This is
  idiomatic Rust, not a custom mechanism.

**Negative:**
- The `PendingRequestMap` needs a parent-indexed lookup (a `HashMap<String,
  Vec<String>>` from parent_request_id to child request_ids, or a scan). This
  is a minor implementation cost, not a protocol change.
- The `call.aborted` event schema carries cascade semantics, so clients that
  don't understand cascade (future versions, other implementations) would
  need to handle it. Mitigated: cascade is server-side (the CallAdapter walks
  the tree and sends `call.aborted` per descendant), so clients see individual
  abort events regardless of whether they understand the cascade concept.
- The `continue-running` policy adds a parameter to the call lifecycle. Its
  location is fixed by Decision 6 (on `OperationContext`, entering through
  `OperationEnv::invoke()`); only the API shape is a two-way door. The
  existence of the policy is a one-way commitment.

## Assumptions

1. **Aborting a parent should abort descendants by default.** If the default
   should be `continue-running` (descendants survive), this ADR is wrong. The
   assumption is that ghost work is worse than premature cancellation: a
   cancelled descendant can be retried, but a ghost process consuming
   resources and producing unwanted side effects is harder to recover from.

2. **The server (CallAdapter) is responsible for cascade.** The client sends
   `call.aborted` for one request ID; the server discovers descendants and
   propagates. If the client were responsible for cascading, it would need to
   know the full tree, which it may not (server-side composition creates
   children the client never saw).

3. **`parent_request_id` is sufficient to discover descendants.** The call tree
   is a tree (acyclic, single parent per node). If future composition patterns
   create multi-parent relationships (e.g., a shared subcall invoked by two
   parents), the cascade model needs extension. The assumption is that
   composition creates a tree, not a DAG.

4. **Dropping the handler's future is sufficient for cleanup.** Dropping a Rust
   future cancels it and runs the `Drop` guards it holds. If a use case
   requires explicit cleanup callbacks (e.g., external systems that need a
   signal), the mechanism needs extension. The assumption is that `Drop`
   guards cover the common cases (HTTP stream cancellation, file handle
   release, lock release).

5. **`continue-running` is per-call, not per-operation.** The policy is
   specified at call time via `OperationEnv::invoke()`, not declared at
   registration on `OperationSpec` and not set by the wire caller. The
   composing handler decides the child's policy based on the specific
   context. See Decision 6.

## References

- ADR-012: Call protocol stream model (bidirectional streams, EventEnvelope,
  ID-based correlation)
- ADR-015: Privilege model (the call tree is the agency chain;
  `parent_request_id` traces principal → agent)
- OQ-17: Abort cascade semantics (resolved by this ADR)
- OQ-19: Session-scoped registries (session-scoped operations are in the call
  tree and participate in cascade)
- `@alkdev/flowgraph` TypeScript package: prior art for call-graph tracking
  with `descendants()`, `FailurePolicy`, reactive status propagation
- [call-protocol.md](../crates/call/call-protocol.md)
- [operation-registry.md](../crates/call/operation-registry.md)