alknet/docs/architecture/decisions/016-abort-cascade-for-nested-calls.md
glm-5.2 e2730869ca docs(architecture): add ADR-016 abort cascade for nested calls, resolve OQ-17
ADR-016 locks the abort cascade model:
- call.aborted cascades to all non-terminal descendants via parent_request_id
- Default policy: abort-dependents (abort everything downstream)
- Opt-in: continue-running (started descendants continue, pending ones abort)
- Server (CallAdapter) discovers descendants and propagates; client sends one abort
- Handlers clean up via Rust async drop semantics (Drop guards)
- parent_indexed map suffices for tree walking; flowgraph is optional prior art

Spec updates:
- call-protocol.md abort cascade section references ADR-016
- OQ-17 resolved, ADR-016 referenced across all call crate specs
- README.md updated: ADRs 001-016, OQ-17 moved to resolved
2026-06-18 09:37:19 +00:00


# ADR-016: Abort Cascade for Nested Calls
## Status
Accepted
## Context
The call protocol allows handlers to compose other operations through
`OperationEnv::invoke()`. This creates a call tree: a parent request spawns
children (via `parent_request_id`), which may spawn their own children. The
tree is the agency chain (ADR-015) — principal delegates to agent, agent may
delegate to sub-agent.

When `call.aborted` arrives for a parent request, the current `PendingRequestMap`
removes only that single entry. The children are unaware — they continue running,
consuming resources, and potentially producing side effects. This is the nested
abort problem:
```
Client calls /agent/chat (r1)
  agent handler calls /fs/readFile via env.invoke (r1-a)
    fs handler calls /db/query via env.invoke (r1-a-1)
  agent handler calls /bash/exec via env.invoke (r1-b)

Client aborts r1 (call.aborted { id: "r1" })
  → r1 removed from PendingRequestMap
  → r1-a, r1-a-1, r1-b continue running (ghost work)
  → bash/exec keeps executing (unwanted side effect)
  → db/query keeps running (wasted resources)
  → results produced that nobody consumes
```
The `@alkdev/flowgraph` TypeScript package solved this with a directed graph
that tracks the call tree and a `FailurePolicy` enum:
- `"abort-dependents"`: aborting a node cascades to all non-terminal descendants.
This is the "whole tree should abort" behavior.
- `"continue-running"`: only idle/waiting dependents are aborted; started ones
keep going. New ones don't start because their predecessors failed/aborted.

The agent use case makes this concrete and urgent: an LLM composes deep, dynamic
call trees (parallel tools, sequential tools, sub-agents calling sub-tools).
Aborting a chat should tear down the entire tree — the LLM HTTP stream, all tool
calls, all sub-calls. But this is a protocol-level concern, not an agent feature:
every consumer (NAPI adapter, Python adapter, any service speaking EventEnvelope)
inherits whatever abort model the protocol defines. The call protocol is a
general-purpose cross-boundary RPC mechanism; nested composition is a core
protocol feature, and abort semantics for that composition are protocol semantics.
## Decision
### 1. `call.aborted` cascades to descendants
When `call.aborted` arrives for a request, the protocol cascades the abort to
all non-terminal descendants in the call tree (identified via `parent_request_id`).
Each descendant receives a `call.aborted` event. The `PendingRequestMap` removes
all affected entries.

The cascade is protocol-level: the event schema carries cascade semantics. A
`call.aborted` for a parent implies abort of all descendants. This is not a
client-side convention — the server (CallAdapter) is responsible for discovering
descendants and propagating the abort.
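
The server-side flow described above (one abort in, one `call.aborted` per affected request out) can be sketched as follows. This is a minimal illustration, not the CallAdapter implementation: the `Pending` and `CallAborted` types and the `cascade_abort` function are hypothetical names, the pending set is modeled as a plain `Vec`, and every entry still in it is assumed to be non-terminal.

```rust
/// Hypothetical pending entry: a request id plus its optional parent.
#[derive(Debug, Clone)]
struct Pending {
    id: String,
    parent_request_id: Option<String>,
}

/// Hypothetical stand-in for a `call.aborted` event.
#[derive(Debug, PartialEq)]
struct CallAborted {
    id: String,
}

/// Aborts `root` and every descendant still in `pending`.
/// Returns one event per affected request, breadth-first from the root,
/// and removes the affected entries from `pending`.
fn cascade_abort(pending: &mut Vec<Pending>, root: &str) -> Vec<CallAborted> {
    let mut targets: Vec<String> = vec![root.to_string()];
    let mut i = 0;
    while i < targets.len() {
        let current = targets[i].clone();
        // Linear scan for direct children of `current`.
        for p in pending.iter() {
            if p.parent_request_id.as_deref() == Some(current.as_str())
                && !targets.contains(&p.id)
            {
                targets.push(p.id.clone());
            }
        }
        i += 1;
    }
    // Drop every affected entry from the pending set.
    pending.retain(|p| !targets.contains(&p.id));
    targets.into_iter().map(|id| CallAborted { id }).collect()
}
```

Applied to the tree from the Context section, aborting `r1` yields events for `r1`, `r1-a`, `r1-b`, and `r1-a-1`, while an unrelated request stays pending. The client only ever sent the abort for `r1`.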
### 2. Default policy: `abort-dependents`
The default policy is `abort-dependents`: aborting a request aborts everything
downstream, regardless of branch. This is the correct default because aborted
parent work has no consumer waiting for results — continuing is wasted work at
best and unwanted side effects at worst (e.g., a `bash/exec` that keeps running
after the caller stopped caring, a DB mutation that completes after the
transaction was aborted).
### 3. Opt-in policy: `continue-running`
An opt-in `continue-running` policy is available for cases where long-running
work should survive a parent's abort. Under `continue-running`:
- Descendants that have already started (status: running) continue to completion.
- Descendants that haven't started yet (status: pending/waiting) are aborted
(their predecessors failed, so they can't proceed).
- No new descendants start (the parent is gone).

Use cases for `continue-running`: a long-running subscription that should keep
streaming after its parent's sibling failed; a background task that was spawned
by a handler and should survive the handler's abort.

The caller or handler specifies the policy at call time. The specific mechanism
(a field in the `call.requested` payload, a field on `OperationContext`, or a
per-operation declaration) is a two-way door for implementation.
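
The two policies reduce to a small decision table over a descendant's status. The sketch below is one way to write it, under stated assumptions: the `AbortPolicy` and `CallStatus` enums and the `should_abort` function are illustrative names, not types defined by the call crate, and `pending/waiting` is collapsed into a single `Pending` variant.

```rust
/// Hypothetical policy enum mirroring the two policies in this ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AbortPolicy {
    AbortDependents,
    ContinueRunning,
}

impl Default for AbortPolicy {
    /// `abort-dependents` is the default policy.
    fn default() -> Self {
        AbortPolicy::AbortDependents
    }
}

/// Hypothetical descendant status; `Completed` stands for any terminal state.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CallStatus {
    Pending,
    Running,
    Completed,
}

/// Decides whether a descendant is aborted when an ancestor is aborted.
fn should_abort(policy: AbortPolicy, status: CallStatus) -> bool {
    match (policy, status) {
        // Terminal descendants are never touched by the cascade.
        (_, CallStatus::Completed) => false,
        // Default: everything non-terminal downstream is aborted.
        (AbortPolicy::AbortDependents, _) => true,
        // Opt-in: not-yet-started work is aborted, started work continues.
        (AbortPolicy::ContinueRunning, CallStatus::Pending) => true,
        (AbortPolicy::ContinueRunning, CallStatus::Running) => false,
    }
}
```

Where the policy value lives (payload field, context field, or per-operation declaration) is left open here, as it is in the text above.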
### 4. Cleanup hooks
When a call is aborted, handlers need a mechanism to clean up resources: cancel
an HTTP stream, cancel a honker queue job, close a file handle, release a lock.
The protocol provides this through the call lifecycle — when a call is aborted,
the handler's task is cancelled (in Rust, the future is dropped). Cleanup is
handled by `Drop` implementations on resource guards, or by explicit
cancellation callbacks if the handler registers them.

This is a handler-level concern, not a protocol-level one. The protocol's job is
to cascade the abort; the handler's job is to clean up when cancelled. The
mechanism (tokio `CancellationToken`, `Drop` guards, explicit callbacks) is a
two-way door for implementation.
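
The `Drop`-guard variant can be demonstrated with the standard library alone. The sketch below is illustrative: `ResourceGuard`, `handler`, `NoopWake`, and `abort_by_drop` are hypothetical names, and the future is polled by hand with a no-op waker instead of running on a real executor. It shows that a guard acquired inside a handler is released when the handler's future is dropped before completing.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical resource guard: records its release when dropped.
struct ResourceGuard {
    released: Arc<AtomicBool>,
}

impl Drop for ResourceGuard {
    fn drop(&mut self) {
        self.released.store(true, Ordering::SeqCst);
    }
}

/// Stand-in handler: acquires a guard, then waits forever.
async fn handler(released: Arc<AtomicBool>) {
    let _guard = ResourceGuard { released };
    std::future::pending::<()>().await;
}

/// Waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the handler once so it acquires its guard, then drops the future,
/// which is what an abort amounts to here.
/// Returns (released before the drop, released after the drop).
fn abort_by_drop() -> (bool, bool) {
    let released = Arc::new(AtomicBool::new(false));
    let mut fut: Pin<Box<dyn Future<Output = ()>>> =
        Box::pin(handler(released.clone()));

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    // The handler parks at its await point while holding the guard.
    assert!(matches!(fut.as_mut().poll(&mut cx), Poll::Pending));

    let before = released.load(Ordering::SeqCst);
    drop(fut); // abort: the guard's Drop runs as the future is dropped
    let after = released.load(Ordering::SeqCst);
    (before, after)
}
```

Cleanup that must itself await (for example, notifying an external system) does not fit in a synchronous `Drop` and would need one of the other mechanisms named above.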
### 5. The call tree is tracked via `parent_request_id`
The call tree is already recorded: `OperationContext.parent_request_id` links
each call to its parent. The cascade mechanism walks this tree to find
descendants. No separate graph structure is required at the protocol level —
the `PendingRequestMap` can index entries by `parent_request_id` to enable
efficient descendant lookup.

The `@alkdev/flowgraph` package (directed graph with `descendants()`,
reactive status propagation, `FailurePolicy`) is prior art and may be adapted
as a separate Rust crate for consumers that need richer call-tree visualization
or reactive status tracking. It is not required for the protocol-level cascade
— a parent-indexed map suffices.
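
A parent-indexed lookup of this kind might look like the sketch below. It is an assumption-laden illustration, not the `PendingRequestMap` itself: the `ParentIndex` type and its methods are hypothetical, ids are plain `String`s, and the structure relies on the call tree being acyclic with a single parent per node.

```rust
use std::collections::HashMap;

/// Hypothetical index kept alongside the pending map.
#[derive(Default)]
struct ParentIndex {
    /// child id -> parent id
    parent_of: HashMap<String, String>,
    /// parent id -> child ids, in insertion order
    children: HashMap<String, Vec<String>>,
}

impl ParentIndex {
    /// Records a request; root requests pass `None` as parent.
    fn insert(&mut self, id: &str, parent: Option<&str>) {
        if let Some(p) = parent {
            self.parent_of.insert(id.to_string(), p.to_string());
            self.children
                .entry(p.to_string())
                .or_default()
                .push(id.to_string());
        }
    }

    /// All descendants of `root` (excluding `root`), depth-first.
    fn descendants(&self, root: &str) -> Vec<String> {
        let mut out = Vec::new();
        let mut stack: Vec<&str> = vec![root];
        while let Some(current) = stack.pop() {
            if current != root {
                out.push(current.to_string());
            }
            if let Some(kids) = self.children.get(current) {
                // Reverse so children pop in insertion order.
                stack.extend(kids.iter().rev().map(String::as_str));
            }
        }
        out
    }

    /// Removes `root` and its subtree from the index; returns the descendants.
    fn remove_subtree(&mut self, root: &str) -> Vec<String> {
        let removed = self.descendants(root);
        // Detach `root` from its own parent's child list, if it has one.
        if let Some(parent) = self.parent_of.remove(root) {
            if let Some(siblings) = self.children.get_mut(&parent) {
                siblings.retain(|s| s.as_str() != root);
            }
        }
        self.children.remove(root);
        for id in &removed {
            self.children.remove(id);
            self.parent_of.remove(id);
        }
        removed
    }
}
```

Lookup cost is proportional to the size of the aborted subtree, not to the total number of pending requests, which is the advantage over scanning the whole map.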
## Consequences
**Positive:**
- No ghost work. Aborting a parent call tears down the entire tree. Resources
are released, side effects are halted, no results are produced for absent
consumers.
- The default (`abort-dependents`) matches the intuitive expectation: if I
stop caring about the parent, I stop caring about everything it spawned.
- The opt-in (`continue-running`) covers the legitimate exception (long-running
work that should survive) without making it the default.
- The protocol carries cascade semantics, so every consumer inherits the
correct behavior — no consumer needs to implement its own abort propagation.
- The `parent_request_id` chain already exists; the cascade mechanism is an
index on it, not a new data structure.
- Cleanup needs no custom mechanism: dropping the handler's future cancels it
at its current await point, and `Drop` guards release the resources it held.
This is ordinary, idiomatic Rust.
**Negative:**
- The `PendingRequestMap` needs a parent-indexed lookup (a `HashMap<String,
Vec<String>>` from parent_request_id to child request_ids, or a scan). This
is a minor implementation cost, not a protocol change.
- The `call.aborted` event schema carries cascade semantics — clients that
don't understand cascade (future versions, other implementations) would
need to handle it. Mitigated: cascade is server-side (the CallAdapter walks
the tree and sends `call.aborted` per descendant), so clients see individual
abort events regardless of whether they understand the cascade concept.
- The `continue-running` policy adds a parameter to the call lifecycle. The
specific location (payload field, context field, per-operation declaration)
is a two-way door, but the existence of the policy is a one-way commitment.
## Assumptions
1. **Aborting a parent should abort descendants by default.** If the default
should be `continue-running` (descendants survive), this ADR is wrong. The
assumption is that ghost work is worse than premature cancellation — a
cancelled descendant can be retried, but a ghost process consuming
resources and producing unwanted side effects is harder to recover from.
2. **The server (CallAdapter) is responsible for cascade.** The client sends
`call.aborted` for one request ID; the server discovers descendants and
propagates. If the client were responsible for cascading, it would need to
know the full tree — which it may not (server-side composition creates
children the client never saw).
3. **`parent_request_id` is sufficient to discover descendants.** The call tree
is a tree (acyclic, single parent per node). If future composition patterns
create multi-parent relationships (e.g., a shared subcall invoked by two
parents), the cascade model needs extension. The assumption is that
composition creates a tree, not a DAG.
4. **Dropping the handler's future is sufficient for cleanup.** Dropping a
future in Rust cancels it and runs its `Drop` guards. If a use case
requires explicit cleanup callbacks (e.g., external systems that need a
signal), the mechanism needs extension. The assumption is that `Drop`
guards cover the common cases (HTTP stream cancellation, file handle
release, lock release).
5. **`continue-running` is per-call, not per-operation.** The policy is
specified at call time, not declared at registration. If the policy should
be a static property of the operation (declared in `OperationSpec`), the
model changes. The assumption is that the caller or handler decides at call
time based on the specific context.
## References
- ADR-012: Call protocol stream model (bidirectional streams, EventEnvelope,
ID-based correlation)
- ADR-015: Privilege model (the call tree is the agency chain —
`parent_request_id` traces principal → agent)
- OQ-17: Abort cascade semantics (resolved by this ADR)
- OQ-19: Session-scoped registries (session-scoped operations are in the call
tree and participate in cascade)
- `@alkdev/flowgraph` TypeScript package — prior art for call-graph tracking
with `descendants()`, `FailurePolicy`, reactive status propagation
- [call-protocol.md](../crates/call/call-protocol.md)
- [operation-registry.md](../crates/call/operation-registry.md)