ADR-016 locks the abort cascade model:
- call.aborted cascades to all non-terminal descendants via parent_request_id
- Default policy: abort-dependents (abort everything downstream)
- Opt-in: continue-running (started descendants continue, pending ones abort)
- Server (CallAdapter) discovers descendants and propagates; client sends one abort
- Handlers clean up via Rust async drop semantics (Drop guards)
- parent-indexed map suffices for tree walking; flowgraph is optional prior art
Spec updates:
- call-protocol.md abort cascade section references ADR-016
- OQ-17 resolved, ADR-016 referenced across all call crate specs
- README.md updated: ADRs 001-016, OQ-17 moved to resolved
ADR-016: Abort Cascade for Nested Calls
Status
Accepted
Context
The call protocol allows handlers to compose other operations through
OperationEnv::invoke(). This creates a call tree: a parent request spawns
children (via parent_request_id), which may spawn their own children. The
tree is the agency chain (ADR-015) — principal delegates to agent, agent may
delegate to sub-agent.
When call.aborted arrives for a parent request, the current PendingRequestMap
removes only that single entry. The children are unaware — they continue running,
consuming resources, and potentially producing side effects. This is the nested
abort problem:
Client calls /agent/chat (r1)
  agent handler calls /fs/readFile via env.invoke (r1-a)
    fs handler calls /db/query via env.invoke (r1-a-1)
  agent handler calls /bash/exec via env.invoke (r1-b)
Client aborts r1 (call.aborted { id: "r1" })
  → r1 removed from PendingRequestMap
  → r1-a, r1-a-1, r1-b continue running (ghost work)
  → bash/exec keeps executing (unwanted side effect)
  → db/query keeps running (wasted resources)
  → results produced that nobody consumes
The @alkdev/flowgraph TypeScript package solved this with a directed graph
that tracks the call tree and a FailurePolicy enum:
- "abort-dependents": aborting a node cascades to all non-terminal descendants. This is the "whole tree should abort" behavior.
- "continue-running": only idle/waiting dependents are aborted; started ones keep going. New ones don't start because their predecessors failed/aborted.
The agent use case makes this concrete and urgent: an LLM composes deep, dynamic call trees (parallel tools, sequential tools, sub-agents calling sub-tools). Aborting a chat should tear down the entire tree — the LLM HTTP stream, all tool calls, all sub-calls. But this is a protocol-level concern, not an agent feature: every consumer (NAPI adapter, Python adapter, any service speaking EventEnvelope) inherits whatever abort model the protocol defines. The call protocol is a general-purpose cross-boundary RPC mechanism; nested composition is a core protocol feature, and abort semantics for that composition are protocol semantics.
Decision
1. call.aborted cascades to descendants
When call.aborted arrives for a request, the protocol cascades the abort to
all non-terminal descendants in the call tree (identified via parent_request_id).
Each descendant receives a call.aborted event. The PendingRequestMap removes
all affected entries.
The cascade is protocol-level: the event schema carries cascade semantics. A
call.aborted for a parent implies abort of all descendants. This is not a
client-side convention — the server (CallAdapter) is responsible for discovering
descendants and propagating the abort.
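The server-side expansion described above can be sketched as follows. This is a minimal illustration, not the CallAdapter's actual code: the `AbortEvent` struct, the `cascade_abort` function, and the shape of the children index are all assumptions introduced here, and the real EventEnvelope schema is not shown in this ADR. It takes one inbound abort and produces one call.aborted event per affected request, using the example tree from the Context section.

```rust
use std::collections::HashMap;

/// Hypothetical shape of an outbound abort event (not the real EventEnvelope).
#[derive(Debug, PartialEq, Eq)]
pub struct AbortEvent {
    pub event: &'static str,
    pub id: String,
}

/// Expand one inbound abort into one call.aborted event per affected request.
/// `children` maps a parent_request_id to the ids of its in-flight children.
/// Order: the aborted request first, then its descendants depth-first.
pub fn cascade_abort(children: &HashMap<String, Vec<String>>, root: &str) -> Vec<AbortEvent> {
    let mut events = Vec::new();
    let mut stack = vec![root.to_string()];
    while let Some(id) = stack.pop() {
        if let Some(kids) = children.get(&id) {
            // Push in reverse so siblings are visited in spawn order.
            stack.extend(kids.iter().rev().cloned());
        }
        events.push(AbortEvent { event: "call.aborted", id });
    }
    events
}

/// The example tree from the Context section: r1 -> { r1-a -> { r1-a-1 }, r1-b }.
pub fn example_index() -> HashMap<String, Vec<String>> {
    let mut children = HashMap::new();
    children.insert("r1".to_string(), vec!["r1-a".to_string(), "r1-b".to_string()]);
    children.insert("r1-a".to_string(), vec!["r1-a-1".to_string()]);
    children
}
```

Calling `cascade_abort(&example_index(), "r1")` yields events for r1, r1-a, r1-a-1, and r1-b, so the client only ever sends the single abort for r1.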
2. Default policy: abort-dependents
The default policy is abort-dependents: aborting a request aborts everything
downstream, regardless of branch. This is the correct default because aborted
parent work has no consumer waiting for results — continuing is wasted work at
best and unwanted side effects at worst (e.g., a bash/exec that keeps running
after the caller stopped caring, a DB mutation that completes after the
transaction was aborted).
3. Opt-in policy: continue-running
An opt-in continue-running policy is available for cases where long-running
work should survive a parent's abort. Under continue-running:
- Descendants that have already started (status: running) continue to completion.
- Descendants that haven't started yet (status: pending/waiting) are aborted (their predecessors failed, so they can't proceed).
- No new descendants start (the parent is gone).
Use cases for continue-running: a long-running subscription that should keep
streaming after its parent's sibling failed; a background task that was spawned
by a handler and should survive the handler's abort.
The caller or handler specifies the policy at call time. The specific mechanism
(a field in the call.requested payload, a field on OperationContext, or a
per-operation declaration) is a two-way door for implementation.
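The difference between the two policies reduces to a small decision table over a descendant's status. The sketch below is one possible encoding, assuming a hypothetical `AbortPolicy` enum, a hypothetical four-state `Status` enum, and a `should_abort` helper; none of these names come from the call crates, and where the policy value is carried remains the open implementation choice noted above.

```rust
/// Variant names mirror the two policies in this ADR; the enum is hypothetical.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AbortPolicy {
    AbortDependents,
    ContinueRunning,
}

impl Default for AbortPolicy {
    /// abort-dependents is the default policy.
    fn default() -> Self {
        AbortPolicy::AbortDependents
    }
}

/// Hypothetical descendant lifecycle states.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Status {
    Pending,
    Running,
    Completed,
    Aborted,
}

/// Decide whether a descendant with `status` is aborted when one of its
/// ancestors is aborted under `policy`.
pub fn should_abort(policy: AbortPolicy, status: Status) -> bool {
    match (policy, status) {
        // Terminal descendants are never touched by a cascade.
        (_, Status::Completed | Status::Aborted) => false,
        // abort-dependents: everything non-terminal goes.
        (AbortPolicy::AbortDependents, _) => true,
        // continue-running: not-yet-started work is aborted, started work survives.
        (AbortPolicy::ContinueRunning, Status::Pending) => true,
        (AbortPolicy::ContinueRunning, Status::Running) => false,
    }
}
```

Under this encoding the only cell where the policies disagree is a running descendant: `should_abort(AbortPolicy::AbortDependents, Status::Running)` is true, while `should_abort(AbortPolicy::ContinueRunning, Status::Running)` is false.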
4. Cleanup hooks
When a call is aborted, handlers need a mechanism to clean up resources: cancel
an HTTP stream, cancel a honker queue job, close a file handle, release a lock.
The protocol provides this through the call lifecycle — when a call is aborted,
the handler's task is cancelled (in Rust, the future is dropped). Cleanup is
handled by Drop implementations on resource guards, or by explicit
cancellation callbacks if the handler registers them.
This is a handler-level concern, not a protocol-level one. The protocol's job is
to cascade the abort; the handler's job is to clean up when cancelled. The
mechanism (tokio CancellationToken, Drop guards, explicit callbacks) is a
two-way door for implementation.
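The Drop-guard variant of this cleanup can be demonstrated with the standard library alone. In the sketch below, `CleanupGuard`, `NoopWake`, and `abort_by_drop` are hypothetical names for illustration; the guard stands in for a resource such as an HTTP stream or a lock, and the handler is polled by hand with a no-op waker instead of running on a real async runtime.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical resource guard: records its release when dropped.
pub struct CleanupGuard {
    released: Arc<AtomicBool>,
}

impl Drop for CleanupGuard {
    fn drop(&mut self) {
        self.released.store(true, Ordering::SeqCst);
    }
}

/// A waker that does nothing, so the future can be polled without a runtime.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Start a handler that never finishes, then abort it by dropping its future.
/// Returns (released before the drop, released after the drop).
pub fn abort_by_drop() -> (bool, bool) {
    let released = Arc::new(AtomicBool::new(false));
    let flag = released.clone();

    let mut handler: Pin<Box<dyn Future<Output = ()>>> = Box::pin(async move {
        let _guard = CleanupGuard { released: flag };
        // Stands in for a long-running await point, e.g. reading a stream.
        std::future::pending::<()>().await;
    });

    // Poll once so the handler runs to its first await and owns the guard.
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    assert!(matches!(handler.as_mut().poll(&mut cx), Poll::Pending));

    let before = released.load(Ordering::SeqCst);
    drop(handler); // the abort: dropping the future runs the guard's Drop
    let after = released.load(Ordering::SeqCst);
    (before, after)
}
```

One caveat if handlers run as spawned Tokio tasks: dropping a `JoinHandle` detaches the task rather than cancelling it, so the adapter would need `JoinHandle::abort` or a cancellation token to cause the future to be dropped.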
5. The call tree is tracked via parent_request_id
The call tree is already recorded: OperationContext.parent_request_id links
each call to its parent. The cascade mechanism walks this tree to find
descendants. No separate graph structure is required at the protocol level —
the PendingRequestMap can index entries by parent_request_id to enable
efficient descendant lookup.
The @alkdev/flowgraph package (directed graph with descendants(),
reactive status propagation, FailurePolicy) is prior art and may be adapted
as a separate Rust crate for consumers that need richer call-tree visualization
or reactive status tracking. It is not required for the protocol-level cascade
— a parent-indexed map suffices.
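A parent-indexed map of this kind might look like the sketch below. The struct layout and the `insert`, `remove`, `remove_subtree`, and `len` methods are assumptions for illustration and do not reflect the real PendingRequestMap API. The sketch also assumes that a request's children are removed before or together with it, as happens when a handler awaits env.invoke().

```rust
use std::collections::HashMap;

/// Illustrative stand-in for the PendingRequestMap with a parent index.
#[derive(Default)]
pub struct PendingRequestMap {
    /// request id -> its parent_request_id (None for top-level calls)
    parent_of: HashMap<String, Option<String>>,
    /// parent_request_id -> ids of in-flight children
    children_of: HashMap<String, Vec<String>>,
}

impl PendingRequestMap {
    pub fn insert(&mut self, id: &str, parent_request_id: Option<&str>) {
        self.parent_of
            .insert(id.to_string(), parent_request_id.map(str::to_string));
        if let Some(p) = parent_request_id {
            self.children_of
                .entry(p.to_string())
                .or_default()
                .push(id.to_string());
        }
    }

    /// Remove one entry and unlink it from its parent's child list.
    /// Returns false if the id was not in flight.
    pub fn remove(&mut self, id: &str) -> bool {
        let parent = match self.parent_of.remove(id) {
            Some(parent) => parent,
            None => return false,
        };
        if let Some(p) = parent {
            if let Some(kids) = self.children_of.get_mut(&p) {
                kids.retain(|k| k.as_str() != id);
            }
        }
        self.children_of.remove(id);
        true
    }

    /// Abort path: remove `root` and every descendant, returning the removed ids.
    pub fn remove_subtree(&mut self, root: &str) -> Vec<String> {
        let mut ids = Vec::new();
        let mut stack = vec![root.to_string()];
        while let Some(id) = stack.pop() {
            if let Some(kids) = self.children_of.get(&id) {
                stack.extend(kids.iter().cloned());
            }
            ids.push(id);
        }
        // Keep only ids that were actually in flight.
        ids.retain(|id| self.remove(id));
        ids
    }

    pub fn len(&self) -> usize {
        self.parent_of.len()
    }
}
```

The index costs one extra map entry per parent and makes descendant lookup proportional to the size of the aborted subtree rather than the size of the whole map, which is the trade the "or a scan" alternative would give up.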
Consequences
Positive:
- No ghost work. Aborting a parent call tears down the entire tree. Resources are released, side effects are halted, no results are produced for absent consumers.
- The default (abort-dependents) matches the intuitive expectation: if I stop caring about the parent, I stop caring about everything it spawned.
- The opt-in (continue-running) covers the legitimate exception (long-running work that should survive) without making it the default.
- The protocol carries cascade semantics, so every consumer inherits the correct behavior; no consumer needs to implement its own abort propagation.
- The parent_request_id chain already exists; the cascade mechanism is an index on it, not a new data structure.
- Cleanup hooks are handled by Rust's async drop semantics: dropping the handler's future cancels it, and Drop guards release resources. This is idiomatic Rust, not a custom mechanism.
Negative:
- The PendingRequestMap needs a parent-indexed lookup (a HashMap<String, Vec<String>> from parent_request_id to child request_ids, or a scan). This is a minor implementation cost, not a protocol change.
- The call.aborted event schema carries cascade semantics, so clients that don't understand cascade (future versions, other implementations) would need to handle it. Mitigated: cascade is server-side (the CallAdapter walks the tree and sends call.aborted per descendant), so clients see individual abort events regardless of whether they understand the cascade concept.
- The continue-running policy adds a parameter to the call lifecycle. The specific location (payload field, context field, per-operation declaration) is a two-way door, but the existence of the policy is a one-way commitment.
Assumptions
- Aborting a parent should abort descendants by default. If the default should be continue-running (descendants survive), this ADR is wrong. The assumption is that ghost work is worse than premature cancellation: a cancelled descendant can be retried, but a ghost process consuming resources and producing unwanted side effects is harder to recover from.
- The server (CallAdapter) is responsible for cascade. The client sends call.aborted for one request ID; the server discovers descendants and propagates. If the client were responsible for cascading, it would need to know the full tree, which it may not (server-side composition creates children the client never saw).
- parent_request_id is sufficient to discover descendants. The call tree is a tree (acyclic, single parent per node). If future composition patterns create multi-parent relationships (e.g., a shared subcall invoked by two parents), the cascade model needs extension. The assumption is that composition creates a tree, not a DAG.
- Dropping the handler's future is sufficient for cleanup. Rust's async drop semantics cancel the future and run Drop guards. If a use case requires explicit cleanup callbacks (e.g., external systems that need a signal), the mechanism needs extension. The assumption is that Drop guards cover the common cases (HTTP stream cancellation, file handle release, lock release).
- continue-running is per-call, not per-operation. The policy is specified at call time, not declared at registration. If the policy should be a static property of the operation (declared in OperationSpec), the model changes. The assumption is that the caller or handler decides at call time based on the specific context.
References
- ADR-012: Call protocol stream model (bidirectional streams, EventEnvelope, ID-based correlation)
- ADR-015: Privilege model (the call tree is the agency chain; parent_request_id traces principal → agent)
- OQ-17: Abort cascade semantics (resolved by this ADR)
- OQ-19: Session-scoped registries (session-scoped operations are in the call tree and participate in cascade)
- @alkdev/flowgraph TypeScript package: prior art for call-graph tracking with descendants(), FailurePolicy, reactive status propagation
- call-protocol.md
- operation-registry.md