alknet/docs/architecture/decisions/016-abort-cascade-for-nested-calls.md

# ADR-016: Abort Cascade for Nested Calls

## Status

Accepted

## Context

The call protocol allows handlers to compose other operations through
`OperationEnv::invoke()`. This creates a call tree: a parent request spawns
children (via `parent_request_id`), which may spawn their own children. The
tree is the agency chain (ADR-015) — principal delegates to agent, agent may
delegate to sub-agent.

When `call.aborted` arrives for a parent request, the current `PendingRequestMap`
removes only that single entry. The children are unaware — they continue running,
consuming resources, and potentially producing side effects. This is the nested
abort problem:

```
Client calls /agent/chat (r1)
  agent handler calls /fs/readFile via env.invoke (r1-a)
    fs handler calls /db/query via env.invoke (r1-a-1)
  agent handler calls /bash/exec via env.invoke (r1-b)

Client aborts r1 (call.aborted { id: "r1" })
  → r1 removed from PendingRequestMap
  → r1-a, r1-a-1, r1-b continue running (ghost work)
  → bash/exec keeps executing (unwanted side effect)
  → db/query keeps running (wasted resources)
  → results produced that nobody consumes
```

The `@alkdev/flowgraph` TypeScript package solved this with a directed graph
that tracks the call tree and a `FailurePolicy` enum:

- `"abort-dependents"`: aborting a node cascades to all non-terminal descendants.
  This is the "whole tree should abort" behavior.
- `"continue-running"`: only idle/waiting dependents are aborted; started ones
  keep going. New ones don't start because their predecessors failed/aborted.

The agent use case makes this concrete and urgent: an LLM composes deep, dynamic
call trees (parallel tools, sequential tools, sub-agents calling sub-tools).
Aborting a chat should tear down the entire tree — the LLM HTTP stream, all tool
calls, all sub-calls. But this is a protocol-level concern, not an agent feature:
every consumer (NAPI adapter, Python adapter, any service speaking EventEnvelope)
inherits whatever abort model the protocol defines. The call protocol is a
general-purpose cross-boundary RPC mechanism; nested composition is a core
protocol feature, and abort semantics for that composition are protocol semantics.

## Decision

### 1. `call.aborted` cascades to descendants

When `call.aborted` arrives for a request, the protocol cascades the abort to
all non-terminal descendants in the call tree (identified via `parent_request_id`).
Each descendant receives a `call.aborted` event. The `PendingRequestMap` removes
all affected entries.

The cascade is protocol-level: the event schema carries cascade semantics. A
`call.aborted` for a parent implies abort of all descendants. This is not a
client-side convention — the server (CallAdapter) is responsible for discovering
descendants and propagating the abort.

### 2. Default policy: `abort-dependents`

The default policy is `abort-dependents`: aborting a request aborts everything
downstream, regardless of branch. This is the correct default because aborted
parent work has no consumer waiting for results — continuing is wasted work at
best and unwanted side effects at worst (e.g., a `bash/exec` that keeps running
after the caller stopped caring, a DB mutation that completes after the
transaction was aborted).

### 3. Opt-in policy: `continue-running`

An opt-in `continue-running` policy is available for cases where long-running
work should survive a parent's abort. Under `continue-running`:
- Descendants that have already started (status: running) continue to completion.
- Descendants that haven't started yet (status: pending/waiting) are aborted
  (their predecessors failed, so they can't proceed).
- No new descendants start (the parent is gone).

Use cases for `continue-running`: a long-running subscription that should keep
streaming after its parent's sibling failed; a background task that was spawned
by a handler and should survive the handler's abort.

The composing handler specifies the policy at call time. The policy is set
on the `OperationContext` and propagated to children via `OperationEnv::invoke()`
(see Decision 6 below). The wire caller does not choose it.
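The two policies can be read as a small decision table over a descendant's status. The sketch below is illustrative only: `CallStatus` and `should_abort` are hypothetical names, and it assumes the policy consulted is the one carried on the descendant's own `OperationContext`.

```rust
/// Hypothetical lifecycle states for an entry in the call tree.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum CallStatus {
    Pending,   // registered, handler has not started yet
    Running,   // handler is executing
    Completed, // terminal
    Aborted,   // terminal
}

/// The two policies named in this ADR.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum AbortPolicy {
    AbortDependents,
    ContinueRunning,
}

/// Returns true if a descendant should be aborted when an ancestor is aborted.
/// Assumption: `policy` is the policy carried by the descendant itself.
fn should_abort(policy: AbortPolicy, status: CallStatus) -> bool {
    match (policy, status) {
        // Terminal entries are never touched by the cascade.
        (_, CallStatus::Completed | CallStatus::Aborted) => false,
        // Default: every non-terminal descendant is aborted.
        (AbortPolicy::AbortDependents, _) => true,
        // Opt-in: work that has not started is aborted...
        (AbortPolicy::ContinueRunning, CallStatus::Pending) => true,
        // ...but work that is already running continues to completion.
        (AbortPolicy::ContinueRunning, CallStatus::Running) => false,
    }
}
```

A cascade implementation could apply a check like this to each descendant it visits before sending `call.aborted`.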

### 4. Cleanup hooks

When a call is aborted, handlers need a mechanism to clean up resources: cancel
an HTTP stream, cancel a honker queue job, close a file handle, release a lock.
The protocol provides this through the call lifecycle — when a call is aborted,
the handler's task is cancelled (in Rust, the future is dropped). Cleanup is
handled by `Drop` implementations on resource guards, or by explicit
cancellation callbacks if the handler registers them.

This is a handler-level concern, not a protocol-level one. The protocol's job is
to cascade the abort; the handler's job is to clean up when cancelled. The
mechanism (tokio `CancellationToken`, `Drop` guards, explicit callbacks) is a
two-way door for implementation.
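As a minimal illustration of the `Drop`-guard option, the sketch below uses only the standard library to show that a guard held across an `.await` is released when the suspended future is dropped. `ReleaseOnDrop`, `handler`, and `guard_released_on_drop` are hypothetical names, and the manual poll stands in for whatever runtime actually drives handlers.

```rust
use std::future::Future;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical resource guard: flips a flag when it is dropped.
struct ReleaseOnDrop {
    released: Arc<AtomicBool>,
}

impl Drop for ReleaseOnDrop {
    fn drop(&mut self) {
        self.released.store(true, Ordering::SeqCst);
    }
}

/// Waker that does nothing; enough to poll a future once by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Stand-in for a handler that acquires a resource and then waits forever.
async fn handler(released: Arc<AtomicBool>) {
    let _guard = ReleaseOnDrop { released };
    std::future::pending::<()>().await;
}

/// Polls the handler once, drops it mid-flight, and reports whether the
/// guard's `Drop` ran.
fn guard_released_on_drop() -> bool {
    let released = Arc::new(AtomicBool::new(false));
    let mut fut = Box::pin(handler(released.clone()));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    // The first poll creates the guard and suspends at the await point.
    assert!(matches!(fut.as_mut().poll(&mut cx), Poll::Pending));
    assert!(!released.load(Ordering::SeqCst));

    // Dropping the suspended future is the "abort"; the guard is dropped too.
    drop(fut);
    released.load(Ordering::SeqCst)
}
```

Because `Drop::drop` is synchronous, cleanup that must itself await (for example, notifying an external system) would need one of the other mechanisms listed above.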

### 5. The call tree is tracked via `parent_request_id`

The call tree is already recorded: `OperationContext.parent_request_id` links
each call to its parent. The cascade mechanism walks this tree to find
descendants. No separate graph structure is required at the protocol level —
the `PendingRequestMap` can index entries by `parent_request_id` to enable
efficient descendant lookup.

The `@alkdev/flowgraph` package (directed graph with `descendants()`,
reactive status propagation, `FailurePolicy`) is prior art and may be adapted
as a separate Rust crate for consumers that need richer call-tree visualization
or reactive status tracking. It is not required for the protocol-level cascade
— a parent-indexed map suffices.
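A parent-indexed lookup of this kind might look like the sketch below. `PendingRequests` is a hypothetical, simplified stand-in for `PendingRequestMap`: it tracks only request IDs, ignores abort policy and status, and returns the removed IDs so that a caller could send one `call.aborted` per entry.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for `PendingRequestMap`.
#[derive(Default)]
struct PendingRequests {
    /// request_id -> parent_request_id (None for a root call).
    parent_of: HashMap<String, Option<String>>,
    /// parent_request_id -> direct children.
    children_of: HashMap<String, Vec<String>>,
}

impl PendingRequests {
    /// Registers a pending request and indexes it under its parent.
    fn insert(&mut self, id: &str, parent: Option<&str>) {
        self.parent_of
            .insert(id.to_string(), parent.map(str::to_string));
        if let Some(p) = parent {
            self.children_of
                .entry(p.to_string())
                .or_default()
                .push(id.to_string());
        }
    }

    /// Removes `id` and all of its descendants, returning the removed ids.
    fn abort_cascade(&mut self, id: &str) -> Vec<String> {
        let mut removed = Vec::new();
        if !self.parent_of.contains_key(id) {
            return removed; // unknown or already finished
        }
        // Detach the aborted node from its parent's child list.
        if let Some(Some(parent)) = self.parent_of.get(id).cloned() {
            if let Some(siblings) = self.children_of.get_mut(&parent) {
                siblings.retain(|c| c.as_str() != id);
            }
        }
        // Iterative depth-first walk over the parent index.
        let mut stack = vec![id.to_string()];
        while let Some(current) = stack.pop() {
            if self.parent_of.remove(&current).is_none() {
                continue; // entry completed before the cascade reached it
            }
            if let Some(children) = self.children_of.remove(&current) {
                stack.extend(children);
            }
            removed.push(current);
        }
        removed
    }
}
```

With the tree from the Context section, `abort_cascade("r1")` would return `r1`, `r1-a`, `r1-a-1`, and `r1-b` in some traversal order and leave both maps empty.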

### 6. The abort policy is set on `OperationContext`, not on the wire payload

The abort policy (`abort-dependents` vs `continue-running`) is set on
`OperationContext` and propagated to children via `OperationEnv::invoke()`.
It is NOT a field in the `call.requested` wire payload, and it is NOT a
per-operation declaration on `OperationSpec`.

**Why not the wire payload**: the wire caller doesn't know the composition
tree. The caller of `/agent/chat` cannot meaningfully decide whether
`/fs/readFile` (composed internally by the agent handler) should survive an
abort — the handler that composes the child knows that, not the wire caller.
Putting the policy on the wire payload would give the wire caller control
over internal composition behavior it can't see.

**Why not per-operation declaration**: Assumption 5 says the policy
is per-call, not per-operation. The same operation may need
`abort-dependents` in one composition context and `continue-running` in
another. A static property on `OperationSpec` can't express that.

**How it works on `OperationContext`**: the root context
(`build_root_context` in the CallAdapter) gets the default policy
(`abort-dependents`). When a handler composes a child via
`env.invoke()`, it can specify the policy for that child:

```rust
// Default: abort-dependents (child aborts if parent aborts)
context.env.invoke("fs", "readFile", input, &context).await

// Opt-in: continue-running (child survives parent's abort)
context.env.invoke_with_policy(
    "fs", "readFile", input, &context, AbortPolicy::ContinueRunning
).await
```

The child's `OperationContext` carries the policy. If the child itself
composes grandchildren, the policy **propagates by inheritance** — the
grandchild inherits the child's policy (which was the parent's policy,
unless the parent overrode it for the child via `invoke_with_policy`).
`ContinueRunning` does auto-propagate to grandchildren: if a parent opts
its child into `ContinueRunning`, and the child composes grandchildren
without explicitly overriding, the grandchildren also get
`ContinueRunning`. This is consistent with the composition authority and
scoped env propagation in ADR-022 — the parent handler decides the
child's runtime context, including abort policy, and that decision
propagates through the composition tree by default.

**Review #002 W19 resolution**: `invoke()` with no explicit policy
argument inherits the parent's current policy (option a). It does **not**
reset to `AbortDependents`. A handler that wants a child to reset to the
default must explicitly call `invoke_with_policy(...,
AbortPolicy::AbortDependents)`. This makes the propagation predictable:
the policy I set for my child applies to my child's children unless they
re-decide. The `invoke()` default in operation-registry.md
(`abort_policy: parent.abort_policy.clone()`) is correct.

The `OperationEnv` trait gains an optional policy parameter. The specific
API shape (a separate `invoke_with_policy` method, a policy field on an
`InvokeOptions` struct, or a builder pattern) is a two-way door for
implementation — but the policy enters through `OperationEnv::invoke()`,
not through the wire and not through `OperationSpec`.
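One possible shape for the `InvokeOptions` variant, together with the inherit-unless-overridden rule from the W19 resolution, is sketched below. The struct layouts, the `child_context` helper, and the `build_root_context` signature are assumptions for illustration; the real `OperationContext` carries more fields.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum AbortPolicy {
    AbortDependents,
    ContinueRunning,
}

/// Simplified context: only the fields relevant to abort policy.
#[derive(Clone, Debug)]
struct OperationContext {
    request_id: String,
    parent_request_id: Option<String>,
    abort_policy: AbortPolicy,
}

/// Hypothetical options struct; `None` means "inherit from the parent".
#[derive(Default)]
struct InvokeOptions {
    abort_policy: Option<AbortPolicy>,
}

/// Root contexts always start with the default policy.
fn build_root_context(request_id: &str) -> OperationContext {
    OperationContext {
        request_id: request_id.to_string(),
        parent_request_id: None,
        abort_policy: AbortPolicy::AbortDependents,
    }
}

/// Derives a child's context: an explicit override wins, otherwise the
/// child inherits the parent's current policy (it does not reset).
fn child_context(
    parent: &OperationContext,
    child_id: &str,
    opts: &InvokeOptions,
) -> OperationContext {
    OperationContext {
        request_id: child_id.to_string(),
        parent_request_id: Some(parent.request_id.clone()),
        abort_policy: opts.abort_policy.unwrap_or(parent.abort_policy),
    }
}
```

Under this shape, a grandchild created with `InvokeOptions::default()` beneath a `ContinueRunning` child also gets `ContinueRunning`, and resetting requires passing `Some(AbortPolicy::AbortDependents)` explicitly.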

## Consequences

**Positive:**
- No ghost work. Aborting a parent call tears down the entire tree. Resources
  are released, side effects are halted, no results are produced for absent
  consumers.
- The default (`abort-dependents`) matches the intuitive expectation: if I
  stop caring about the parent, I stop caring about everything it spawned.
- The opt-in (`continue-running`) covers the legitimate exception (long-running
  work that should survive) without making it the default.
- The protocol carries cascade semantics, so every consumer inherits the
  correct behavior — no consumer needs to implement its own abort propagation.
- The `parent_request_id` chain already exists; the cascade mechanism is an
  index on it, not a new data structure.
- Cleanup hooks rely on Rust's cancel-by-drop semantics for futures: dropping
  the handler's future cancels it, and `Drop` guards release resources. This
  is idiomatic Rust, not a custom mechanism.

**Negative:**
- The `PendingRequestMap` needs a parent-indexed lookup (a
  `HashMap<String, Vec<String>>` from `parent_request_id` to child request
  IDs, or a scan). This is a minor implementation cost, not a protocol change.
- The `call.aborted` event schema carries cascade semantics — clients that
  don't understand cascade (future versions, other implementations) would
  need to handle it. Mitigated: cascade is server-side (the CallAdapter walks
  the tree and sends `call.aborted` per descendant), so clients see individual
  abort events regardless of whether they understand the cascade concept.
- The `continue-running` policy adds a parameter to the call lifecycle.
  Decision 6 places it on `OperationContext`; the API shape for setting it
  is a two-way door, but the existence of the policy is a one-way commitment.

## Assumptions

1. **Aborting a parent should abort descendants by default.** If the default
   should be `continue-running` (descendants survive), this ADR is wrong. The
   assumption is that ghost work is worse than premature cancellation — a
   cancelled descendant can be retried, but a ghost process consuming
   resources and producing unwanted side effects is harder to recover from.

2. **The server (CallAdapter) is responsible for cascade.** The client sends
   `call.aborted` for one request ID; the server discovers descendants and
   propagates. If the client were responsible for cascading, it would need to
   know the full tree — which it may not (server-side composition creates
   children the client never saw).

3. **`parent_request_id` is sufficient to discover descendants.** The call tree
   is a tree (acyclic, single parent per node). If future composition patterns
   create multi-parent relationships (e.g., a shared subcall invoked by two
   parents), the cascade model needs extension. The assumption is that
   composition creates a tree, not a DAG.

4. **Dropping the handler's future is sufficient for cleanup.** Dropping a
   future cancels it at its current await point and runs the `Drop`
   implementations of any guards it holds. If a use case requires explicit
   cleanup callbacks (e.g., external systems that need a signal), the
   mechanism needs extension. The assumption is that `Drop` guards cover the
   common cases (HTTP stream cancellation, file handle release, lock release).

5. **`continue-running` is per-call, not per-operation.** The policy is
   specified at call time via `OperationEnv::invoke()`, not declared at
   registration on `OperationSpec` and not set by the wire caller. The
   composing handler decides the child's policy based on the specific
   context. See Decision 6.

## References

- ADR-012: Call protocol stream model (bidirectional streams, EventEnvelope,
  ID-based correlation)
- ADR-015: Privilege model (the call tree is the agency chain —
  `parent_request_id` traces principal → agent)
- OQ-17: Abort cascade semantics (resolved by this ADR)
- OQ-19: Session-scoped registries (session-scoped operations are in the call
  tree and participate in cascade)
- `@alkdev/flowgraph` TypeScript package — prior art for call-graph tracking
  with `descendants()`, `FailurePolicy`, reactive status propagation
- [call-protocol.md](../crates/call/call-protocol.md)
- [operation-registry.md](../crates/call/operation-registry.md)