
glm-5.2 cb98f42cd4 docs(architecture): resolve review #002 remaining Tier 4 findings

Add ADR-026 (vault key model — HD derivation) recording the foundational
HD-derivation decision, 74' coin type reservation, SLIP-0010/Ed25519
default, secp256k1 feature-gating, and AES-256-GCM cipher choice. These
were previously inline rationale with no ADR (W9).

Extend ADR-018 with an explicit EncryptedData wire format lock — fields,
encoding, and semantics are frozen; no removal without a format-version
migration (W10).

Resolve the remaining guard clauses and spec decisions:

- W2: Capabilities must be immutable after construction (no interior
  mutability). Makes the Arc vs deep-copy clone semantics genuinely
  two-way.
- W5: Published to_* specs are compatibility contracts — best-effort
  mappings are two-way before first publication, one-way after. Version
  generated specs.
- W6: Salt field clarification — v2 salt is permanently unused; a future
  KDF is a different derivation family, not a version-indexed path; the
  field saves a wire-format change only.
- W7: unlock_new returns Zeroizing<String> — the mnemonic is the root of
  trust and must not linger in freed memory.
- W17: OQ-09 WASM — server-side dispatch door is honestly closed
  (Connection is concrete, tokio-bound), not implicitly preserved.
- W18: OQ-10 git — composability fork (raw smart protocol vs call-protocol
  projection) is a separate decision from ERC721 scope.
- W20: from_openapi must prefix imported error codes (HTTP_404) to avoid
  collision with protocol-level codes (NOT_FOUND). Normative rule, not
  naming convention.
- W21: ScopedOperationEnv field is private — construction via new()/
  empty(), query via allows(). Makes the future subgraph refactor
  non-breaking.
- C13: Connection::set_identity — the endpoint does not read identity()
  after handle() returns (Connection is moved into the spawned task).
  Observability is handler-side logging. Simplest honest answer.
- W1: OperationAdapter trait is async, returns Vec<HandlerRegistration>.
  from_call requires async discovery; ADR-022 changed the return type.
- W11: CompositionAuthority::as_identity() defined — constructs a
  synthetic Identity (label as id, scopes, resources) not resolvable via
  IdentityProvider. Second Identity construction path, acknowledged.
- W14: SecretKey is iroh::SecretKey (Ed25519) — consistent with the
  endpoint's iroh dependency.
- W19: Grandchild abort propagation is inherit-by-default (option a) —
  invoke() with no explicit policy inherits parent's policy. ContinueRunning
  auto-propagates to grandchildren unless explicitly overridden.

2026-06-23 08:20:27 +00:00


ADR-016: Abort Cascade for Nested Calls

Status

Accepted

Context

The call protocol allows handlers to compose other operations through OperationEnv::invoke(). This creates a call tree: a parent request spawns children (via parent_request_id), which may spawn their own children. The tree is the agency chain (ADR-015) — principal delegates to agent, agent may delegate to sub-agent.

When call.aborted arrives for a parent request, the current PendingRequestMap removes only that single entry. The children are unaware — they continue running, consuming resources, and potentially producing side effects. This is the nested abort problem:

Client calls /agent/chat (r1)
  agent handler calls /fs/readFile via env.invoke (r1-a)
    fs handler calls /db/query via env.invoke (r1-a-1)
  agent handler calls /bash/exec via env.invoke (r1-b)

Client aborts r1 (call.aborted { id: "r1" })
  → r1 removed from PendingRequestMap
  → r1-a, r1-a-1, r1-b continue running (ghost work)
  → bash/exec keeps executing (unwanted side effect)
  → db/query keeps running (wasted resources)
  → results produced that nobody consumes

The @alkdev/flowgraph TypeScript package solved this with a directed graph that tracks the call tree and a FailurePolicy enum:

- "abort-dependents": aborting a node cascades to all non-terminal descendants. This is the "whole tree should abort" behavior.
- "continue-running": only idle/waiting dependents are aborted; started ones keep going. New ones don't start because their predecessors failed/aborted.

The agent use case makes this concrete and urgent: an LLM composes deep, dynamic call trees (parallel tools, sequential tools, sub-agents calling sub-tools). Aborting a chat should tear down the entire tree — the LLM HTTP stream, all tool calls, all sub-calls. But this is a protocol-level concern, not an agent feature: every consumer (NAPI adapter, Python adapter, any service speaking EventEnvelope) inherits whatever abort model the protocol defines. The call protocol is a general-purpose cross-boundary RPC mechanism; nested composition is a core protocol feature, and abort semantics for that composition are protocol semantics.

Decision

1. `call.aborted` cascades to descendants

When call.aborted arrives for a request, the protocol cascades the abort to all non-terminal descendants in the call tree (identified via parent_request_id). Each descendant receives a call.aborted event. The PendingRequestMap removes all affected entries.

The cascade is protocol-level: the event schema carries cascade semantics. A call.aborted for a parent implies abort of all descendants. This is not a client-side convention — the server (CallAdapter) is responsible for discovering descendants and propagating the abort.
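The cascade described above can be sketched as a tree walk over the pending entries. This is a minimal sketch, not the CallAdapter implementation: `PendingEntry`, `PendingRequests`, `cascade_abort`, and `example_tree` are hypothetical names, entries in the map are assumed to be non-terminal (completed requests are assumed to have been removed already), and children are found by a plain scan rather than an index. The returned ids are the requests for which a call.aborted event would be emitted.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one PendingRequestMap entry.
#[derive(Debug)]
struct PendingEntry {
    parent_request_id: Option<String>,
}

/// Hypothetical stand-in for PendingRequestMap: request id -> entry.
type PendingRequests = HashMap<String, PendingEntry>;

/// Remove `root` and all of its descendants from the pending map.
/// Returns the removed ids (sorted), i.e. the requests that would each
/// receive a call.aborted event.
fn cascade_abort(pending: &mut PendingRequests, root: &str) -> Vec<String> {
    let mut aborted = Vec::new();
    let mut frontier = vec![root.to_string()];
    while let Some(id) = frontier.pop() {
        // Find direct children by scanning for entries whose parent is `id`.
        let children: Vec<String> = pending
            .iter()
            .filter(|(_, entry)| entry.parent_request_id.as_deref() == Some(id.as_str()))
            .map(|(child_id, _)| child_id.clone())
            .collect();
        frontier.extend(children);
        if pending.remove(&id).is_some() {
            aborted.push(id);
        }
    }
    aborted.sort();
    aborted
}

/// The call tree from the Context section, plus an unrelated request r2.
fn example_tree() -> PendingRequests {
    let mut pending = PendingRequests::new();
    for (id, parent) in [
        ("r1", None),
        ("r1-a", Some("r1")),
        ("r1-a-1", Some("r1-a")),
        ("r1-b", Some("r1")),
        ("r2", None),
    ] {
        let entry = PendingEntry { parent_request_id: parent.map(str::to_string) };
        pending.insert(id.to_string(), entry);
    }
    pending
}
```

Calling `cascade_abort(&mut pending, "r1")` on the example tree removes r1, r1-a, r1-a-1, and r1-b, and leaves r2 untouched.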

2. Default policy: `abort-dependents`

The default policy is abort-dependents: aborting a request aborts everything downstream, regardless of branch. This is the correct default because aborted parent work has no consumer waiting for results — continuing is wasted work at best and unwanted side effects at worst (e.g., a bash/exec that keeps running after the caller stopped caring, a DB mutation that completes after the transaction was aborted).

3. Opt-in policy: `continue-running`

An opt-in continue-running policy is available for cases where long-running work should survive a parent's abort. Under continue-running:

- Descendants that have already started (status: running) continue to completion.
- Descendants that haven't started yet (status: pending/waiting) are aborted (their predecessors failed, so they can't proceed).
- No new descendants start (the parent is gone).

Use cases for continue-running: a long-running subscription that should keep streaming after its parent's sibling failed; a background task that was spawned by a handler and should survive the handler's abort.

The composing handler specifies the policy at call time. The policy is set on the OperationContext and propagated to children via OperationEnv::invoke(); the wire caller does not set it (see Decision 6 below).
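The two policies can be summarized as a decision table over a descendant's status. The sketch below is illustrative only: `AbortPolicy` and its variants appear in this ADR, while `Status`, `Action`, and `on_ancestor_abort` are hypothetical names. It also assumes the policy consulted is the one carried by the descendant's own OperationContext.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AbortPolicy {
    AbortDependents,
    ContinueRunning,
}

/// Hypothetical non-terminal statuses of a descendant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Pending,
    Running,
}

/// Hypothetical outcome of the cascade for one descendant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    Abort,
    KeepRunning,
}

/// Decide what happens to a descendant when an ancestor is aborted.
fn on_ancestor_abort(policy: AbortPolicy, status: Status) -> Action {
    match (policy, status) {
        // Default: everything downstream is aborted, started or not.
        (AbortPolicy::AbortDependents, _) => Action::Abort,
        // Opt-in: work that has already started runs to completion...
        (AbortPolicy::ContinueRunning, Status::Running) => Action::KeepRunning,
        // ...but work that has not started yet is aborted.
        (AbortPolicy::ContinueRunning, Status::Pending) => Action::Abort,
    }
}
```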

4. Cleanup hooks

When a call is aborted, handlers need a mechanism to clean up resources: cancel an HTTP stream, cancel a honker queue job, close a file handle, release a lock. The protocol provides this through the call lifecycle — when a call is aborted, the handler's task is cancelled (in Rust, the future is dropped). Cleanup is handled by Drop implementations on resource guards, or by explicit cancellation callbacks if the handler registers them.

This is a handler-level concern, not a protocol-level one. The protocol's job is to cascade the abort; the handler's job is to clean up when cancelled. The mechanism (a CancellationToken from tokio-util, Drop guards, explicit callbacks) is a two-way door for implementation.
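The Drop-guard variant of cleanup can be demonstrated with the standard library alone. In this sketch, `LockGuard`, `handler`, `NoopWake`, and `abort_after_first_poll` are hypothetical names; a real handler would run on the async runtime rather than being polled by hand. The future is polled once so the guard exists, then dropped to simulate the abort.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

/// Hypothetical resource guard: records its release when dropped.
struct LockGuard {
    released: Arc<AtomicBool>,
}

impl Drop for LockGuard {
    fn drop(&mut self) {
        self.released.store(true, Ordering::SeqCst);
    }
}

/// Hypothetical handler: acquires a resource, then waits forever.
async fn handler(released: Arc<AtomicBool>) {
    let _guard = LockGuard { released };
    std::future::pending::<()>().await;
}

/// Waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Poll the handler once, then drop its future to simulate an abort.
/// Returns (released before drop, released after drop).
fn abort_after_first_poll() -> (bool, bool) {
    let released = Arc::new(AtomicBool::new(false));
    let mut fut: Pin<Box<dyn Future<Output = ()>>> = Box::pin(handler(released.clone()));
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);

    // First poll creates the guard and suspends at the await point.
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    let before = released.load(Ordering::SeqCst);

    // Dropping the future drops the guard it holds.
    drop(fut);
    let after = released.load(Ordering::SeqCst);
    (before, after)
}
```

The guard is released only by the drop, not by any code after the await point, which is why cleanup that must signal an external system may need an explicit callback instead.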

5. The call tree is tracked via `parent_request_id`

The call tree is already recorded: OperationContext.parent_request_id links each call to its parent. The cascade mechanism walks this tree to find descendants. No separate graph structure is required at the protocol level — the PendingRequestMap can index entries by parent_request_id to enable efficient descendant lookup.

The @alkdev/flowgraph package (directed graph with descendants(), reactive status propagation, FailurePolicy) is prior art and may be adapted as a separate Rust crate for consumers that need richer call-tree visualization or reactive status tracking. It is not required for the protocol-level cascade — a parent-indexed map suffices.
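A parent-indexed map of the kind mentioned above might look like the following. This is a sketch under assumptions: `ParentIndex`, `register`, `descendants`, and `example_index` are hypothetical names, and keeping the index in sync when requests complete is left out.

```rust
use std::collections::HashMap;

/// Hypothetical index from parent_request_id to direct child request ids.
#[derive(Default)]
struct ParentIndex {
    children: HashMap<String, Vec<String>>,
}

impl ParentIndex {
    /// Record a new request under its parent (root requests have none).
    fn register(&mut self, request_id: &str, parent_request_id: Option<&str>) {
        if let Some(parent) = parent_request_id {
            self.children
                .entry(parent.to_string())
                .or_default()
                .push(request_id.to_string());
        }
    }

    /// All transitive descendants of `root`, excluding `root` itself.
    fn descendants(&self, root: &str) -> Vec<String> {
        let mut found = Vec::new();
        let mut stack = vec![root.to_string()];
        while let Some(id) = stack.pop() {
            if let Some(kids) = self.children.get(&id) {
                for kid in kids {
                    found.push(kid.clone());
                    stack.push(kid.clone());
                }
            }
        }
        found
    }
}

/// The call tree from the Context section.
fn example_index() -> ParentIndex {
    let mut index = ParentIndex::default();
    index.register("r1", None);
    index.register("r1-a", Some("r1"));
    index.register("r1-a-1", Some("r1-a"));
    index.register("r1-b", Some("r1"));
    index
}
```

Each lookup touches only the entries in the subtree, so the cost of a cascade is proportional to the number of descendants rather than to the size of the whole map.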

6. The abort policy is set on `OperationContext`, not on the wire payload

The abort policy (abort-dependents vs continue-running) is set on OperationContext and propagated to children via OperationEnv::invoke(). It is NOT a field in the call.requested wire payload, and it is NOT a per-operation declaration on OperationSpec.

Why not the wire payload: the wire caller doesn't know the composition tree. The caller of /agent/chat cannot meaningfully decide whether /fs/readFile (composed internally by the agent handler) should survive an abort — the handler that composes the child knows that, not the wire caller. Putting the policy on the wire payload would give the wire caller control over internal composition behavior it can't see.

Why not per-operation declaration: Assumption 5 says the policy is per-call, not per-operation. The same operation may need abort-dependents in one composition context and continue-running in another. A static property on OperationSpec can't express that.

How it works on OperationContext: the root context (build_root_context in the CallAdapter) gets the default policy (abort-dependents). When a handler composes a child via env.invoke(), it can specify the policy for that child:

// Default: abort-dependents (child aborts if parent aborts)
context.env.invoke("fs", "readFile", input, &context).await

// Opt-in: continue-running (child survives parent's abort)
context.env.invoke_with_policy(
    "fs", "readFile", input, &context, AbortPolicy::ContinueRunning
).await

The child's OperationContext carries the policy. If the child itself composes grandchildren, the policy propagates by inheritance — the grandchild inherits the child's policy (which was the parent's policy, unless the parent overrode it for the child via invoke_with_policy). ContinueRunning does auto-propagate to grandchildren: if a parent opts its child into ContinueRunning, and the child composes grandchildren without explicitly overriding, the grandchildren also get ContinueRunning. This is consistent with the composition authority and scoped env propagation in ADR-022 — the parent handler decides the child's runtime context, including abort policy, and that decision propagates through the composition tree by default.

Review #002 W19 resolution: invoke() with no explicit policy argument inherits the parent's current policy (option a). It does not reset to AbortDependents. A handler that wants a child to reset to the default must explicitly call invoke_with_policy(..., AbortPolicy::AbortDependents). This makes the propagation predictable: the policy I set for my child applies to my child's children unless they re-decide. The invoke() default in operation-registry.md (abort_policy: parent.abort_policy.clone()) is correct.

The OperationEnv trait gains an optional policy parameter. The specific API shape (a separate invoke_with_policy method, a policy field on an InvokeOptions struct, or a builder pattern) is a two-way door for implementation — but the policy enters through OperationEnv::invoke(), not through the wire and not through OperationSpec.
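The inheritance rule can be sketched as a context-derivation function. The struct below is a simplified stand-in for OperationContext with only the fields needed here; `root` and `child` are hypothetical helpers, where `None` models invoke() (inherit) and `Some(policy)` models invoke_with_policy (override).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AbortPolicy {
    AbortDependents,
    ContinueRunning,
}

/// Simplified stand-in for OperationContext.
#[derive(Debug, Clone)]
struct OperationContext {
    request_id: String,
    parent_request_id: Option<String>,
    abort_policy: AbortPolicy,
}

impl OperationContext {
    /// Root context: gets the default policy.
    fn root(request_id: &str) -> Self {
        OperationContext {
            request_id: request_id.to_string(),
            parent_request_id: None,
            abort_policy: AbortPolicy::AbortDependents,
        }
    }

    /// Context for a composed child. With no explicit policy the child
    /// inherits the parent's current policy; it does not reset to the default.
    fn child(&self, request_id: &str, policy: Option<AbortPolicy>) -> Self {
        OperationContext {
            request_id: request_id.to_string(),
            parent_request_id: Some(self.request_id.clone()),
            abort_policy: policy.unwrap_or(self.abort_policy),
        }
    }
}
```

For example, `root.child("r1-a", Some(AbortPolicy::ContinueRunning)).child("r1-a-1", None)` yields a grandchild that carries ContinueRunning, while passing `Some(AbortPolicy::AbortDependents)` at any level resets the subtree below it.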

Consequences

Positive:

- No ghost work. Aborting a parent call tears down the entire tree. Resources are released, side effects are halted, no results are produced for absent consumers.
- The default (abort-dependents) matches the intuitive expectation: if I stop caring about the parent, I stop caring about everything it spawned.
- The opt-in (continue-running) covers the legitimate exception (long-running work that should survive) without making it the default.
- The protocol carries cascade semantics, so every consumer inherits the correct behavior; no consumer needs to implement its own abort propagation.
- The parent_request_id chain already exists; the cascade mechanism is an index on it, not a new data structure.
- Cleanup relies on ordinary Rust drop semantics: dropping the handler's future stops it at its current await point and runs the Drop implementations of the values it holds, so Drop guards release resources. This is idiomatic Rust, not a custom mechanism.

Negative:

- The PendingRequestMap needs a parent-indexed lookup (a HashMap<String, Vec<String>> from parent_request_id to child request_ids, or a scan). This is a minor implementation cost, not a protocol change.
- The call.aborted event schema carries cascade semantics, so clients that don't understand cascade (future versions, other implementations) would need to handle it. Mitigated: cascade is server-side (the CallAdapter walks the tree and sends call.aborted per descendant), so clients see individual abort events regardless of whether they understand the cascade concept.
- The continue-running policy adds a parameter to the call lifecycle. Decision 6 fixes where it lives (on OperationContext, entering through OperationEnv::invoke(), not on the wire payload and not on OperationSpec); only the API shape is a two-way door. The existence of the policy is a one-way commitment.

Assumptions

1. Aborting a parent should abort descendants by default. If the default should be continue-running (descendants survive), this ADR is wrong. The assumption is that ghost work is worse than premature cancellation: a cancelled descendant can be retried, but a ghost process consuming resources and producing unwanted side effects is harder to recover from.
2. The server (CallAdapter) is responsible for cascade. The client sends call.aborted for one request ID; the server discovers descendants and propagates. If the client were responsible for cascading, it would need to know the full tree, which it may not (server-side composition creates children the client never saw).
3. parent_request_id is sufficient to discover descendants. The call tree is a tree (acyclic, single parent per node). If future composition patterns create multi-parent relationships (e.g., a shared subcall invoked by two parents), the cascade model needs extension. The assumption is that composition creates a tree, not a DAG.
4. Dropping the handler's future is sufficient for cleanup. A dropped future is never polled again, and the Drop implementations of the values it holds run as part of the drop. If a use case requires explicit cleanup callbacks (e.g., external systems that need a signal), the mechanism needs extension. The assumption is that Drop guards cover the common cases (HTTP stream cancellation, file handle release, lock release).
5. continue-running is per-call, not per-operation. The policy is specified at call time via OperationEnv::invoke(), not declared at registration on OperationSpec and not set by the wire caller. The composing handler decides the child's policy based on the specific context. See Decision 6.

References

- ADR-012: Call protocol stream model (bidirectional streams, EventEnvelope, ID-based correlation)
- ADR-015: Privilege model (the call tree is the agency chain; parent_request_id traces principal → agent)
- OQ-17: Abort cascade semantics (resolved by this ADR)
- OQ-19: Session-scoped registries (session-scoped operations are in the call tree and participate in cascade)
- @alkdev/flowgraph TypeScript package: prior art for call-graph tracking with descendants(), FailurePolicy, reactive status propagation
- call-protocol.md
- operation-registry.md
