From e2730869cab68339f4a6cd7b6cee9db5fac4d407 Mon Sep 17 00:00:00 2001 From: "glm-5.2" Date: Thu, 18 Jun 2026 09:37:19 +0000 Subject: [PATCH] docs(architecture): add ADR-016 abort cascade for nested calls, resolve OQ-17 ADR-016 locks the abort cascade model: - call.aborted cascades to all non-terminal descendants via parent_request_id - Default policy: abort-dependents (abort everything downstream) - Opt-in: continue-running (started descendants continue, pending ones abort) - Server (CallAdapter) discovers descendants and propagates; client sends one abort - Handlers clean up via Rust async drop semantics (Drop guards) - parent_indexed map suffices for tree walking; flowgraph is optional prior art Spec updates: - call-protocol.md abort cascade section references ADR-016 - OQ-17 resolved, ADR-016 referenced across all call crate specs - README.md updated: ADRs 001-016, OQ-17 moved to resolved --- docs/architecture/README.md | 7 +- docs/architecture/crates/call/README.md | 4 +- .../architecture/crates/call/call-protocol.md | 12 +- .../crates/call/operation-registry.md | 1 - .../016-abort-cascade-for-nested-calls.md | 195 ++++++++++++++++++ docs/architecture/open-questions.md | 12 +- docs/architecture/overview.md | 1 + 7 files changed, 211 insertions(+), 21 deletions(-) create mode 100644 docs/architecture/decisions/016-abort-cascade-for-nested-calls.md diff --git a/docs/architecture/README.md b/docs/architecture/README.md index 29d6338..489fef5 100644 --- a/docs/architecture/README.md +++ b/docs/architecture/README.md @@ -7,9 +7,9 @@ last_updated: 2026-06-20 ## Current State -**Pre-implementation.** The project has completed a pivot from a three-layer model to an ALPN-as-service model. The greenfield workspace contains only `alknet-vault` (stable) and research/reference material. 
Foundational ADRs (001–015) are in place, including the BiStream type definition (ADR-007), vault integration (ADR-008), ALPN router/endpoint (ADR-010), AuthContext structure (ADR-011), call protocol stream model (ADR-012), Rust as canonical implementation language (ADR-013), secret material flow with capability injection (ADR-014), and privilege model with authority context (ADR-015). The alknet-core and alknet-call crate specs are in draft. +**Pre-implementation.** The project has completed a pivot from a three-layer model to an ALPN-as-service model. The greenfield workspace contains only `alknet-vault` (stable) and research/reference material. Foundational ADRs (001–016) are in place, including the BiStream type definition (ADR-007), vault integration (ADR-008), ALPN router/endpoint (ADR-010), AuthContext structure (ADR-011), call protocol stream model (ADR-012), Rust as canonical implementation language (ADR-013), secret material flow with capability injection (ADR-014), privilege model with authority context (ADR-015), and abort cascade for nested calls (ADR-016). The alknet-core and alknet-call crate specs are in draft. -**Next step**: Review alknet-call spec documents, then begin implementation. OQ-11 (handler-level auth resolution observability), OQ-15 (call protocol client and adapter contract), and OQ-17 (abort cascade) will be resolved during or before implementation. +**Next step**: Review alknet-call spec documents, then begin implementation. OQ-11 (handler-level auth resolution observability) and OQ-15 (call protocol client and adapter contract) will be resolved during or before implementation. 
## Architecture Documents @@ -45,6 +45,7 @@ last_updated: 2026-06-20 | [013](decisions/013-rust-canonical-implementation.md) | Rust as Canonical Implementation Language | Accepted | | [014](decisions/014-secret-material-flow-and-capability-injection.md) | Secret Material Flow and Capability Injection | Accepted | | [015](decisions/015-privilege-model-and-authority-context.md) | Privilege Model and Authority Context | Accepted | +| [016](decisions/016-abort-cascade-for-nested-calls.md) | Abort Cascade for Nested Calls | Accepted | ## Open Questions @@ -59,6 +60,7 @@ See [open-questions.md](open-questions.md) for the full tracker. - **OQ-08**: Vault integration — CLI-embedded, assembly-layer only (ADR-008, ADR-014) - **OQ-16**: Safe vault operations for call protocol exposure — none for now (ADR-014) - **OQ-18**: Privilege model — `internal` = authority switch, External/Internal visibility, handler identity + scoped env (ADR-015) +- **OQ-17**: Abort cascade — `call.aborted` cascades to descendants; default `abort-dependents`, `continue-running` opt-in (ADR-016) **Resolved two-way doors:** - **OQ-04**: Dynamic handler registration — static at startup (ADR-010) @@ -72,7 +74,6 @@ See [open-questions.md](open-questions.md) for the full tracker. **Open one-way doors (need ADR before implementation):** - **OQ-15**: Call protocol client and adapter contract — alknet-call needs both the server (CallAdapter) and client (call invocation over QUIC), plus the adapter contract traits (from_*, to_*) that enable composition. ADR-014 constrains the adapter contract: adapters take credential sources from the assembly layer, not static tokens. ADR-015 constrains: adapter-registered operations are `Internal` by default. -- **OQ-17**: Abort cascade semantics — `call.aborted` cascades to descendants. Default `abort-dependents`, `continue-running` opt-in. One-way door on the event schema; mechanism is a two-way door. 
- **OQ-19**: Session-scoped operation registries — agent-written operations in a quickjs sandbox, overlaid on the global registry via `OperationEnv` trait layering. Protocol doesn't need changes; the one-way door is not closing the trait-based composition point. Promotion from session to core requires curation review. **Deferred (not active):** diff --git a/docs/architecture/crates/call/README.md b/docs/architecture/crates/call/README.md index 311a27d..ea87563 100644 --- a/docs/architecture/crates/call/README.md +++ b/docs/architecture/crates/call/README.md @@ -30,6 +30,7 @@ Structured RPC over QUIC: operations, request/response, streaming subscriptions, | [012](../../decisions/012-call-protocol-stream-model.md) | Call Protocol Stream Model | Bidirectional streams, EventEnvelope, ID-based correlation | | [014](../../decisions/014-secret-material-flow-and-capability-injection.md) | Secret Material Flow and Capability Injection | Call protocol carries no secret material; capabilities injected at assembly layer | | [015](../../decisions/015-privilege-model-and-authority-context.md) | Privilege Model and Authority Context | `internal` = authority switch not ACL skip; External/Internal visibility; handler identity + scoped env | +| [016](../../decisions/016-abort-cascade-for-nested-calls.md) | Abort Cascade for Nested Calls | `call.aborted` cascades to descendants; default `abort-dependents`, `continue-running` opt-in | ## Relevant Open Questions @@ -40,7 +41,6 @@ Structured RPC over QUIC: operations, request/response, streaming subscriptions, | OQ-14 | Batch operation semantics | resolved | Correlated `call.requested` events is the correct protocol design | | OQ-15 | Call protocol client and adapter contract | open | ADR-014 constrains adapters: credential sources, not static tokens. 
ADR-015: adapter ops are Internal by default | | OQ-16 | Safe vault operations for call protocol exposure | resolved (ADR-014) | None exposed for now | -| OQ-17 | Abort cascade semantics | open | `call.aborted` cascades to descendants; default `abort-dependents`, `continue-running` opt-in. One-way door on event schema | | OQ-19 | Session-scoped operation registries | open | Agent-written operations overlaid on global registry via `OperationEnv` trait layering. Protocol doesn't need changes; one-way door is not closing the trait-based composition point | ## Key Design Principles @@ -52,5 +52,5 @@ Structured RPC over QUIC: operations, request/response, streaming subscriptions, 5. **irpc is one dispatch backend**: Local operations dispatch directly. irpc service calls (in-process, type-safe) are internal. The call protocol is the external interface. 6. **Local dispatch only**: The operation registry dispatches to local handlers. Remote dispatch (federation, head/worker routing) would be a separate mechanism at a different layer, not a modification to alknet-call's path format. 7. **No secret material on the wire**: The call protocol carries no private keys, API keys, mnemonics, or decrypted credentials. Handlers receive outbound credentials through `OperationContext.capabilities`, injected at the assembly layer. See ADR-014. -8. **Abort cascades to descendants**: `call.aborted` for a parent request cascades to all non-terminal descendants. Default `abort-dependents`; `continue-running` opt-in. See OQ-17. +8. **Abort cascades to descendants**: `call.aborted` for a parent request cascades to all non-terminal descendants. Default `abort-dependents`; `continue-running` opt-in. See ADR-016. 9. **Internal calls switch authority context, not skip ACL**: The `internal` flag marks composition-originated calls. ACL runs against the handler's identity, not the caller's and not as a blanket skip. Operations have External/Internal visibility. 
Scoped composition env bounds reachability. See ADR-015. \ No newline at end of file diff --git a/docs/architecture/crates/call/call-protocol.md b/docs/architecture/crates/call/call-protocol.md index efc1bb3..36af485 100644 --- a/docs/architecture/crates/call/call-protocol.md +++ b/docs/architecture/crates/call/call-protocol.md @@ -273,13 +273,13 @@ Local dispatch produces `ResponseEnvelope` with no serialization overhead. The ` ### Abort Cascade and Nested Calls -When a handler composes other operations via `OperationEnv::invoke()`, it creates a call tree: a parent request (r1) spawns children (r1-a, r1-b), which may spawn their own children. The `parent_request_id` field on `OperationContext` records this tree. +When a handler composes other operations via `OperationEnv::invoke()`, it creates a call tree: a parent request (r1) spawns children (r1-a, r1-b), which may spawn their own children. The `parent_request_id` field on `OperationContext` records this tree — it is the agency chain (ADR-015). -When `call.aborted` arrives for a parent request, the protocol cascades the abort to all non-terminal descendants in the tree. The default policy is **`abort-dependents`**: aborting a request aborts everything downstream, regardless of branch. This is the correct default because aborted parent work has no consumer waiting for results — continuing is wasted work at best and unwanted side effects at worst (e.g., a `bash/exec` that keeps running after the caller stopped caring). +When `call.aborted` arrives for a parent request, the protocol cascades the abort to all non-terminal descendants in the tree. The CallAdapter walks the tree (indexed by `parent_request_id` in `PendingRequestMap`) and sends `call.aborted` for each descendant. The default policy is **`abort-dependents`**: aborting a request aborts everything downstream, regardless of branch. 
This is the correct default because aborted parent work has no consumer waiting for results — continuing is wasted work at best and unwanted side effects at worst (e.g., a `bash/exec` that keeps running after the caller stopped caring). -An opt-in **`continue-running`** policy is available for cases where long-running work should survive a parent's abort (e.g., a subscription that should keep streaming). The caller or handler specifies the policy at call time. +An opt-in **`continue-running`** policy is available for cases where long-running work should survive a parent's abort (e.g., a subscription that should keep streaming). Under `continue-running`, descendants that have already started continue to completion; descendants that haven't started yet are aborted; no new descendants start. -The one-way door is the protocol event schema: `call.aborted` must carry cascade semantics before implementation, because retrofitting cascade onto a non-cascading abort is a breaking protocol change. The mechanism — how the runtime discovers descendants and propagates cancellation (cancellation tokens, parent-indexed map, or a separate graph structure) — is a two-way door for implementation. See OQ-17. +Handlers clean up resources when their call is cancelled (in Rust, the future is dropped and `Drop` guards release resources — HTTP streams, file handles, locks). This is a handler-level concern; the protocol's job is to cascade the abort. See ADR-016. ## Constraints @@ -289,7 +289,7 @@ The one-way door is the protocol event schema: `call.aborted` must carry cascade - The call protocol is transport-agnostic at the envelope level. The `EventEnvelope` framing can run over QUIC streams, WebSocket frames, or Worker `postMessage`. The `CallAdapter` is the QUIC-specific implementation. - `OperationEnv::invoke()` dispatches through the local registry. Remote dispatch (federation, head/worker routing) would be a separate mechanism at a different layer. See ADR-005 and OQ-13. 
- **The call protocol carries no secret material.** Secret material (private keys, API keys, mnemonics, decrypted credentials, raw tokens) must not appear in `call.requested` payloads, `call.responded` payloads, or `OperationContext.metadata`. The wire format carries `serde_json::Value` and cannot enforce this at the type level — the constraint is architectural, enforced by the operation registry and by convention. Operations that need to share public key material use a dedicated operation that returns only the public component. See ADR-014. -- **Abort cascades to descendants.** `call.aborted` for a parent request cascades to all non-terminal descendants in the call tree. Default policy is `abort-dependents`; `continue-running` is an opt-in. See OQ-17. +- **Abort cascades to descendants.** `call.aborted` for a parent request cascades to all non-terminal descendants in the call tree. Default policy is `abort-dependents`; `continue-running` is an opt-in. See ADR-016. ## Design Decisions @@ -302,6 +302,7 @@ The one-way door is the protocol event schema: `call.aborted` must carry cascade | Vault integration point | [ADR-008](../../decisions/008-secret-service-integration.md) | Vault is a capability source, accessed at assembly time | | Secret material flow | [ADR-014](../../decisions/014-secret-material-flow-and-capability-injection.md) | Call protocol carries no secret material; capabilities injected at assembly layer | | Privilege model and authority context | [ADR-015](../../decisions/015-privilege-model-and-authority-context.md) | `internal` = authority switch not ACL skip; External/Internal visibility; handler identity + scoped env | +| Abort cascade for nested calls | [ADR-016](../../decisions/016-abort-cascade-for-nested-calls.md) | `call.aborted` cascades to descendants; default `abort-dependents`, `continue-running` opt-in | ## Open Questions @@ -311,7 +312,6 @@ See [open-questions.md](../../open-questions.md) for full details. 
- **OQ-14** (resolved): Batch is a client-side pattern of correlated `call.requested` events, not a protocol primitive. - **OQ-15** (open): Call protocol client and adapter contract. ADR-014 constrains the adapter contract: adapters take credential sources from the assembly layer, not static tokens. ADR-015 constrains: adapter-registered operations are `Internal` by default. - **OQ-16** (resolved by ADR-014): No vault operations are exposed over the call protocol for now. -- **OQ-17** (open): Abort cascade semantics — `call.aborted` cascades to descendants, default `abort-dependents`, `continue-running` opt-in. One-way door on the event schema; mechanism is a two-way door. - **OQ-19** (open): Session-scoped operation registries — agent-written operations overlaid on global registry via `OperationEnv` trait layering. Protocol doesn't need changes. ## References diff --git a/docs/architecture/crates/call/operation-registry.md b/docs/architecture/crates/call/operation-registry.md index 91f7695..e3f6cfa 100644 --- a/docs/architecture/crates/call/operation-registry.md +++ b/docs/architecture/crates/call/operation-registry.md @@ -323,7 +323,6 @@ See [open-questions.md](../../open-questions.md) for full details. - **OQ-14** (resolved): Batch is a client-side pattern of correlated `call.requested` events, not a protocol primitive. - **OQ-15** (open): Call protocol client and adapter contract. ADR-014 constrains the adapter contract: adapters take credential sources from the assembly layer, not static tokens. ADR-015 constrains: adapter-registered operations are `Internal` by default. - **OQ-16** (resolved by ADR-014): No vault operations are exposed over the call protocol for now. -- **OQ-17** (open): Abort cascade semantics — `call.aborted` cascades to descendants, default `abort-dependents`, `continue-running` opt-in. One-way door on the event schema; mechanism is a two-way door. 
- **OQ-19** (open): Session-scoped operation registries — agent-written operations overlaid on the global registry via `OperationEnv` trait layering. Protocol doesn't need changes; one-way door is not closing the trait-based composition point. ## References diff --git a/docs/architecture/decisions/016-abort-cascade-for-nested-calls.md b/docs/architecture/decisions/016-abort-cascade-for-nested-calls.md new file mode 100644 index 0000000..37d2ac4 --- /dev/null +++ b/docs/architecture/decisions/016-abort-cascade-for-nested-calls.md @@ -0,0 +1,195 @@ +# ADR-016: Abort Cascade for Nested Calls + +## Status + +Accepted + +## Context + +The call protocol allows handlers to compose other operations through +`OperationEnv::invoke()`. This creates a call tree: a parent request spawns +children (via `parent_request_id`), which may spawn their own children. The +tree is the agency chain (ADR-015) — principal delegates to agent, agent may +delegate to sub-agent. + +When `call.aborted` arrives for a parent request, the current `PendingRequestMap` +removes only that single entry. The children are unaware — they continue running, +consuming resources, and potentially producing side effects. This is the nested +abort problem: + +``` +Client calls /agent/chat (r1) + agent handler calls /fs/readFile via env.invoke (r1-a) + fs handler calls /db/query via env.invoke (r1-a-1) + agent handler calls /bash/exec via env.invoke (r1-b) + +Client aborts r1 (call.aborted { id: "r1" }) + → r1 removed from PendingRequestMap + → r1-a, r1-a-1, r1-b continue running (ghost work) + → bash/exec keeps executing (unwanted side effect) + → db/query keeps running (wasted resources) + → results produced that nobody consumes +``` + +The `@alkdev/flowgraph` TypeScript package solved this with a directed graph +that tracks the call tree and a `FailurePolicy` enum: + +- `"abort-dependents"`: aborting a node cascades to all non-terminal descendants. + This is the "whole tree should abort" behavior. 
+- `"continue-running"`: only idle/waiting dependents are aborted; started ones + keep going. New ones don't start because their predecessors failed/aborted. + +The agent use case makes this concrete and urgent: an LLM composes deep, dynamic +call trees (parallel tools, sequential tools, sub-agents calling sub-tools). +Aborting a chat should tear down the entire tree — the LLM HTTP stream, all tool +calls, all sub-calls. But this is a protocol-level concern, not an agent feature: +every consumer (NAPI adapter, Python adapter, any service speaking EventEnvelope) +inherits whatever abort model the protocol defines. The call protocol is a +general-purpose cross-boundary RPC mechanism; nested composition is a core +protocol feature, and abort semantics for that composition are protocol semantics. + +## Decision + +### 1. `call.aborted` cascades to descendants + +When `call.aborted` arrives for a request, the protocol cascades the abort to +all non-terminal descendants in the call tree (identified via `parent_request_id`). +Each descendant receives a `call.aborted` event. The `PendingRequestMap` removes +all affected entries. + +The cascade is protocol-level: the event schema carries cascade semantics. A +`call.aborted` for a parent implies abort of all descendants. This is not a +client-side convention — the server (CallAdapter) is responsible for discovering +descendants and propagating the abort. + +### 2. Default policy: `abort-dependents` + +The default policy is `abort-dependents`: aborting a request aborts everything +downstream, regardless of branch. This is the correct default because aborted +parent work has no consumer waiting for results — continuing is wasted work at +best and unwanted side effects at worst (e.g., a `bash/exec` that keeps running +after the caller stopped caring, a DB mutation that completes after the +transaction was aborted). + +### 3. 
Opt-in policy: `continue-running` + +An opt-in `continue-running` policy is available for cases where long-running +work should survive a parent's abort. Under `continue-running`: +- Descendants that have already started (status: running) continue to completion. +- Descendants that haven't started yet (status: pending/waiting) are aborted + (their predecessors failed, so they can't proceed). +- No new descendants start (the parent is gone). + +Use cases for `continue-running`: a long-running subscription that should keep +streaming after its parent's sibling failed; a background task that was spawned +by a handler and should survive the handler's abort. + +The caller or handler specifies the policy at call time. The specific mechanism +(a field in the `call.requested` payload, a field on `OperationContext`, or a +per-operation declaration) is a two-way door for implementation. + +### 4. Cleanup hooks + +When a call is aborted, handlers need a mechanism to clean up resources: cancel +an HTTP stream, cancel a honker queue job, close a file handle, release a lock. +The protocol provides this through the call lifecycle — when a call is aborted, +the handler's task is cancelled (in Rust, the future is dropped). Cleanup is +handled by `Drop` implementations on resource guards, or by explicit +cancellation callbacks if the handler registers them. + +This is a handler-level concern, not a protocol-level one. The protocol's job is +to cascade the abort; the handler's job is to clean up when cancelled. The +mechanism (tokio `CancellationToken`, `Drop` guards, explicit callbacks) is a +two-way door for implementation. + +### 5. The call tree is tracked via `parent_request_id` + +The call tree is already recorded: `OperationContext.parent_request_id` links +each call to its parent. The cascade mechanism walks this tree to find +descendants. 
No separate graph structure is required at the protocol level — +the `PendingRequestMap` can index entries by `parent_request_id` to enable +efficient descendant lookup. + +The `@alkdev/flowgraph` package (directed graph with `descendants()`, +reactive status propagation, `FailurePolicy`) is prior art and may be adapted +as a separate Rust crate for consumers that need richer call-tree visualization +or reactive status tracking. It is not required for the protocol-level cascade +— a parent-indexed map suffices. + +## Consequences + +**Positive:** +- No ghost work. Aborting a parent call tears down the entire tree. Resources + are released, side effects are halted, no results are produced for absent + consumers. +- The default (`abort-dependents`) matches the intuitive expectation: if I + stop caring about the parent, I stop caring about everything it spawned. +- The opt-in (`continue-running`) covers the legitimate exception (long-running + work that should survive) without making it the default. +- The protocol carries cascade semantics, so every consumer inherits the + correct behavior — no consumer needs to implement its own abort propagation. +- The `parent_request_id` chain already exists; the cascade mechanism is an + index on it, not a new data structure. +- Cleanup hooks are handled by Rust's async drop semantics — dropping the + handler's future cancels it, and `Drop` guards release resources. This is + idiomatic Rust, not a custom mechanism. + +**Negative:** +- The `PendingRequestMap` needs a parent-indexed lookup (a `HashMap` from parent_request_id to child request_ids, or a scan). This + is a minor implementation cost, not a protocol change. +- The `call.aborted` event schema carries cascade semantics — clients that + don't understand cascade (future versions, other implementations) would + need to handle it. 
Mitigated: cascade is server-side (the CallAdapter walks + the tree and sends `call.aborted` per descendant), so clients see individual + abort events regardless of whether they understand the cascade concept. +- The `continue-running` policy adds a parameter to the call lifecycle. The + specific location (payload field, context field, per-operation declaration) + is a two-way door, but the existence of the policy is a one-way commitment. + +## Assumptions + +1. **Aborting a parent should abort descendants by default.** If the default + should be `continue-running` (descendants survive), this ADR is wrong. The + assumption is that ghost work is worse than premature cancellation — a + cancelled descendant can be retried, but a ghost process consuming + resources and producing unwanted side effects is harder to recover from. + +2. **The server (CallAdapter) is responsible for cascade.** The client sends + `call.aborted` for one request ID; the server discovers descendants and + propagates. If the client were responsible for cascading, it would need to + know the full tree — which it may not (server-side composition creates + children the client never saw). + +3. **`parent_request_id` is sufficient to discover descendants.** The call tree + is a tree (acyclic, single parent per node). If future composition patterns + create multi-parent relationships (e.g., a shared subcall invoked by two + parents), the cascade model needs extension. The assumption is that + composition creates a tree, not a DAG. + +4. **Dropping the handler's future is sufficient for cleanup.** Rust's async + drop semantics cancel the future and run `Drop` guards. If a use case + requires explicit cleanup callbacks (e.g., external systems that need a + signal), the mechanism needs extension. The assumption is that `Drop` + guards cover the common cases (HTTP stream cancellation, file handle + release, lock release). + +5. 
**`continue-running` is per-call, not per-operation.** The policy is + specified at call time, not declared at registration. If the policy should + be a static property of the operation (declared in `OperationSpec`), the + model changes. The assumption is that the caller or handler decides at call + time based on the specific context. + +## References + +- ADR-012: Call protocol stream model (bidirectional streams, EventEnvelope, + ID-based correlation) +- ADR-015: Privilege model (the call tree is the agency chain — + `parent_request_id` traces principal → agent) +- OQ-17: Abort cascade semantics (resolved by this ADR) +- OQ-19: Session-scoped registries (session-scoped operations are in the call + tree and participate in cascade) +- `@alkdev/flowgraph` TypeScript package — prior art for call-graph tracking + with `descendants()`, `FailurePolicy`, reactive status propagation +- [call-protocol.md](../crates/call/call-protocol.md) +- [operation-registry.md](../crates/call/operation-registry.md) \ No newline at end of file diff --git a/docs/architecture/open-questions.md b/docs/architecture/open-questions.md index abbecc9..ea52c4a 100644 --- a/docs/architecture/open-questions.md +++ b/docs/architecture/open-questions.md @@ -186,17 +186,11 @@ These questions are acknowledged but not active. They will be promoted to open w ### OQ-17: Abort Cascade Semantics for Nested Calls - **Origin**: [call-protocol.md](crates/call/call-protocol.md), [operation-registry.md](crates/call/operation-registry.md) -- **Status**: open +- **Status**: resolved - **Door type**: One-way (protocol schema), two-way (mechanism) - **Priority**: high -- **Resolution**: When a handler composes other operations via `OperationEnv::invoke()`, it creates a call tree (parent → children via `parent_request_id`). When `call.aborted` arrives for a parent request, the protocol cascades the abort to all non-terminal descendants in the tree. 
The default policy is `abort-dependents`: aborting a request aborts everything downstream, regardless of branch. This is the correct default because aborted parent work has no consumer waiting for results — continuing is wasted work at best and unwanted side effects at worst (e.g., a `bash/exec` that keeps running after the caller stopped caring). An opt-in `continue-running` policy is available for cases where long-running work should survive a parent's abort (e.g., a subscription that should keep streaming). - - The one-way door is the protocol event schema: `call.aborted` must carry cascade semantics before implementation, because retrofitting cascade onto a non-cascading abort is a breaking protocol change (existing clients send `call.aborted` for one ID, the server processes one ID). The mechanism — how the runtime discovers descendants and propagates cancellation (cancellation tokens propagated through `OperationContext`, a parent-indexed map in `PendingRequestMap`, or a separate graph structure consuming call events) — is a two-way door for implementation. The `@alkdev/flowgraph` TypeScript package demonstrates a reactive call-graph approach (directed graph with `descendants()`, `FailurePolicy: "abort-dependents" | "continue-running"`, signal-based status propagation); a Rust adaptation could use `petgraph` for the graph structure or tokio `CancellationToken` for a simpler implicit tree. The flowgraph may live as a separate crate consuming call events (as the TS version does), not necessarily inside alknet-call. - - This is a protocol-level concern, not specific to any single consumer. The call protocol is a general-purpose cross-boundary RPC mechanism — every consumer (NAPI adapter, Python adapter, agent service, future services) inherits whatever abort model is locked in. Nested composition is a core protocol feature, not an agent feature. 
The agent use case makes the deep/dynamic call tree case concrete, but the abort cascade problem exists for any handler that composes other operations. - - This OQ will be resolved with an ADR before alknet-call implementation begins. -- **Cross-references**: ADR-012, [call-protocol.md](crates/call/call-protocol.md), [operation-registry.md](crates/call/operation-registry.md) +- **Resolution**: `call.aborted` cascades to all non-terminal descendants in the call tree. The CallAdapter walks the tree (indexed by `parent_request_id` in `PendingRequestMap`) and sends `call.aborted` for each descendant. Default policy is `abort-dependents` (abort everything downstream); `continue-running` is an opt-in for long-running work that should survive a parent's abort. Handlers clean up via Rust's async drop semantics (future dropped → `Drop` guards release resources). The cascade is protocol-level (server discovers descendants and propagates); the mechanism (parent-indexed map, cancellation tokens, or a separate graph) is a two-way door. See ADR-016. +- **Cross-references**: ADR-012, ADR-015, ADR-016, [call-protocol.md](crates/call/call-protocol.md), [operation-registry.md](crates/call/operation-registry.md) ### OQ-18: Privilege Model and Authority Context diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md index 84420b5..b376d2c 100644 --- a/docs/architecture/overview.md +++ b/docs/architecture/overview.md @@ -205,6 +205,7 @@ All design decisions are documented as ADRs in [decisions/](decisions/). 
| [013](decisions/013-rust-canonical-implementation.md) | Rust as Canonical Implementation Language | Rust canonical, TypeScript reference adaptation | | [014](decisions/014-secret-material-flow-and-capability-injection.md) | Secret Material Flow and Capability Injection | Capabilities carry outbound credentials; call protocol carries no secret material | | [015](decisions/015-privilege-model-and-authority-context.md) | Privilege Model and Authority Context | `internal` = authority switch not ACL skip; External/Internal visibility; handler identity + scoped env | +| [016](decisions/016-abort-cascade-for-nested-calls.md) | Abort Cascade for Nested Calls | `call.aborted` cascades to descendants; default `abort-dependents`, `continue-running` opt-in | ## Open Questions