docs(arch): ADR-029 peer-graph routing model — supersedes ADR-028

ADR-028's remote_safe/trusted_peer was a parallel, weaker authorization
system that duplicated the existing AccessControl/Identity machinery and
couldn't express the head→N-workers pattern (the primary use case). The
flat-namespace single-peer overlay model (one connection layer in
CompositeOperationEnv) structurally breaks the moment a head has two
workers both exposing /container/exec.

ADR-029 replaces it with:

- Peer-keyed overlays: PeerCompositeEnv { connections: HashMap<PeerId, ...> }
  replaces CompositeOperationEnv's singular connection layer. A head node
  routes invoke_peer() to the right peer via PeerRef::Specific /
  PeerRef::Any.
- AccessControl-based peer authorization: the existing
  AccessControl::check(peer_identity) gates peer calls, the same mechanism
  that gates every other call. remote_safe, trusted_peer, RemoteFilter,
  list_operations_peer_scoped, and services_list_handler_peer_scoped are
  retired. The op's AccessControl IS the peer-authorization policy; no
  parallel system.
- ScopedPeerEnv: peer-qualified reachability (peer-pinned allowlist)
  replaces from_call's namespace_prefix as the disambiguation mechanism.
  Cross-peer collision dissolves (separate sub-overlays); same-peer
  collision stays an error.
- services/list-peers opt-in for peer-attributed re-export listing.

POC-validated against real types (scratch module written, type-checked,
removed; build clean, 207 tests pass). Petgraph not needed for v1 (one-hop,
shallow); nested HashMap suffices; extends to multi-hop without redesign
(OQ-32).

OQ impact: OQ-25 dissolved (no marking); OQ-28 cross-peer dissolved /
same-peer stays; OQ-26/27/29 stay; new OQ-30 (Any routing policy), OQ-31
(list-peers semantics), OQ-32 (multi-hop federation).

Research: docs/research/alknet-call-peer-routing/findings.md (POC shapes;
prior art: Ray.io actors, Dapr service invocation; full ADR draft).

ADR-028 marked Superseded; ADR-017 DC-1 amendment updated to point at
ADR-029.
2026-06-27 06:04:19 +00:00
parent f9c0ab092b
commit 77eb35a8a5
10 changed files with 1379 additions and 156 deletions
--- a/docs/architecture/decisions/017-call-protocol-client-and-adapter-contract.md
+++ b/docs/architecture/decisions/017-call-protocol-client-and-adapter-contract.md
@@ -360,19 +360,20 @@ noted re-import hot-swap is a two-way door; §3 mentioned the namespace prefix).
 The call-completion gap analysis (`docs/research/alknet-call-completion/gap-analysis.md`
 DC-1..4) resolved them. The resolutions:

-### DC-1 — CallClient registry scope: resolved by ADR-028
+### DC-1 — CallClient registry scope: resolved by ADR-028, superseded by ADR-029

-The §1 Consequences security dimension is resolved by
-[ADR-028](028-callclient-peer-scoped-registry-filtering.md). The one-way
-door (existence of peer-scoped filtering as the v1 default) is locked:
-**default-deny**, with a `remote_safe: bool` on `HandlerRegistration`
-v1 shape and a trusted-peer opt-in. The shape of the marking is the
-two-way-door remainder, tracked as OQ-25. This ADR's §1 text ("It has its own
-operation registry to dispatch incoming calls from the remote side") and
-the Consequences note ("The specific mechanism … is a two-way door") are
-superseded by ADR-028's decision that the *default* is filtered, not
-shared-global. Share-global remains available as the explicit opt-in
-(ADR-028 §3).
+The §1 Consequences security dimension was originally resolved by ADR-028
+(default-deny `remote_safe: bool` + `trusted_peer` opt-in). **ADR-028 is now
+superseded by [ADR-029](029-peer-graph-routing-model.md)** (2026-06-27):
+the flat-namespace single-peer model ADR-028 built on cannot express the
+head→N-workers pattern, and the `remote_safe`/`trusted_peer` gate duplicates
+the existing `AccessControl`/`Identity` machinery while reintroducing the
+blanket-bypass anti-pattern ADR-015 killed. ADR-029 replaces the flat overlay
+with peer-keyed overlays + `PeerRef` routing, and retires `remote_safe`/
+`trusted_peer` in favor of `AccessControl::check(peer_identity)`, the
+existing authorization check already wired into the dispatch path. The
+peer-scoping question this section flagged is now answered structurally
+(peer-keyed overlays), not by a parallel boolean gate.

 ### DC-4 — OperationAdapter trait error type: resolved

--- a/docs/architecture/decisions/028-callclient-peer-scoped-registry-filtering.md
+++ b/docs/architecture/decisions/028-callclient-peer-scoped-registry-filtering.md
@@ -2,7 +2,20 @@

 ## Status

-Accepted
+**Superseded** by [ADR-029](029-peer-graph-routing-model.md) (2026-06-27).
+
+ADR-028 introduced `remote_safe: bool` and `trusted_peer: bool` as a parallel
+authorization system for peer-scoped dispatch. This was a structural miss: the
+flat-namespace single-peer model it built on cannot express the head→N-workers
+pattern (the primary use case), and the parallel `remote_safe`/`trusted_peer`
+gate duplicates the existing `AccessControl`/`Identity` machinery (which
+already authorizes peer calls) while reintroducing the blanket-bypass
+anti-pattern ADR-015 was written to kill. ADR-029 replaces the flat overlay
+with peer-keyed overlays + `PeerRef` routing, and retires `remote_safe`/
+`trusted_peer` in favor of the existing `AccessControl::check(peer_identity)`.
+See ADR-029 for the design that replaces this one; see
+`docs/research/alknet-call-peer-routing/findings.md` for the research that
+identified the gap.

 ## Context

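The Status note above retires `remote_safe`/`trusted_peer` in favor of `AccessControl::check(peer_identity)`. As a rough illustration of the decision order that ADR-029 §3 describes (an internal op answers `NOT_FOUND` before any ACL check; otherwise the scopes AND-gate decides between dispatch and `FORBIDDEN`), here is a minimal sketch. It is not the alknet implementation: `Outcome`, `Op`, `Visibility::Public`, `authorize_peer_call`, and the field layouts are hypothetical stand-ins, and only the names `AccessControl`, `required_scopes`, `Identity` scopes, and `Visibility::Internal` come from the ADR text.

```rust
use std::collections::HashSet;

/// Hypothetical result type; the ADR speaks of dispatch, FORBIDDEN, NOT_FOUND.
#[derive(Debug, PartialEq)]
enum Outcome {
    Dispatch,
    Forbidden,
    NotFound,
}

/// `Internal` is named in the ADR; `Public` is an assumed counterpart.
enum Visibility {
    Public,
    Internal,
}

/// Assumed shape: an empty scope list means "no restrictions".
#[derive(Default)]
struct AccessControl {
    required_scopes: Vec<String>,
}

/// Assumed shape: the peer's resolved identity, reduced to its scopes.
struct Identity {
    scopes: HashSet<String>,
}

/// Hypothetical registration record for one operation.
struct Op {
    visibility: Visibility,
    access: AccessControl,
}

impl AccessControl {
    /// AND-gate: every required scope must be present on the identity.
    fn check(&self, who: &Identity) -> bool {
        self.required_scopes.iter().all(|s| who.scopes.contains(s))
    }
}

/// One gate for exposure and authorization, in the order the ADR gives.
fn authorize_peer_call(op: &Op, peer: &Identity) -> Outcome {
    match op.visibility {
        // Never callable from the wire: answered before the ACL runs.
        Visibility::Internal => Outcome::NotFound,
        Visibility::Public if op.access.check(peer) => Outcome::Dispatch,
        Visibility::Public => Outcome::Forbidden,
    }
}

fn main() {
    let worker = Identity {
        scopes: vec!["container:exec".to_string()].into_iter().collect(),
    };
    let open = Op { visibility: Visibility::Public, access: AccessControl::default() };
    let admin_only = Op {
        visibility: Visibility::Public,
        access: AccessControl { required_scopes: vec!["admin".to_string()] },
    };
    let internal = Op { visibility: Visibility::Internal, access: AccessControl::default() };

    println!("{:?}", authorize_peer_call(&open, &worker));
    println!("{:?}", authorize_peer_call(&admin_only, &worker));
    println!("{:?}", authorize_peer_call(&internal, &worker));
}
```

The point of the sketch is that there is no second flag to consult: whether a peer may call an op is entirely a function of the op's visibility, its required scopes, and the scopes on the peer's identity.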
--- a/docs/architecture/decisions/029-peer-graph-routing-model.md
+++ b/docs/architecture/decisions/029-peer-graph-routing-model.md
@@ -0,0 +1,293 @@
+# ADR-029: Peer-Graph Routing Model for alknet-call Composition
+
+## Status
+
+Proposed (supersedes ADR-028)
+
+## Context
+
+The call protocol's composition model is **flat per overlay and single-peer**.
+`CompositeOperationEnv` holds one `connection: Option<Arc<dyn OperationEnv>>`
+overlay; the Layer 2 imported-ops overlay on `CallConnection` is a flat
+`HashMap<String, HandlerRegistration>` keyed by operation name. This works for
+one remote peer. The head→many-workers / hub→spoke pattern (the ray.io model,
+and the primary downstream use case — the container-service rewrite this
+completion was supposed to unblock) cannot be expressed:
+
+1. **Overlay collision.** A head importing from worker A and worker B, both
+   exposing `/container/exec`, has no way to route
+   `invoke("container", "exec")` to the right peer. The composite env holds
+   one connection overlay; even with two, `contains("container/exec")` is
+   true for both with no disambiguation.
+
+2. **`from_call` namespace prefix is a naming-convention hack.** DC-3 / OQ-28
+   made `FromCallConfig::namespace_prefix` the disambiguation mechanism — the
+   operator prefixes imported op names so two peers' ops don't collide in a
+   flat map. This pushes disambiguation to the caller and into the
+   `ScopedOperationEnv { allowed: HashSet<String> }` reachability list. It is
+   bolted onto a flat map instead of being structural routing.
+
+3. **ADR-028's `remote_safe: bool` + `trusted_peer: bool` is a second,
+   parallel, weaker authorization system.** ADR-028 introduced a
+   `RemoteFilter { trusted_peer: bool }` gate in `protocol/dispatch.rs` that
+   runs *before* the existing `AccessControl::check`.
+   `trusted_peer: true` is a blanket security-bypass flag — the exact
+   anti-pattern ADR-015 was written to kill (it replaced `trusted: true` with
+   the authority-switch model). ADR-028 reintroduced it at the peer boundary.
+   The existing authorization machinery in core (`Identity` with scopes and
+   resources, `IdentityProvider`, `AccessControl::check`) is real, grounded,
+   and already wired into the dispatch path — ADR-028 should have *used* it for
+   peer authorization, not invented a parallel system.
+
+This is a blocking structural fix, not a "v1/later" refinement. The research
+at `docs/research/alknet-call-peer-routing/findings.md` validates the design
+through a POC that type-checks against the real types (since removed; the
+shapes are recorded in the research doc). ADR-028 is superseded by this ADR.
+
+## Decision
+
+### 1. Peer-keyed overlays
+
+The Layer 2 overlay becomes peer-keyed at the composition-env level.
+`CompositeOperationEnv`'s singular `connection: Option<Arc<dyn OperationEnv>>`
+is replaced by `PeerCompositeEnv` with peer-keyed connections:
+
+```rust
+pub struct PeerCompositeEnv {
+    pub base: Arc<dyn OperationEnv + Send + Sync>,       // Layer 0 curated
+    pub session: Option<Arc<dyn OperationEnv + Send + Sync>>,  // Layer 1
+    pub connections: HashMap<PeerId, Arc<dyn OperationEnv + Send + Sync>>,  // Layer 2, peer-keyed
+    connection_order: Vec<PeerId>,  // insertion order for PeerRef::Any first-match
+}
+```
+
+The per-`CallConnection` overlay stays flat (one connection = one peer — a
+flat `HashMap<String, HandlerRegistration>` per connection is correct). The
+peer-keying is at the *aggregation* layer: the head node's composition env
+holds a `HashMap<PeerId, connection_overlay>`, not one overlay. `PeerId` is
+the peer's `Identity.id` — the same field `Connection::identity()` already
+exposes, already resolved in the dispatch path, and already unique per peer.
+
+### 2. `PeerRef` routing selector
+
+`OperationEnv` gains a peer-routing method with a `PeerRef` selector. Its
+default impl ignores the selector and delegates to `invoke_with_policy`, so
+existing impls that don't override it keep their current behavior:
+
+```rust
+pub enum PeerRef {
+    Specific(PeerId),  // route to this peer; NOT_FOUND if it doesn't serve the op
+    Any,               // first peer (insertion order) that serves it
+}
+pub type PeerId = String;  // = Identity.id
+// Default methods added to the `OperationEnv` trait:
+async fn invoke_peer(&self, peer: &PeerRef, namespace: &str, operation: &str,
+    input: Value, parent: &OperationContext, policy: AbortPolicy) -> ResponseEnvelope {
+    // default: ignore peer selector, dispatch via invoke_with_policy
+    self.invoke_with_policy(namespace, operation, input, parent, policy).await
+}
+fn peer_contains(&self, _peer: &PeerId, name: &str) -> bool { self.contains(name) }
+```
+
+`PeerRef::Specific(PeerId)` routes to the named peer's overlay; if that peer
+doesn't serve the op, `NOT_FOUND` (no silent fallthrough — explicit routing
+must be honored or fail loudly). `PeerRef::Any` routes to the first peer
+(insertion order) whose overlay contains the op — the "any worker that serves
+this name" fan-out primitive. A richer `RoutingPolicy` (round-robin,
+least-loaded) is the two-way-door remainder tracked as OQ-30; the `PeerRef`
+enum is designed to compose with it without breaking the signature.
+
+The existing `invoke()` / `invoke_with_policy()` methods stay as the
+`PeerRef::Any` equivalent for code that doesn't care about peer selection.
+
+### 3. `AccessControl`-based peer authorization; retire `remote_safe`/`trusted_peer`
+
+`RemoteFilter`, `HandlerRegistration::remote_safe`,
+`CallClient::trusted_peer`, `OperationRegistry::list_operations_peer_scoped`,
+and `services_list_handler_peer_scoped` are **removed**. Peer authorization
+flows through the existing `AccessControl::check` against the peer's resolved
+`Identity`:
+
+- A remote peer's call arrives → `dispatch_requested` resolves the peer's
+  `Identity` (already does, from the connection's TLS fingerprint or the
+  `auth_token` payload) → `OperationRegistry::invoke` runs
+  `AccessControl::check(peer_identity)`.
+- If the op's `AccessControl` is satisfied → dispatch (capabilities populated
+  from the bundle, same as today).
+- If not → `FORBIDDEN` (capabilities never populated — the security property
+  ADR-028 wanted, achieved by the existing ACL, not a parallel gate).
+- If the op is `Visibility::Internal` → `NOT_FOUND` before ACL (existing
+  behavior). This is the "never callable from wire" case.
+
+The three cases `remote_safe` was meant to handle map to existing mechanisms:
+
+| `remote_safe` case | Replacement |
+|---|---|
+| Op callable by any peer (was `remote_safe: true`) | `AccessControl::default()` — no restrictions; implicitly "remote-safe" because it requires no privileged scope. |
+| Op callable only by some peers | `AccessControl { required_scopes: [...] }` — only peers whose `Identity.scopes` satisfy the AND-gate may call. Per-peer differentiation via `IdentityProvider` config. |
+| Op never callable from wire | `Visibility::Internal` — `NOT_FOUND` before ACL. Existing mechanism, unchanged. |
+
+**The op's `AccessControl` *is* the peer-authorization policy.** There is no
+separate exposure decision. If the peer's `Identity` satisfies the op's
+`AccessControl`, the op dispatches and capabilities populate (same as for any
+authorized caller). If not, `FORBIDDEN` before the handler — capabilities
+never populate. The exposure decision and the authorization decision are the
+same decision, made through one mechanism, not two.
+
+### 4. Peer-qualified reachability (`ScopedPeerEnv`)
+
+`ScopedOperationEnv { allowed: HashSet<String> }` is extended with an optional
+peer-pinned allowlist. Unqualified reachability (peer-agnostic composition —
+"I want to call `container/exec` on whichever worker serves it") stays the
+common case; peer-pinning is opt-in for the disambiguation case that replaces
+`FromCallConfig::namespace_prefix`:
+
+```rust
+pub struct ScopedPeerEnv {
+    pub allowed_ops: HashSet<String>,    // peer-agnostic — reachable via PeerRef::Any
+    pub peer_pinned: HashSet<String>,    // "peer-id/op-name" — reachable only via PeerRef::Specific(that peer)
+}
+```
+
+Instead of prefixing the *op name* (the flat-namespace hack), you pin the
+*peer* in the reachability set. The existing `ScopedOperationEnv.allowed`
+becomes the `allowed_ops` field; peer-pinning is additive.
+
+### 5. `from_call` peer-keyed registration; collision rule change
+
+`from_call` registers into the specific peer's sub-overlay, not a flat
+overlay. Cross-peer collision dissolves: same name on different peers is fine
+(separate sub-overlays, no collision, no prefix needed). Same-peer collision
+stays an error (a peer shouldn't expose two ops with the same name).
+
+`FromCallConfig::namespace_prefix` becomes optional local-naming sugar for
+the case where the importing node wants to expose a peer's ops under a
+different name *locally* — a local-naming concern, not a disambiguation
+concern. It defaults to `None`.
+
+### 6. `services/list` `AccessControl`-filtered; `services/list-peers` opt-in
+
+`services/list` filters by `AccessControl::check(calling_peer_identity)` — the
+calling peer sees only ops it is authorized to call. The
+`services_list_handler` / `services_list_handler_peer_scoped` split collapses
+to a single `AccessControl`-filtered handler. `services/list-peers` is the
+opt-in for peer-attributed re-export listing (each peer's sub-overlay listed
+with attribution, filtered by the calling peer's authorization).
+
+## Consequences
+
+**Positive:**
+- The head→N-workers pattern works. A head with multiple worker connections
+  routes `invoke_peer()` to the right peer via `PeerRef`. This is the primary
+  use case the previous model couldn't express.
+- One authorization system, not two. Peer authorization flows through the
+  existing `AccessControl`/`Identity` machinery — the same mechanism that
+  gates every other call. No parallel `remote_safe` gate, no blanket-bypass
+  `trusted_peer` flag. Per-peer differentiation is via `IdentityProvider`
+  config (different peers get different scopes), which is a real
+  authorization decision, not a boolean.
+- Structural disconnect cleanup. When a peer disconnects, its sub-overlay
+  drops (the `PeerId` key is removed from `connections`). No stale overlay,
+  no explicit deregistration. An in-flight `PeerRef::Specific(that_peer)` gets
+  `NOT_FOUND` — the correct failure mode.
+- `from_call` collision dissolves across peers. Two workers exposing
+  `/container/exec` coexist; the prefix is no longer the disambiguation
+  mechanism.
+- The `OperationEnv` trait gains a method with a default-impl, preserving
+  back-compat. Existing impls (`LocalOperationEnv`, `OverlayOperationEnv`)
+  work unchanged; `PeerCompositeEnv` overrides with real peer routing.
+- The peer-keyed overlay model extends naturally to multi-hop federation (a
+  chain of `PeerRef::Specific` routing decisions) without redesign. Petgraph
+  is not needed for v1 (one-hop, shallow); it pays off if multi-hop
+  path-finding becomes real (OQ-32).
+
+**Negative:**
+- `CompositeOperationEnv` → `PeerCompositeEnv` is a migration. Existing call
+  sites that construct `CompositeOperationEnv::new(base, Some(conn), session)`
+  migrate to `PeerCompositeEnv::new(base).with_session(session).attach_peer(peer_id, conn)`.
+  The singular-connection case (one peer) is the degenerate case
+  (`connections` with one entry).
+- `OperationEnv` trait gains a method. The default-impl preserves back-compat,
+  but it's a trait surface change; downstream impls (`alknet-http`,
+  `alknet-agent`) gain the method with the default delegation.
+- `services/list` semantics change: the filter is `AccessControl`-based, not
+  `remote_safe`-based. An op with `AccessControl::default()` (no restrictions)
+  is now listed to any peer — this is correct (it's implicitly callable by
+  any authenticated peer), but operators who relied on `remote_safe: false` to
+  hide ops from peers must instead set `required_scopes` or `Visibility::Internal`.
+- ADR-028 is superseded. The `remote_safe` field, `trusted_peer` flag,
+  `RemoteFilter`, `list_operations_peer_scoped`, and
+  `services_list_handler_peer_scoped` are removed. Code that references them
+  (the `CallClient`, `Dispatcher`, `HandlerRegistration`, `discovery.rs`)
+  changes. This is the cost of fixing a one-way-door miss — the previous model
+  shipped and was reviewed before the structural gap was caught.
+- `PeerId = Identity.id` (the fingerprint) is not stable across key rotation.
+  A peer that rotates its TLS key gets a new `PeerId`; in-flight
+  `PeerRef::Specific(old_id)` gets `NOT_FOUND` after reconnect. For the
+  immediate use case (head→workers where the operator controls key rotation),
+  this is acceptable. A stable logical node name decoupled from cryptographic
+  identity is the cleaner long-term shape (see Assumption 1).
+
+## Assumptions
+
+1. **`PeerId = Identity.id` (the fingerprint).** Reconnects with a rotated key
+   change the `PeerId`; the peer-keyed overlay drops the old `PeerId`'s
+   sub-overlay and creates a new one. An in-flight `PeerRef::Specific(old_id)`
+   gets `NOT_FOUND`. This is acceptable for v1 (operator-controlled key
+   rotation in the head→workers pattern). A stable logical node name separate
+   from the cryptographic identity is a future question; the peer-keyed overlay
+   model accommodates it by changing what `PeerId` aliases, not by redesign.
+
+2. **`PeerRef::Any` = insertion-order first-match.** Deterministic but
+   order-dependent (worker A connects before worker B → `Any` routes to A
+   until A disconnects). This is the simplest routing policy and is correct for
+   the immediate use case (the head picks the first worker that serves the
+   op). A richer `RoutingPolicy` (round-robin, least-loaded, affinity) is OQ-30;
+   the `PeerRef` enum composes with it without breaking the signature.
+
+3. **`services/list` defaults to "own ops only" (unchanged from today).**
+   Re-exported peer ops are not listed unless the calling peer invokes
+   `services/list-peers` (the opt-in). The re-export policy (which peers' ops a
+   given peer sees) is an `AccessControl` decision on the listing op.
+
+4. **Capability exposure under `PeerRef::Any`.** When a handler composes via
+   `Any` and routing picks worker A, the handler's `Capabilities` propagate to
+   worker A's call (same as today's `from_call` forwarding). This is correct:
+   the handler declared the op in its scoped env, so it authorized the
+   composition; the peer selection is a routing detail. If a handler needs
+   per-peer capability scoping, it uses `PeerRef::Specific` and peer-pinned
+   reachability.
+
+5. **Multi-hop federation is out of scope for v1.** Worker A does not
+   transitively see worker B's ops through the head unless the head explicitly
+   re-exports them. The peer-keyed overlay model extends to multi-hop without
+   redesign (a chain of `PeerRef::Specific` decisions), but path-finding
+   (which peer reaches which op transitively) is where petgraph would pay off
+   (OQ-32, not designed).
+
+## References
+
+- ADR-015: Privilege Model and Authority Context (the authority-switch pattern
+  ADR-028 violated by reintroducing a blanket-bypass flag)
+- ADR-017: Call Protocol Client and Adapter Contract (amended: `CallClient`
+  no longer has `trusted_peer`; the client/adapter spec updates)
+- ADR-022: Handler Registration, Provenance, and Composition Authority
+  (`remote_safe` field removed from the registration bundle)
+- ADR-024: Operation Registry Layering (Layer 2 becomes peer-keyed at the
+  composition-env aggregation level)
+- ADR-028: Peer-Scoped Registry Filtering for CallClient Inbound Dispatch
+  (superseded)
+- OQ-25: dissolved (no `remote_safe` marking — `AccessControl` is the policy)
+- OQ-26: stays (`AdapterError` — a `SamePeerCollision` variant may replace
+  the flat `Conflict` variant)
+- OQ-27: stays (re-import trigger — unchanged; the overlay is now peer-scoped)
+- OQ-28: dissolved cross-peer (same name on different peers is fine); stays
+  same-peer
+- OQ-29: stays (TLS client-auth — orthogonal to the routing model)
+- OQ-30: `PeerRef::Any` routing policy (new — round-robin/least-loaded)
+- OQ-31: `services/list-peers` re-export semantics (new)
+- OQ-32: Multi-hop federation (new — petgraph candidate)
+- Research: `docs/research/alknet-call-peer-routing/findings.md`
+- Prior art: Ray.io actors (`ActorHandle` = `PeerRef::Specific`), Dapr service
+  invocation (app-ID routing = `PeerRef::Specific`, access-control allowlist =
+  `AccessControl`-based peer authorization)
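The routing rules in Decision §1 and §2 of ADR-029 (peer-keyed overlays, `PeerRef::Specific` with no fallthrough, `PeerRef::Any` as insertion-order first match, and sub-overlay removal on disconnect) can be illustrated with a small standalone sketch. This is not the POC from the research doc, which was removed. It is a simplified, synchronous model in which each peer's overlay is reduced to a set of op names. `PeerCompositeSketch`, `PeerOverlay`, `RouteError`, `route`, and `detach_peer` are hypothetical names, and `attach_peer` here is a simplified stand-in for the builder method the ADR mentions.

```rust
use std::collections::{HashMap, HashSet};

type PeerId = String; // the ADR aliases this to Identity.id

enum PeerRef {
    Specific(PeerId), // route to this peer only
    Any,              // first attached peer that serves the op
}

#[derive(Debug, PartialEq)]
enum RouteError {
    NotFound,
}

/// Stand-in for one connection's flat overlay: just its op names.
struct PeerOverlay {
    ops: HashSet<String>,
}

/// Simplified model of the Layer 2 part of `PeerCompositeEnv`.
struct PeerCompositeSketch {
    connections: HashMap<PeerId, PeerOverlay>,
    connection_order: Vec<PeerId>, // insertion order, used by PeerRef::Any
}

impl PeerCompositeSketch {
    fn new() -> Self {
        PeerCompositeSketch { connections: HashMap::new(), connection_order: Vec::new() }
    }

    /// Register (or replace) a peer's sub-overlay.
    fn attach_peer(&mut self, peer: &str, ops: &[&str]) {
        let overlay = PeerOverlay { ops: ops.iter().map(|s| s.to_string()).collect() };
        if self.connections.insert(peer.to_string(), overlay).is_none() {
            self.connection_order.push(peer.to_string());
        }
    }

    /// Disconnect: dropping the key removes the whole sub-overlay.
    fn detach_peer(&mut self, peer: &str) {
        self.connections.remove(peer);
        self.connection_order.retain(|p| p.as_str() != peer);
    }

    /// Resolve which peer would serve `op` under the given selector.
    fn route(&self, peer: &PeerRef, op: &str) -> Result<PeerId, RouteError> {
        match peer {
            // Explicit routing is honored or fails; it never tries another peer.
            PeerRef::Specific(id) => match self.connections.get(id) {
                Some(overlay) if overlay.ops.contains(op) => Ok(id.clone()),
                _ => Err(RouteError::NotFound),
            },
            PeerRef::Any => self
                .connection_order
                .iter()
                .find(|id| self.connections.get(*id).map_or(false, |o| o.ops.contains(op)))
                .cloned()
                .ok_or(RouteError::NotFound),
        }
    }
}

fn main() {
    let mut env = PeerCompositeSketch::new();
    env.attach_peer("worker-a", &["container/exec"]);
    env.attach_peer("worker-b", &["container/exec", "container/logs"]);

    // Same op name on two peers: no collision, the selector disambiguates.
    println!("{:?}", env.route(&PeerRef::Specific("worker-b".into()), "container/exec"));
    println!("{:?}", env.route(&PeerRef::Any, "container/exec"));
    // worker-a does not serve container/logs, so Specific fails instead of falling through.
    println!("{:?}", env.route(&PeerRef::Specific("worker-a".into()), "container/logs"));

    // After worker-a disconnects, Any moves on and Specific(worker-a) fails.
    env.detach_peer("worker-a");
    println!("{:?}", env.route(&PeerRef::Any, "container/exec"));
    println!("{:?}", env.route(&PeerRef::Specific("worker-a".into()), "container/exec"));
}
```

Keeping a separate `connection_order` vector alongside the map is one way to make `Any` deterministic, since `HashMap` iteration order is unspecified. A richer policy (OQ-30) would replace only the `Any` arm.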