alknet/docs/architecture/decisions/029-peer-graph-routing-model.md
glm-5.2 77eb35a8a5 docs(arch): ADR-029 peer-graph routing model — supersedes ADR-028
ADR-028's remote_safe/trusted_peer was a parallel, weaker authorization system
that duplicated the existing AccessControl/Identity machinery and couldn't
express the head→N-workers pattern (the primary use case). The flat-namespace
single-peer overlay model (one connection layer in CompositeOperationEnv)
structurally breaks the moment a head has two workers both exposing
/container/exec.

ADR-029 replaces it with:
- Peer-keyed overlays: PeerCompositeEnv { connections: HashMap<PeerId, ...> }
  replaces CompositeOperationEnv's singular connection layer. A head node
  routes invoke_peer() to the right peer via PeerRef::Specific / PeerRef::Any.
- AccessControl-based peer authorization: the existing AccessControl::check
  (peer_identity) gates peer calls — the same mechanism that gates every other
  call. remote_safe/trusted_peer/RemoteFilter/list_operations_peer_scoped/
  services_list_handler_peer_scoped are retired. The op's AccessControl IS the
  peer-authorization policy; no parallel system.
- ScopedPeerEnv: peer-qualified reachability (peer-pinned allowlist) replaces
  from_call's namespace_prefix as the disambiguation mechanism. Cross-peer
  collision dissolves (separate sub-overlays); same-peer collision stays error.
- services/list-peers opt-in for peer-attributed re-export listing.

POC-validated against real types (scratch module written, type-checked,
removed; build clean, 207 tests pass). Petgraph not needed for v1 (one-hop,
shallow); nested HashMap suffices; extends to multi-hop without redesign (OQ-32).

OQ impact: OQ-25 dissolved (no marking); OQ-28 cross-peer dissolved / same-peer
stays; OQ-26/27/29 stay; new OQ-30 (Any routing policy), OQ-31 (list-peers
semantics), OQ-32 (multi-hop federation).

Research: docs/research/alknet-call-peer-routing/findings.md (POC shapes,
prior art — Ray.io actors, Dapr service invocation, full ADR draft).
ADR-028 marked Superseded; ADR-017 DC-1 amendment updated to point at ADR-029.
2026-06-27 06:04:19 +00:00


ADR-029: Peer-Graph Routing Model for alknet-call Composition

Status

Proposed (supersedes ADR-028)

Context

The call protocol's composition model is flat per overlay and single-peer. CompositeOperationEnv holds one connection: Option<Arc<dyn OperationEnv>> overlay; the Layer 2 imported-ops overlay on CallConnection is a flat HashMap<String, HandlerRegistration> keyed by operation name. This works for one remote peer. The head→many-workers / hub→spoke pattern (the Ray.io model, and the primary downstream use case: the container-service rewrite this completion was supposed to unblock) cannot be expressed:

  1. Overlay collision. A head importing from worker A and worker B, both exposing /container/exec, has no way to route invoke("container", "exec") to the right peer. The composite env holds one connection overlay; even with two, contains("container/exec") is true for both with no disambiguation.

  2. from_call namespace prefix is a naming-convention hack. DC-3 / OQ-28 made FromCallConfig::namespace_prefix the disambiguation mechanism — the operator prefixes imported op names so two peers' ops don't collide in a flat map. This pushes disambiguation to the caller and into the ScopedOperationEnv { allowed: HashSet<String> } reachability list. It is bolted onto a flat map instead of being structural routing.

  3. ADR-028's remote_safe: bool + trusted_peer: bool is a second, parallel, weaker authorization system. ADR-028 introduced a RemoteFilter { trusted_peer: bool } gate in protocol/dispatch.rs that runs before the existing AccessControl::check. trusted_peer: true is a blanket security-bypass flag — the exact anti-pattern ADR-015 was written to kill (it replaced trusted: true with the authority-switch model). ADR-028 reintroduced it at the peer boundary. The existing authorization machinery in core (Identity with scopes and resources, IdentityProvider, AccessControl::check) is real, grounded, and already wired into the dispatch path — ADR-028 should have used it for peer authorization, not invented a parallel system.

This is a blocking structural fix, not a "v1/later" refinement. The research at docs/research/alknet-call-peer-routing/findings.md validates the design through a POC that type-checks against the real types (since removed; the shapes are recorded in the research doc). ADR-028 is superseded by this ADR.

Decision

1. Peer-keyed overlays

The Layer 2 overlay becomes peer-keyed at the composition-env level. CompositeOperationEnv's singular connection: Option<Arc<dyn OperationEnv>> is replaced by PeerCompositeEnv with peer-keyed connections:

pub struct PeerCompositeEnv {
    pub base: Arc<dyn OperationEnv + Send + Sync>,       // Layer 0 curated
    pub session: Option<Arc<dyn OperationEnv + Send + Sync>>,  // Layer 1
    pub connections: HashMap<PeerId, Arc<dyn OperationEnv + Send + Sync>>,  // Layer 2, peer-keyed
    connection_order: Vec<PeerId>,  // insertion order for PeerRef::Any first-match
}

The per-CallConnection overlay stays flat (one connection = one peer — a flat HashMap<String, HandlerRegistration> per connection is correct). The peer-keying is at the aggregation layer: the head node's composition env holds a HashMap<PeerId, connection_overlay>, not one overlay. PeerId is the peer's Identity.id — the same field Connection::identity() already exposes, already resolved in the dispatch path, and already unique per peer.
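As a rough illustration of the aggregation layer, the sketch below keeps one flat overlay per peer and tracks insertion order alongside the map. It is a simplified model, not the real types: the OperationEnv trait here is a hypothetical stand-in with only contains, FlatOverlay and peers_serving are invented for the example, and the base and session layers are omitted. The attach_peer name follows the migration call shape mentioned under Consequences; detach_peer is an assumed counterpart.

```rust
use std::collections::HashMap;
use std::sync::Arc;

pub type PeerId = String; // = Identity.id, per the ADR

// Hypothetical stand-in for the real OperationEnv trait: only `contains` is modelled.
pub trait OperationEnv {
    fn contains(&self, name: &str) -> bool;
}

// Hypothetical flat per-connection overlay (one connection = one peer).
pub struct FlatOverlay(pub Vec<&'static str>);

impl OperationEnv for FlatOverlay {
    fn contains(&self, name: &str) -> bool {
        self.0.iter().any(|op| *op == name)
    }
}

pub struct PeerCompositeEnv {
    pub connections: HashMap<PeerId, Arc<dyn OperationEnv + Send + Sync>>,
    connection_order: Vec<PeerId>, // insertion order, used for first-match routing
}

impl PeerCompositeEnv {
    pub fn new() -> Self {
        Self { connections: HashMap::new(), connection_order: Vec::new() }
    }

    // Builder-style attach. Assumption of this sketch: re-attaching a known peer
    // replaces its overlay but keeps its original position in the order.
    pub fn attach_peer(mut self, peer: &str, env: Arc<dyn OperationEnv + Send + Sync>) -> Self {
        if self.connections.insert(peer.to_string(), env).is_none() {
            self.connection_order.push(peer.to_string());
        }
        self
    }

    // On disconnect the whole sub-overlay goes away together with its key.
    pub fn detach_peer(&mut self, peer: &str) {
        self.connections.remove(peer);
        self.connection_order.retain(|p| p.as_str() != peer);
    }

    // Peers whose overlay contains `name`, in insertion order.
    pub fn peers_serving(&self, name: &str) -> Vec<&PeerId> {
        self.connection_order
            .iter()
            .filter(|p| self.connections[*p].contains(name))
            .collect()
    }
}

// Usage: two workers expose the same op name without colliding.
pub fn demo() -> (Vec<String>, Vec<String>) {
    let mut head = PeerCompositeEnv::new()
        .attach_peer("worker-a", Arc::new(FlatOverlay(vec!["container/exec"])))
        .attach_peer("worker-b", Arc::new(FlatOverlay(vec!["container/exec", "container/logs"])));
    let before: Vec<String> = head.peers_serving("container/exec").into_iter().cloned().collect();
    head.detach_peer("worker-a");
    let after: Vec<String> = head.peers_serving("container/exec").into_iter().cloned().collect();
    (before, after)
}
```

In this model both workers serve container/exec side by side, and detaching worker-a leaves only worker-b with no separate deregistration step.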

2. PeerRef routing selector

OperationEnv gains a peer-routing method with a PeerRef selector. The default-impl preserves back-compat (existing impls that don't override it delegate to invoke_with_policy, preserving current behavior):

pub enum PeerRef {
    Specific(PeerId),  // route to this peer; NOT_FOUND if it doesn't serve the op
    Any,               // first peer (insertion order) that serves it
}
pub type PeerId = String;  // = Identity.id

// Added to the OperationEnv trait as default methods (trait excerpt; other methods elided).
async fn invoke_peer(&self, peer: &PeerRef, namespace: &str, operation: &str,
    input: Value, parent: &OperationContext, policy: AbortPolicy) -> ResponseEnvelope {
    // default: ignore peer selector, dispatch via invoke_with_policy
    self.invoke_with_policy(namespace, operation, input, parent, policy).await
}
fn peer_contains(&self, _peer: &PeerId, name: &str) -> bool { self.contains(name) }

PeerRef::Specific(PeerId) routes to the named peer's overlay; if that peer doesn't serve the op, NOT_FOUND (no silent fallthrough — explicit routing must be honored or fail loudly). PeerRef::Any routes to the first peer (insertion order) whose overlay contains the op — the "any worker that serves this name" fan-out primitive. A richer RoutingPolicy (round-robin, least-loaded) is the two-way-door remainder tracked as OQ-30; the PeerRef enum is designed to compose with it without breaking the signature.

The existing invoke() / invoke_with_policy() methods stay as the PeerRef::Any equivalent for code that doesn't care about peer selection.
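The selection rule can be sketched in isolation from dispatch. The example below is a synchronous model under stated assumptions: Head, RouteError, attach, and resolve are hypothetical names, each peer's overlay is reduced to a set of op names, and the function only picks a peer rather than invoking anything.

```rust
use std::collections::{HashMap, HashSet};

pub type PeerId = String;

pub enum PeerRef {
    Specific(PeerId),
    Any,
}

#[derive(Debug, PartialEq)]
pub enum RouteError {
    NotFound,
}

// Hypothetical simplified head node: peer id -> set of op names that peer serves.
#[derive(Default)]
pub struct Head {
    overlays: HashMap<PeerId, HashSet<String>>,
    order: Vec<PeerId>, // insertion order for PeerRef::Any
}

impl Head {
    pub fn attach(&mut self, peer: &str, ops: &[&str]) {
        if !self.overlays.contains_key(peer) {
            self.order.push(peer.to_string());
        }
        self.overlays
            .insert(peer.to_string(), ops.iter().map(|s| s.to_string()).collect());
    }

    // Pick the peer that should receive the call, or fail loudly.
    pub fn resolve(&self, peer: &PeerRef, op: &str) -> Result<&str, RouteError> {
        match peer {
            // Explicit routing: the named peer must serve the op; no fallthrough.
            PeerRef::Specific(id) => self
                .overlays
                .get_key_value(id)
                .filter(|(_, ops)| ops.contains(op))
                .map(|(k, _)| k.as_str())
                .ok_or(RouteError::NotFound),
            // First peer in insertion order whose overlay contains the op.
            PeerRef::Any => self
                .order
                .iter()
                .find(|id| self.overlays[*id].contains(op))
                .map(|id| id.as_str())
                .ok_or(RouteError::NotFound),
        }
    }
}

// Usage: worker-a connects first, worker-b second.
pub fn demo_head() -> Head {
    let mut head = Head::default();
    head.attach("worker-a", &["container/exec"]);
    head.attach("worker-b", &["container/exec", "container/logs"]);
    head
}
```

With this setup, Any on container/exec selects worker-a, while Specific(worker-a) on container/logs fails with NotFound even though worker-b serves that op.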

3. AccessControl-based peer authorization; retire remote_safe/trusted_peer

RemoteFilter, HandlerRegistration::remote_safe, CallClient::trusted_peer, OperationRegistry::list_operations_peer_scoped, and services_list_handler_peer_scoped are removed. Peer authorization flows through the existing AccessControl::check against the peer's resolved Identity:

  • A remote peer's call arrives → dispatch_requested resolves the peer's Identity (as it already does today, from the connection's TLS fingerprint or the auth_token payload) → OperationRegistry::invoke runs AccessControl::check(peer_identity).
  • If the op's AccessControl is satisfied → dispatch (capabilities populated from the bundle, same as today).
  • If not → FORBIDDEN (capabilities never populated — the security property ADR-028 wanted, achieved by the existing ACL, not a parallel gate).
  • If the op is Visibility::Internal → NOT_FOUND before ACL (existing behavior). This is the "never callable from wire" case.

The three cases remote_safe was meant to handle map to existing mechanisms:

| remote_safe case | Replacement |
| --- | --- |
| Op callable by any peer (was remote_safe: true) | AccessControl::default(): no restrictions; implicitly "remote-safe" because it requires no privileged scope. |
| Op callable only by some peers | AccessControl { required_scopes: [...] }: only peers whose Identity.scopes satisfy the AND-gate may call. Per-peer differentiation via IdentityProvider config. |
| Op never callable from wire | Visibility::Internal → NOT_FOUND before ACL. Existing mechanism, unchanged. |

The op's AccessControl is the peer-authorization policy. There is no separate exposure decision. If the peer's Identity satisfies the op's AccessControl, the op dispatches and capabilities populate (same as for any authorized caller). If not, FORBIDDEN before the handler — capabilities never populate. The exposure decision and the authorization decision are the same decision, made through one mechanism, not two.
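The decision order described above (visibility first, then the scope AND-gate) can be modelled in a few lines. This is a sketch with hypothetical, simplified types: the real Identity also carries resources, the real check is AccessControl::check with a signature not shown in this ADR, and Visibility::Public, Outcome, Op, and authorize are names invented for the example.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
pub enum Outcome {
    Dispatch,  // handler runs, capabilities populate
    Forbidden, // rejected before the handler; capabilities never populate
    NotFound,  // op is not callable from the wire at all
}

// Only Internal is named in the ADR; Public is a hypothetical placeholder.
pub enum Visibility {
    Public,
    Internal,
}

// Simplified: an empty scope list models AccessControl::default() (no restrictions).
#[derive(Default)]
pub struct AccessControl {
    pub required_scopes: Vec<String>,
}

pub struct Identity {
    pub id: String,
    pub scopes: HashSet<String>,
}

pub struct Op {
    pub visibility: Visibility,
    pub access: AccessControl,
}

// Hypothetical free function standing in for the dispatch-path check.
pub fn authorize(op: &Op, peer: &Identity) -> Outcome {
    // Internal ops answer NOT_FOUND before any ACL evaluation.
    if matches!(op.visibility, Visibility::Internal) {
        return Outcome::NotFound;
    }
    // AND-gate: every required scope must be present on the peer's identity.
    let satisfied = op
        .access
        .required_scopes
        .iter()
        .all(|scope| peer.scopes.contains(scope));
    if satisfied { Outcome::Dispatch } else { Outcome::Forbidden }
}

// Helper for building test identities.
pub fn identity(id: &str, scopes: &[&str]) -> Identity {
    Identity {
        id: id.to_string(),
        scopes: scopes.iter().map(|s| s.to_string()).collect(),
    }
}
```

Per-peer differentiation then comes down to which scopes the IdentityProvider hands each peer; the op itself carries a single policy.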

4. Peer-qualified reachability (ScopedPeerEnv)

ScopedOperationEnv { allowed: HashSet<String> } is extended with an optional peer-pinned allowlist. Unqualified reachability (peer-agnostic composition — "I want to call container/exec on whichever worker serves it") stays the common case; peer-pinning is opt-in for the disambiguation case that replaces FromCallConfig::namespace_prefix:

pub struct ScopedPeerEnv {
    pub allowed_ops: HashSet<String>,    // peer-agnostic — reachable via PeerRef::Any
    pub peer_pinned: HashSet<String>,    // "peer-id/op-name" — reachable only via PeerRef::Specific(that peer)
}

Instead of prefixing the op name (the flat-namespace hack), you pin the peer in the reachability set. The existing ScopedOperationEnv.allowed becomes the allowed_ops field; peer-pinning is additive.
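A reachability check over the two sets might look like the following sketch. The is_reachable method and the set helper are hypothetical. The text does not say whether an allowed_ops entry also permits a PeerRef::Specific call; the sketch takes the stricter reading (Specific requires a pinned entry) and labels it as an assumption.

```rust
use std::collections::HashSet;

pub type PeerId = String;

pub enum PeerRef {
    Specific(PeerId),
    Any,
}

pub struct ScopedPeerEnv {
    pub allowed_ops: HashSet<String>, // peer-agnostic, reachable via PeerRef::Any
    pub peer_pinned: HashSet<String>, // "peer-id/op-name", reachable via PeerRef::Specific
}

impl ScopedPeerEnv {
    // Hypothetical check. Assumption: a Specific call needs a matching pinned
    // entry and is not implied by a peer-agnostic allowed_ops entry.
    pub fn is_reachable(&self, peer: &PeerRef, op: &str) -> bool {
        match peer {
            PeerRef::Any => self.allowed_ops.contains(op),
            PeerRef::Specific(id) => self.peer_pinned.contains(&format!("{id}/{op}")),
        }
    }
}

// Helper for building string sets.
pub fn set(items: &[&str]) -> HashSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}

// Usage: container/exec on any worker, container/logs only on worker-b.
pub fn demo_env() -> ScopedPeerEnv {
    ScopedPeerEnv {
        allowed_ops: set(&["container/exec"]),
        peer_pinned: set(&["worker-b/container/logs"]),
    }
}
```

The op name stays unprefixed in both sets; the peer appears only in the pinned key, so the same name can be pinned to several peers independently.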

5. from_call peer-keyed registration; collision rule change

from_call registers into the specific peer's sub-overlay, not a flat overlay. Cross-peer collision dissolves: same name on different peers is fine (separate sub-overlays, no collision, no prefix needed). Same-peer collision stays an error (a peer shouldn't expose two ops with the same name).

FromCallConfig::namespace_prefix becomes optional local-naming sugar for the case where the importing node wants to expose a peer's ops under a different name locally — a local-naming concern, not a disambiguation concern. It defaults to None.
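The collision rule can be illustrated with a minimal registration model. PeerOverlays and register are hypothetical names, and each sub-overlay is reduced to a set of op names; the SamePeerCollision variant name is borrowed from the OQ-26 note in the References, where it is only a candidate. The optional local-naming prefix is left out.

```rust
use std::collections::{HashMap, HashSet};

pub type PeerId = String;

#[derive(Debug, PartialEq)]
pub enum AdapterError {
    // Candidate variant name from OQ-26; not a settled API.
    SamePeerCollision { peer: PeerId, op: String },
}

// Hypothetical model: one set of imported op names per peer.
#[derive(Default)]
pub struct PeerOverlays {
    by_peer: HashMap<PeerId, HashSet<String>>,
}

impl PeerOverlays {
    // Register an imported op into the given peer's sub-overlay.
    pub fn register(&mut self, peer: &str, op: &str) -> Result<(), AdapterError> {
        let overlay = self.by_peer.entry(peer.to_string()).or_default();
        // HashSet::insert returns false when the name is already present.
        if !overlay.insert(op.to_string()) {
            return Err(AdapterError::SamePeerCollision {
                peer: peer.to_string(),
                op: op.to_string(),
            });
        }
        Ok(())
    }

    pub fn peer_contains(&self, peer: &str, op: &str) -> bool {
        self.by_peer.get(peer).map_or(false, |ops| ops.contains(op))
    }
}
```

Registering container/exec for worker-a and again for worker-b succeeds because the two names live in separate sub-overlays; registering it twice for worker-a is the only case that errors.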

6. services/list AccessControl-filtered; services/list-peers opt-in

services/list filters by AccessControl::check(calling_peer_identity) — the calling peer sees only ops it is authorized to call. The services_list_handler / services_list_handler_peer_scoped split collapses to a single AccessControl-filtered handler. services/list-peers is the opt-in for peer-attributed re-export listing (each peer's sub-overlay listed with attribution, filtered by the calling peer's authorization).
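The two listing shapes can be sketched as filters over the same visibility predicate. Everything here is illustrative: OpEntry, visible_to, and both function signatures are hypothetical, authorization is reduced to a scope AND-gate plus an internal flag, and the exact semantics of services/list-peers remain open under OQ-31.

```rust
use std::collections::HashSet;

// Hypothetical flattened registration entry.
pub struct OpEntry {
    pub name: String,
    pub internal: bool,               // models Visibility::Internal
    pub required_scopes: Vec<String>, // models AccessControl's scope AND-gate
}

// An op is listed only if the calling peer would be allowed to call it.
fn visible_to(op: &OpEntry, caller_scopes: &HashSet<String>) -> bool {
    !op.internal && op.required_scopes.iter().all(|s| caller_scopes.contains(s))
}

// services/list: the node's own ops, filtered by the caller's authorization.
pub fn services_list(own: &[OpEntry], caller_scopes: &HashSet<String>) -> Vec<String> {
    own.iter()
        .filter(|op| visible_to(op, caller_scopes))
        .map(|op| op.name.clone())
        .collect()
}

// services/list-peers: re-exported ops with peer attribution, same filter.
pub fn services_list_peers(
    by_peer: &[(String, Vec<OpEntry>)],
    caller_scopes: &HashSet<String>,
) -> Vec<(String, String)> {
    let mut out = Vec::new();
    for (peer, ops) in by_peer {
        for op in ops.iter().filter(|op| visible_to(op, caller_scopes)) {
            out.push((peer.clone(), op.name.clone()));
        }
    }
    out
}

// Helper for building entries.
pub fn entry(name: &str, internal: bool, scopes: &[&str]) -> OpEntry {
    OpEntry {
        name: name.to_string(),
        internal,
        required_scopes: scopes.iter().map(|s| s.to_string()).collect(),
    }
}
```

Because listing and dispatch share one predicate in this model, a peer never sees an op it would be refused on calling.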

Consequences

Positive:

  • The head→N-workers pattern works. A head with multiple worker connections routes invoke_peer() to the right peer via PeerRef. This is the primary use case the previous model couldn't express.
  • One authorization system, not two. Peer authorization flows through the existing AccessControl/Identity machinery — the same mechanism that gates every other call. No parallel remote_safe gate, no blanket-bypass trusted_peer flag. Per-peer differentiation is via IdentityProvider config (different peers get different scopes), which is a real authorization decision, not a boolean.
  • Structural disconnect cleanup. When a peer disconnects, its sub-overlay drops (the PeerId key is removed from connections). No stale overlay, no explicit deregistration. An in-flight PeerRef::Specific(that_peer) gets NOT_FOUND — the correct failure mode.
  • from_call collision dissolves across peers. Two workers exposing /container/exec coexist; the prefix is no longer the disambiguation mechanism.
  • The OperationEnv trait gains a method with a default-impl, preserving back-compat. Existing impls (LocalOperationEnv, OverlayOperationEnv) work unchanged; PeerCompositeEnv overrides with real peer routing.
  • The peer-keyed overlay model extends naturally to multi-hop federation (a chain of PeerRef::Specific routing decisions) without redesign. Petgraph is not needed for v1 (one-hop, shallow); it pays off if multi-hop path-finding becomes real (OQ-32).

Negative:

  • CompositeOperationEnv → PeerCompositeEnv is a migration. Existing call sites that construct CompositeOperationEnv::new(base, Some(conn), session) migrate to PeerCompositeEnv::new(base).with_session(session).attach_peer(peer_id, conn). The singular-connection case (one peer) is the degenerate case (connections with one entry).
  • OperationEnv trait gains a method. The default-impl preserves back-compat, but it's a trait surface change; downstream impls (alknet-http, alknet-agent) gain the method with the default delegation.
  • services/list semantics change: the filter is AccessControl-based, not remote_safe-based. An op with AccessControl::default() (no restrictions) is now listed to any peer — this is correct (it's implicitly callable by any authenticated peer), but operators who relied on remote_safe: false to hide ops from peers must instead set required_scopes or Visibility::Internal.
  • ADR-028 is superseded. The remote_safe field, trusted_peer flag, RemoteFilter, list_operations_peer_scoped, and services_list_handler_peer_scoped are removed. Code that references them (the CallClient, Dispatcher, HandlerRegistration, discovery.rs) changes. This is the cost of fixing a one-way-door miss — the previous model shipped and was reviewed before the structural gap was caught.
  • PeerId = Identity.id (the fingerprint) is not stable across key rotation. A peer that rotates its TLS key gets a new PeerId; in-flight PeerRef::Specific(old_id) gets NOT_FOUND after reconnect. For the immediate use case (head→workers where the operator controls key rotation), this is acceptable. A stable logical node name decoupled from cryptographic identity is the cleaner long-term shape (see Assumption 1).

Assumptions

  1. PeerId = Identity.id (the fingerprint). Reconnects with a rotated key change the PeerId; the peer-keyed overlay drops the old PeerId's sub-overlay and creates a new one. An in-flight PeerRef::Specific(old_id) gets NOT_FOUND. This is acceptable for v1 (operator-controlled key rotation in the head→workers pattern). A stable logical node name separate from the cryptographic identity is a future question; the peer-keyed overlay model accommodates it by changing what PeerId aliases, not by redesign.

  2. PeerRef::Any = insertion-order first-match. Deterministic but order-dependent (worker A connects before worker B → Any routes to A until A disconnects). This is the simplest routing policy and is correct for the immediate use case (the head picks the first worker that serves the op). A richer RoutingPolicy (round-robin, least-loaded, affinity) is OQ-30; the PeerRef enum composes with it without breaking the signature.

  3. services/list defaults to "own ops only" (unchanged from today). Re-exported peer ops are not listed unless the calling peer invokes services/list-peers (the opt-in). The re-export policy (which peers' ops a given peer sees) is an AccessControl decision on the listing op.

  4. Capability exposure under PeerRef::Any. When a handler composes via Any and routing picks worker A, the handler's Capabilities propagate to worker A's call (same as today's from_call forwarding). This is correct: the handler declared the op in its scoped env, so it authorized the composition; the peer selection is a routing detail. If a handler needs per-peer capability scoping, it uses PeerRef::Specific and peer-pinned reachability.

  5. Multi-hop federation is out of scope for v1. Worker A does not transitively see worker B's ops through the head unless the head explicitly re-exports them. The peer-keyed overlay model extends to multi-hop without redesign (a chain of PeerRef::Specific decisions), but path-finding (which peer reaches which op transitively) is where petgraph would pay off (OQ-32, not designed).

References

  • ADR-015: Privilege Model and Authority Context (the authority-switch pattern ADR-028 violated by reintroducing a blanket-bypass flag)
  • ADR-017: Call Protocol Client and Adapter Contract (amended: CallClient no longer has trusted_peer; the client/adapter spec updates)
  • ADR-022: Handler Registration, Provenance, and Composition Authority (remote_safe field removed from the registration bundle)
  • ADR-024: Operation Registry Layering (Layer 2 becomes peer-keyed at the composition-env aggregation level)
  • ADR-028: Peer-Scoped Registry Filtering for CallClient Inbound Dispatch (superseded)
  • OQ-25: dissolved (no remote_safe marking — AccessControl is the policy)
  • OQ-26: stays (AdapterError — a SamePeerCollision variant may replace the flat Conflict variant)
  • OQ-27: stays (re-import trigger — unchanged; the overlay is now peer-scoped)
  • OQ-28: dissolved cross-peer (same name on different peers is fine); stays same-peer
  • OQ-29: stays (TLS client-auth — orthogonal to the routing model)
  • OQ-30: PeerRef::Any routing policy (new — round-robin/least-loaded)
  • OQ-31: services/list-peers re-export semantics (new)
  • OQ-32: Multi-hop federation (new — petgraph candidate)
  • Research: docs/research/alknet-call-peer-routing/findings.md
  • Prior art: Ray.io actors (ActorHandle = PeerRef::Specific), Dapr service invocation (app-ID routing = PeerRef::Specific, access-control allowlist = AccessControl-based peer authorization)