ADR-028's remote_safe/trusted_peer was a parallel, weaker authorization system
that duplicated the existing AccessControl/Identity machinery and couldn't
express the head→N-workers pattern (the primary use case). The flat-namespace
single-peer overlay model (one connection layer in CompositeOperationEnv)
structurally breaks the moment a head has two workers both exposing
/container/exec.
ADR-029 replaces it with:
- Peer-keyed overlays: PeerCompositeEnv { connections: HashMap<PeerId, ...> }
replaces CompositeOperationEnv's singular connection layer. A head node
routes invoke_peer() to the right peer via PeerRef::Specific / PeerRef::Any.
- AccessControl-based peer authorization: the existing AccessControl::check
(peer_identity) gates peer calls — the same mechanism that gates every other
call. remote_safe/trusted_peer/RemoteFilter/list_operations_peer_scoped/
services_list_handler_peer_scoped are retired. The op's AccessControl IS the
peer-authorization policy; no parallel system.
- ScopedPeerEnv: peer-qualified reachability (peer-pinned allowlist) replaces
from_call's namespace_prefix as the disambiguation mechanism. Cross-peer
collision dissolves (separate sub-overlays); same-peer collision stays error.
- services/list-peers opt-in for peer-attributed re-export listing.
POC-validated against real types (scratch module written, type-checked,
removed; build clean, 207 tests pass). Petgraph not needed for v1 (one-hop,
shallow); nested HashMap suffices; extends to multi-hop without redesign (OQ-32).
OQ impact: OQ-25 dissolved (no marking); OQ-28 cross-peer dissolved / same-peer
stays; OQ-26/27/29 stay; new OQ-30 (Any routing policy), OQ-31 (list-peers
semantics), OQ-32 (multi-hop federation).
Research: docs/research/alknet-call-peer-routing/findings.md (POC shapes,
prior art — Ray.io actors, Dapr service invocation, full ADR draft).
ADR-028 marked Superseded; ADR-017 DC-1 amendment updated to point at ADR-029.
ADR-029: Peer-Graph Routing Model for alknet-call Composition
Status
Proposed (supersedes ADR-028)
Context
The call protocol's composition model is flat per overlay and single-peer.
CompositeOperationEnv holds one connection: Option<Arc<dyn OperationEnv>>
overlay; the Layer 2 imported-ops overlay on CallConnection is a flat
HashMap<String, HandlerRegistration> keyed by operation name. This works for
one remote peer. The head→many-workers / hub→spoke pattern (the ray.io model,
and the primary downstream use case — the container-service rewrite this
completion was supposed to unblock) cannot be expressed:
- Overlay collision. A head importing from worker A and worker B, both exposing /container/exec, has no way to route invoke("container", "exec") to the right peer. The composite env holds one connection overlay; even with two, contains("container/exec") is true for both with no disambiguation.
- from_call namespace prefix is a naming-convention hack. DC-3 / OQ-28 made FromCallConfig::namespace_prefix the disambiguation mechanism: the operator prefixes imported op names so two peers' ops don't collide in a flat map. This pushes disambiguation to the caller and into the ScopedOperationEnv { allowed: HashSet<String> } reachability list. It is bolted onto a flat map instead of being structural routing.
- ADR-028's remote_safe: bool + trusted_peer: bool is a second, parallel, weaker authorization system. ADR-028 introduced a RemoteFilter { trusted_peer: bool } gate in protocol/dispatch.rs that runs before the existing AccessControl::check. trusted_peer: true is a blanket security-bypass flag, the exact anti-pattern ADR-015 was written to kill (it replaced trusted: true with the authority-switch model). ADR-028 reintroduced it at the peer boundary. The existing authorization machinery in core (Identity with scopes and resources, IdentityProvider, AccessControl::check) is real, grounded, and already wired into the dispatch path; ADR-028 should have used it for peer authorization, not invented a parallel system.
This is a blocking structural fix, not a "v1/later" refinement. The research
at docs/research/alknet-call-peer-routing/findings.md validates the design
through a POC that type-checks against the real types (since removed; the
shapes are recorded in the research doc). ADR-028 is superseded by this ADR.
Decision
1. Peer-keyed overlays
The Layer 2 overlay becomes peer-keyed at the composition-env level.
CompositeOperationEnv's singular connection: Option<Arc<dyn OperationEnv>>
is replaced by PeerCompositeEnv with peer-keyed connections:
pub struct PeerCompositeEnv {
    pub base: Arc<dyn OperationEnv + Send + Sync>, // Layer 0 curated
    pub session: Option<Arc<dyn OperationEnv + Send + Sync>>, // Layer 1
    pub connections: HashMap<PeerId, Arc<dyn OperationEnv + Send + Sync>>, // Layer 2, peer-keyed
    connection_order: Vec<PeerId>, // insertion order for PeerRef::Any first-match
}
The per-CallConnection overlay stays flat (one connection = one peer — a
flat HashMap<String, HandlerRegistration> per connection is correct). The
peer-keying is at the aggregation layer: the head node's composition env
holds a HashMap<PeerId, connection_overlay>, not one overlay. PeerId is
the peer's Identity.id — the same field Connection::identity() already
exposes, already resolved in the dispatch path, and already unique per peer.
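The aggregation-layer bookkeeping described above can be sketched as follows. This is a minimal, synchronous illustration rather than the actual implementation: the one-method OperationEnv trait, the Env alias, the FlatOverlay type, detach_peer, and peers_serving are all hypothetical stand-ins introduced here, and with_session taking an Option is an assumption.

```rust
use std::collections::HashMap;
use std::sync::Arc;

pub type PeerId = String; // = Identity.id

/// Simplified stand-in for the real trait (assumption: the real one is async
/// and has more methods; only `contains` is needed for this sketch).
pub trait OperationEnv {
    fn contains(&self, name: &str) -> bool;
}

/// Hypothetical alias to keep signatures short.
pub type Env = Arc<dyn OperationEnv + Send + Sync>;

/// Hypothetical flat overlay: one connection = one peer = one flat name list.
pub struct FlatOverlay(pub Vec<&'static str>);

impl OperationEnv for FlatOverlay {
    fn contains(&self, name: &str) -> bool {
        self.0.iter().any(|op| *op == name)
    }
}

pub struct PeerCompositeEnv {
    pub base: Env,                         // Layer 0
    pub session: Option<Env>,              // Layer 1
    pub connections: HashMap<PeerId, Env>, // Layer 2, peer-keyed
    connection_order: Vec<PeerId>,         // insertion order
}

impl PeerCompositeEnv {
    pub fn new(base: Env) -> Self {
        Self { base, session: None, connections: HashMap::new(), connection_order: Vec::new() }
    }

    pub fn with_session(mut self, session: Option<Env>) -> Self {
        self.session = session;
        self
    }

    /// Attach (or replace) a peer's sub-overlay; insertion order is recorded once.
    pub fn attach_peer(mut self, peer: &str, env: Env) -> Self {
        if self.connections.insert(peer.to_string(), env).is_none() {
            self.connection_order.push(peer.to_string());
        }
        self
    }

    /// Drop a peer's sub-overlay on disconnect; returns whether it was attached.
    pub fn detach_peer(&mut self, peer: &str) -> bool {
        self.connection_order.retain(|p| p != peer);
        self.connections.remove(peer).is_some()
    }

    /// Peers, in insertion order, whose sub-overlay serves `name`.
    pub fn peers_serving(&self, name: &str) -> Vec<&str> {
        self.connection_order
            .iter()
            .filter(|p| self.connections[*p].contains(name))
            .map(|p| p.as_str())
            .collect()
    }
}
```

With two workers attached that both expose container/exec, peers_serving returns both in attachment order, and detaching one leaves only the other, with no explicit deregistration of individual ops.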
2. PeerRef routing selector
OperationEnv gains a peer-routing method with a PeerRef selector. The
default-impl preserves back-compat (existing impls that don't override it
delegate to invoke_with_policy, preserving current behavior):
pub enum PeerRef {
    Specific(PeerId), // route to this peer; NOT_FOUND if it doesn't serve the op
    Any,              // first peer (insertion order) that serves it
}
pub type PeerId = String; // = Identity.id
// New methods on the OperationEnv trait, both with default impls:
async fn invoke_peer(&self, peer: &PeerRef, namespace: &str, operation: &str,
    input: Value, parent: &OperationContext, policy: AbortPolicy) -> ResponseEnvelope {
    // default: ignore peer selector, dispatch via invoke_with_policy
    self.invoke_with_policy(namespace, operation, input, parent, policy).await
}
fn peer_contains(&self, _peer: &PeerId, name: &str) -> bool { self.contains(name) }
PeerRef::Specific(PeerId) routes to the named peer's overlay; if that peer
doesn't serve the op, NOT_FOUND (no silent fallthrough — explicit routing
must be honored or fail loudly). PeerRef::Any routes to the first peer
(insertion order) whose overlay contains the op — the "any worker that serves
this name" fan-out primitive. A richer RoutingPolicy (round-robin,
least-loaded) is the two-way-door remainder tracked as OQ-30; the PeerRef
enum is designed to compose with it without breaking the signature.
The existing invoke() / invoke_with_policy() methods stay as the
PeerRef::Any equivalent for code that doesn't care about peer selection.
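The selector semantics above (Specific fails loudly, Any picks the first match in insertion order) can be illustrated with a small resolution function. This is a sketch under stated assumptions, not the real invoke_peer: PeerRoutes, attach, resolve, and RouteError are hypothetical names, each peer's overlay is reduced to a set of op names, and the real method is async and returns a ResponseEnvelope.

```rust
use std::collections::{HashMap, HashSet};

pub type PeerId = String; // = Identity.id

pub enum PeerRef {
    Specific(PeerId),
    Any,
}

/// Hypothetical error type standing in for a NOT_FOUND response.
#[derive(Debug, PartialEq)]
pub enum RouteError {
    NotFound,
}

/// Hypothetical routing table: per-peer op names plus insertion order.
#[derive(Default)]
pub struct PeerRoutes {
    overlays: HashMap<PeerId, HashSet<String>>,
    order: Vec<PeerId>,
}

impl PeerRoutes {
    pub fn attach(&mut self, peer: &str, ops: &[&str]) {
        let set: HashSet<String> = ops.iter().map(|s| s.to_string()).collect();
        if self.overlays.insert(peer.to_string(), set).is_none() {
            self.order.push(peer.to_string());
        }
    }

    /// Pick the peer that should receive `op`, or NotFound.
    pub fn resolve(&self, peer: &PeerRef, op: &str) -> Result<&str, RouteError> {
        match peer {
            // Explicit routing: no silent fallthrough to another peer.
            PeerRef::Specific(id) => match self.overlays.get_key_value(id) {
                Some((key, ops)) if ops.contains(op) => Ok(key.as_str()),
                _ => Err(RouteError::NotFound),
            },
            // First peer, in insertion order, whose overlay contains the op.
            PeerRef::Any => self
                .order
                .iter()
                .find(|id| self.overlays[*id].contains(op))
                .map(|id| id.as_str())
                .ok_or(RouteError::NotFound),
        }
    }
}
```

If worker-a and worker-b both serve container/exec and worker-a attached first, Any resolves to worker-a, while Specific("worker-b") resolves to worker-b. Asking Specific("worker-a") for an op only worker-b serves yields NotFound rather than falling through.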
3. AccessControl-based peer authorization; retire remote_safe/trusted_peer
RemoteFilter, HandlerRegistration::remote_safe,
CallClient::trusted_peer, OperationRegistry::list_operations_peer_scoped,
and services_list_handler_peer_scoped are removed. Peer authorization
flows through the existing AccessControl::check against the peer's resolved
Identity:
- A remote peer's call arrives → dispatch_requested resolves the peer's Identity (already does, from the connection's TLS fingerprint or the auth_token payload) → OperationRegistry::invoke runs AccessControl::check(peer_identity).
- If the op's AccessControl is satisfied → dispatch (capabilities populated from the bundle, same as today).
- If not → FORBIDDEN (capabilities never populated; this is the security property ADR-028 wanted, achieved by the existing ACL, not a parallel gate).
- If the op is Visibility::Internal → NOT_FOUND before ACL (existing behavior). This is the "never callable from wire" case.
The three cases remote_safe was meant to handle map to existing mechanisms:
| remote_safe case | Replacement |
|---|---|
| Op callable by any peer (was remote_safe: true) | AccessControl::default(): no restrictions; implicitly "remote-safe" because it requires no privileged scope. |
| Op callable only by some peers | AccessControl { required_scopes: [...] }: only peers whose Identity.scopes satisfy the AND-gate may call. Per-peer differentiation via IdentityProvider config. |
| Op never callable from wire | Visibility::Internal: NOT_FOUND before ACL. Existing mechanism, unchanged. |
The op's AccessControl is the peer-authorization policy. There is no
separate exposure decision. If the peer's Identity satisfies the op's
AccessControl, the op dispatches and capabilities populate (same as for any
authorized caller). If not, FORBIDDEN before the handler — capabilities
never populate. The exposure decision and the authorization decision are the
same decision, made through one mechanism, not two.
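The check order described in this section (Internal visibility first, then the ACL, then dispatch) can be sketched with simplified types. The shapes below are assumptions for illustration only: the real Identity also carries resources, the Visibility::Public variant name is hypothetical (the ADR names only Internal), and Op, Outcome, and dispatch_from_peer do not exist in the codebase.

```rust
use std::collections::HashSet;

/// Hypothetical result of an inbound peer call.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Dispatched,
    Forbidden,
    NotFound,
}

#[derive(PartialEq)]
pub enum Visibility {
    Public, // assumed name for "callable from the wire"
    Internal,
}

/// Simplified: the real Identity also has resources.
pub struct Identity {
    pub id: String,
    pub scopes: HashSet<String>,
}

#[derive(Default)]
pub struct AccessControl {
    pub required_scopes: Vec<String>,
}

impl AccessControl {
    /// AND-gate: every required scope must be present on the identity.
    pub fn check(&self, identity: &Identity) -> bool {
        self.required_scopes.iter().all(|s| identity.scopes.contains(s))
    }
}

/// Hypothetical registration record for one op.
pub struct Op {
    pub visibility: Visibility,
    pub access: AccessControl,
}

pub fn dispatch_from_peer(op: &Op, peer: &Identity) -> Outcome {
    // Internal ops are invisible from the wire, before any ACL evaluation.
    if op.visibility == Visibility::Internal {
        return Outcome::NotFound;
    }
    // The op's AccessControl is the peer-authorization policy.
    if !op.access.check(peer) {
        return Outcome::Forbidden; // handler never runs, capabilities never populate
    }
    Outcome::Dispatched
}
```

An op with AccessControl::default() dispatches for any resolved peer, an op with required_scopes dispatches only for peers holding every listed scope, and an Internal op reports NotFound regardless of scopes. These are the three rows of the table above.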
4. Peer-qualified reachability (ScopedPeerEnv)
ScopedOperationEnv { allowed: HashSet<String> } is extended with an optional
peer-pinned allowlist. Unqualified reachability (peer-agnostic composition —
"I want to call container/exec on whichever worker serves it") stays the
common case; peer-pinning is opt-in for the disambiguation case that replaces
FromCallConfig::namespace_prefix:
pub struct ScopedPeerEnv {
    pub allowed_ops: HashSet<String>, // peer-agnostic: reachable via PeerRef::Any
    pub peer_pinned: HashSet<String>, // "peer-id/op-name": reachable only via PeerRef::Specific(that peer)
}
Instead of prefixing the op name (the flat-namespace hack), you pin the
peer in the reachability set. The existing ScopedOperationEnv.allowed
becomes the allowed_ops field; peer-pinning is additive.
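A reachability check over the two sets might look like the sketch below. The is_reachable method is hypothetical, and one point is an assumption the ADR does not settle: here an op in allowed_ops is treated as reachable through any selector, including Specific, on the reading that "peer-agnostic" means no peer restriction. A stricter reading would let Specific consult peer_pinned only.

```rust
use std::collections::HashSet;

pub type PeerId = String; // = Identity.id

pub enum PeerRef {
    Specific(PeerId),
    Any,
}

pub struct ScopedPeerEnv {
    pub allowed_ops: HashSet<String>, // peer-agnostic op names
    pub peer_pinned: HashSet<String>, // "peer-id/op-name" entries
}

impl ScopedPeerEnv {
    /// Hypothetical check: may this handler reach `op` through `peer`?
    pub fn is_reachable(&self, peer: &PeerRef, op: &str) -> bool {
        match peer {
            // Unqualified composition: only peer-agnostic entries apply.
            PeerRef::Any => self.allowed_ops.contains(op),
            // Pinned composition: a peer-agnostic entry (assumption) or a
            // pin for exactly this peer makes the op reachable.
            PeerRef::Specific(id) => {
                self.allowed_ops.contains(op)
                    || self.peer_pinned.contains(&format!("{}/{}", id, op))
            }
        }
    }
}
```

With container/exec pinned to worker-b only, the op is reachable via Specific("worker-b") but not via Any or Specific("worker-a"), which is the disambiguation that namespace_prefix previously had to provide through renaming.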
5. from_call peer-keyed registration; collision rule change
from_call registers into the specific peer's sub-overlay, not a flat
overlay. Cross-peer collision dissolves: same name on different peers is fine
(separate sub-overlays, no collision, no prefix needed). Same-peer collision
stays an error (a peer shouldn't expose two ops with the same name).
FromCallConfig::namespace_prefix becomes optional local-naming sugar for
the case where the importing node wants to expose a peer's ops under a
different name locally — a local-naming concern, not a disambiguation
concern. It defaults to None.
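The collision rule change can be sketched as a registration function keyed by peer. Everything here is illustrative: PeerOverlays and register are hypothetical names, the SamePeerCollision variant is only floated as a possibility under OQ-26, and joining the optional prefix with "/" is an assumption about how local naming would work.

```rust
use std::collections::{HashMap, HashSet};

pub type PeerId = String; // = Identity.id

/// Hypothetical error shape; OQ-26 leaves the real variant open.
#[derive(Debug, PartialEq)]
pub enum AdapterError {
    SamePeerCollision { peer: PeerId, op: String },
}

/// Hypothetical peer-keyed registry of imported op names.
#[derive(Default)]
pub struct PeerOverlays {
    overlays: HashMap<PeerId, HashSet<String>>,
}

impl PeerOverlays {
    /// Register an imported op into the given peer's sub-overlay.
    /// Returns the local name on success.
    pub fn register(
        &mut self,
        peer: &str,
        op: &str,
        namespace_prefix: Option<&str>,
    ) -> Result<String, AdapterError> {
        // Optional local-naming sugar (assumed "/" separator).
        let local_name = match namespace_prefix {
            Some(prefix) => format!("{}/{}", prefix, op),
            None => op.to_string(),
        };
        let overlay = self.overlays.entry(peer.to_string()).or_default();
        // A duplicate within one peer's sub-overlay is still an error.
        if !overlay.insert(local_name.clone()) {
            return Err(AdapterError::SamePeerCollision {
                peer: peer.to_string(),
                op: local_name,
            });
        }
        Ok(local_name)
    }
}
```

Registering container/exec for worker-a and then for worker-b succeeds both times because each lands in its own sub-overlay. Registering it a second time for worker-a returns the collision error.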
6. services/list AccessControl-filtered; services/list-peers opt-in
services/list filters by AccessControl::check(calling_peer_identity) — the
calling peer sees only ops it is authorized to call. The
services_list_handler / services_list_handler_peer_scoped split collapses
to a single AccessControl-filtered handler. services/list-peers is the
opt-in for peer-attributed re-export listing (each peer's sub-overlay listed
with attribution, filtered by the calling peer's authorization).
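The single AccessControl-filtered listing handler could reduce to a filter over registrations, as in this sketch. The types are simplified stand-ins (Registration, the Visibility::Public variant name, and the services_list function are hypothetical), and hiding Internal ops from the listing is an assumption that follows from their being NOT_FOUND on the wire.

```rust
use std::collections::HashSet;

#[derive(PartialEq)]
pub enum Visibility {
    Public, // assumed name for "callable from the wire"
    Internal,
}

/// Simplified: only scopes matter for this sketch.
pub struct Identity {
    pub scopes: HashSet<String>,
}

#[derive(Default)]
pub struct AccessControl {
    pub required_scopes: Vec<String>,
}

impl AccessControl {
    /// AND-gate over the caller's scopes.
    pub fn check(&self, identity: &Identity) -> bool {
        self.required_scopes.iter().all(|s| identity.scopes.contains(s))
    }
}

/// Hypothetical registration record.
pub struct Registration {
    pub name: String,
    pub visibility: Visibility,
    pub access: AccessControl,
}

/// List only the ops the calling peer is authorized to call.
pub fn services_list<'a>(ops: &'a [Registration], caller: &Identity) -> Vec<&'a str> {
    ops.iter()
        .filter(|op| op.visibility != Visibility::Internal && op.access.check(caller))
        .map(|op| op.name.as_str())
        .collect()
}
```

A peer with no scopes sees only unrestricted ops, while a peer holding the required scope also sees the scoped op. The listing and the dispatch decision therefore agree by construction.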
Consequences
Positive:
- The head→N-workers pattern works. A head with multiple worker connections routes invoke_peer() to the right peer via PeerRef. This is the primary use case the previous model couldn't express.
- One authorization system, not two. Peer authorization flows through the existing AccessControl/Identity machinery, the same mechanism that gates every other call. No parallel remote_safe gate, no blanket-bypass trusted_peer flag. Per-peer differentiation is via IdentityProvider config (different peers get different scopes), which is a real authorization decision, not a boolean.
- Structural disconnect cleanup. When a peer disconnects, its sub-overlay drops (the PeerId key is removed from connections). No stale overlay, no explicit deregistration. An in-flight PeerRef::Specific(that_peer) gets NOT_FOUND, the correct failure mode.
- from_call collision dissolves across peers. Two workers exposing /container/exec coexist; the prefix is no longer the disambiguation mechanism.
- The OperationEnv trait gains a method with a default-impl, preserving back-compat. Existing impls (LocalOperationEnv, OverlayOperationEnv) work unchanged; PeerCompositeEnv overrides with real peer routing.
- The peer-keyed overlay model extends naturally to multi-hop federation (a chain of PeerRef::Specific routing decisions) without redesign. Petgraph is not needed for v1 (one-hop, shallow); it pays off if multi-hop path-finding becomes real (OQ-32).
Negative:
- CompositeOperationEnv → PeerCompositeEnv is a migration. Existing call sites that construct CompositeOperationEnv::new(base, Some(conn), session) migrate to PeerCompositeEnv::new(base).with_session(session).attach_peer(peer_id, conn). The singular-connection case (one peer) is the degenerate case (connections with one entry).
- OperationEnv trait gains a method. The default-impl preserves back-compat, but it's a trait surface change; downstream impls (alknet-http, alknet-agent) gain the method with the default delegation.
- services/list semantics change: the filter is AccessControl-based, not remote_safe-based. An op with AccessControl::default() (no restrictions) is now listed to any peer. This is correct (it's implicitly callable by any authenticated peer), but operators who relied on remote_safe: false to hide ops from peers must instead set required_scopes or Visibility::Internal.
- ADR-028 is superseded. The remote_safe field, trusted_peer flag, RemoteFilter, list_operations_peer_scoped, and services_list_handler_peer_scoped are removed. Code that references them (the CallClient, Dispatcher, HandlerRegistration, discovery.rs) changes. This is the cost of fixing a one-way-door miss: the previous model shipped and was reviewed before the structural gap was caught.
- PeerId = Identity.id (the fingerprint) is not stable across key rotation. A peer that rotates its TLS key gets a new PeerId; in-flight PeerRef::Specific(old_id) gets NOT_FOUND after reconnect. For the immediate use case (head→workers where the operator controls key rotation), this is acceptable. A stable logical node name decoupled from cryptographic identity is the cleaner long-term shape (assumption 1).
Assumptions
1. PeerId = Identity.id (the fingerprint). Reconnects with a rotated key change the PeerId; the peer-keyed overlay drops the old PeerId's sub-overlay and creates a new one. An in-flight PeerRef::Specific(old_id) gets NOT_FOUND. This is acceptable for v1 (operator-controlled key rotation in the head→workers pattern). A stable logical node name separate from the cryptographic identity is a future question; the peer-keyed overlay model accommodates it by changing what PeerId aliases, not by redesign.
2. PeerRef::Any = insertion-order first-match. Deterministic but order-dependent (worker A connects before worker B → Any routes to A until A disconnects). This is the simplest routing policy and is correct for the immediate use case (the head picks the first worker that serves the op). A richer RoutingPolicy (round-robin, least-loaded, affinity) is OQ-30; the PeerRef enum composes with it without breaking the signature.
3. services/list defaults to "own ops only" (unchanged from today). Re-exported peer ops are not listed unless the calling peer invokes services/list-peers (the opt-in). The re-export policy (which peers' ops a given peer sees) is an AccessControl decision on the listing op.
4. Capability exposure under PeerRef::Any. When a handler composes via Any and routing picks worker A, the handler's Capabilities propagate to worker A's call (same as today's from_call forwarding). This is correct: the handler declared the op in its scoped env, so it authorized the composition; the peer selection is a routing detail. If a handler needs per-peer capability scoping, it uses PeerRef::Specific and peer-pinned reachability.
5. Multi-hop federation is out of scope for v1. Worker A does not transitively see worker B's ops through the head unless the head explicitly re-exports them. The peer-keyed overlay model extends to multi-hop without redesign (a chain of PeerRef::Specific decisions), but path-finding (which peer reaches which op transitively) is where petgraph would pay off (OQ-32, not designed).
References
- ADR-015: Privilege Model and Authority Context (the authority-switch pattern ADR-028 violated by reintroducing a blanket-bypass flag)
- ADR-017: Call Protocol Client and Adapter Contract (amended: CallClient no longer has trusted_peer; the client/adapter spec updates)
- ADR-022: Handler Registration, Provenance, and Composition Authority (remote_safe field removed from the registration bundle)
- ADR-024: Operation Registry Layering (Layer 2 becomes peer-keyed at the composition-env aggregation level)
- ADR-028: Peer-Scoped Registry Filtering for CallClient Inbound Dispatch (superseded)
- OQ-25: dissolved (no remote_safe marking; AccessControl is the policy)
- OQ-26: stays (AdapterError: a SamePeerCollision variant may replace the flat Conflict variant)
- OQ-27: stays (re-import trigger unchanged; the overlay is now peer-scoped)
- OQ-28: dissolved cross-peer (same name on different peers is fine); stays same-peer
- OQ-29: stays (TLS client-auth, orthogonal to the routing model)
- OQ-30: PeerRef::Any routing policy (new; round-robin/least-loaded)
- OQ-31: services/list-peers re-export semantics (new)
- OQ-32: Multi-hop federation (new; petgraph candidate)
- Research: docs/research/alknet-call-peer-routing/findings.md
- Prior art: Ray.io actors (ActorHandle = PeerRef::Specific), Dapr service invocation (app-ID routing = PeerRef::Specific, access-control allowlist = AccessControl-based peer authorization)