ADR-029: Peer-Graph Routing Model for alknet-call Composition
Status
Accepted (supersedes ADR-028; Assumption 1's PeerId source is superseded
by ADR-030 on the source dimension — the one-way door is preserved)
Context
The call protocol's composition model is flat per overlay and single-peer.
CompositeOperationEnv holds one connection: Option<Arc<dyn OperationEnv>>
overlay; the Layer 2 imported-ops overlay on CallConnection is a flat
HashMap<String, HandlerRegistration> keyed by operation name. This works for
one remote peer. The head→many-workers / hub→spoke pattern (the ray.io model,
and the primary downstream use case — the container-service rewrite this
completion was supposed to unblock) cannot be expressed:
- Overlay collision. A head importing from worker A and worker B, both exposing /container/exec, has no way to route invoke("container", "exec") to the right peer. The composite env holds one connection overlay; even with two, contains("container/exec") is true for both with no disambiguation.
- from_call namespace prefix is a naming-convention hack. DC-3 / OQ-28 made FromCallConfig::namespace_prefix the disambiguation mechanism: the operator prefixes imported op names so two peers' ops don't collide in a flat map. This pushes disambiguation to the caller and into the ScopedOperationEnv { allowed: HashSet<String> } reachability list. It is bolted onto a flat map instead of being structural routing.
- ADR-028's remote_safe: bool + trusted_peer: bool is a second, parallel, weaker authorization system. ADR-028 introduced a RemoteFilter { trusted_peer: bool } gate in protocol/dispatch.rs that runs before the existing AccessControl::check. trusted_peer: true is a blanket security-bypass flag, the exact anti-pattern ADR-015 was written to kill (it replaced trusted: true with the authority-switch model). ADR-028 reintroduced it at the peer boundary. The existing authorization machinery in core (Identity with scopes and resources, IdentityProvider, AccessControl::check) is real, grounded, and already wired into the dispatch path; ADR-028 should have used it for peer authorization, not invented a parallel system.
This is a blocking structural fix, not a "v1/later" refinement. The research
at docs/research/alknet-call-peer-routing/findings.md validates the design
through a POC that type-checks against the real types (since removed; the
shapes are recorded in the research doc). ADR-028 is superseded by this ADR.
Decision
1. Peer-keyed overlays
The Layer 2 overlay becomes peer-keyed at the composition-env level.
CompositeOperationEnv's singular connection: Option<Arc<dyn OperationEnv>>
is replaced by PeerCompositeEnv with peer-keyed connections:
    pub struct PeerCompositeEnv {
        pub base: Arc<dyn OperationEnv + Send + Sync>,                         // Layer 0 curated
        pub session: Option<Arc<dyn OperationEnv + Send + Sync>>,              // Layer 1
        pub connections: HashMap<PeerId, Arc<dyn OperationEnv + Send + Sync>>, // Layer 2, peer-keyed
        connection_order: Vec<PeerId>, // insertion order for PeerRef::Any first-match
    }
The per-CallConnection overlay stays flat (one connection = one peer — a
flat HashMap<String, HandlerRegistration> per connection is correct). The
peer-keying is at the aggregation layer: the head node's composition env
holds a HashMap<PeerId, connection_overlay>, not one overlay. PeerId is
the peer's Identity.id — the same field Connection::identity() already
exposes, already resolved in the dispatch path, and already unique per peer.
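As a rough illustration of the peer-keyed aggregation described above, the sketch below keeps a map of per-peer overlays next to an insertion-order list. `PeerConnections`, `ConnectionOverlay`, `attach_peer`, and `detach_peer` are simplified, hypothetical stand-ins (the real `PeerCompositeEnv` holds `Arc<dyn OperationEnv>` values and may name its methods differently); the point is only that attach and detach keep the map and the order list in step.

```rust
use std::collections::HashMap;

pub type PeerId = String;

/// Hypothetical stand-in for one connection's flat overlay:
/// just the op names that a single peer serves.
#[derive(Debug, Default)]
pub struct ConnectionOverlay {
    pub ops: Vec<String>,
}

/// Simplified model of the Layer 2 part of the composition env.
#[derive(Debug, Default)]
pub struct PeerConnections {
    pub connections: HashMap<PeerId, ConnectionOverlay>,
    pub connection_order: Vec<PeerId>,
}

impl PeerConnections {
    /// Attach a peer's overlay. A peer seen for the first time is appended
    /// to the insertion order; re-attaching replaces the overlay in place.
    pub fn attach_peer(&mut self, peer: PeerId, overlay: ConnectionOverlay) {
        if self.connections.insert(peer.clone(), overlay).is_none() {
            self.connection_order.push(peer);
        }
    }

    /// Drop a peer's overlay on disconnect. Returns false if the peer
    /// was not attached.
    pub fn detach_peer(&mut self, peer: &str) -> bool {
        self.connection_order.retain(|p| p.as_str() != peer);
        self.connections.remove(peer).is_some()
    }
}
```

Two workers that both expose container/exec occupy separate map entries here, so neither overwrites the other, and detaching one leaves the other's overlay untouched.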
2. PeerRef routing selector
OperationEnv gains a peer-routing method with a PeerRef selector. The
default-impl preserves back-compat (existing impls that don't override it
delegate to invoke_with_policy, preserving current behavior):
    pub enum PeerRef {
        Specific(PeerId), // route to this peer; NOT_FOUND if it doesn't serve the op
        Any,              // first peer (insertion order) that serves it
    }

    pub type PeerId = String; // logical id, NOT Identity.id; see OQ-33

    // New OperationEnv trait methods; both default impls preserve current behavior.
    async fn invoke_peer(&self, peer: &PeerRef, namespace: &str, operation: &str,
        input: Value, parent: &OperationContext, policy: AbortPolicy) -> ResponseEnvelope {
        // default: ignore peer selector, dispatch via invoke_with_policy
        self.invoke_with_policy(namespace, operation, input, parent, policy).await
    }

    fn peer_contains(&self, _peer: &PeerId, name: &str) -> bool { self.contains(name) }
PeerRef::Specific(PeerId) routes to the named peer's overlay; if that peer
doesn't serve the op, NOT_FOUND (no silent fallthrough — explicit routing
must be honored or fail loudly). PeerRef::Any routes to the first peer
(insertion order) whose overlay contains the op — the "any worker that serves
this name" fan-out primitive. A richer RoutingPolicy (round-robin,
least-loaded) is the two-way-door remainder tracked as OQ-30; the PeerRef
enum is designed to compose with it without breaking the signature.
The existing invoke() / invoke_with_policy() methods stay as the
PeerRef::Any equivalent for code that doesn't care about peer selection.
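The routing rule above can be sketched as a pure lookup, leaving out the async dispatch. `PeerTable`, `attach`, and `resolve` are hypothetical names for illustration, and overlays are reduced to sets of op names; a `None` result corresponds to the NOT_FOUND outcome.

```rust
use std::collections::{HashMap, HashSet};

pub type PeerId = String;

pub enum PeerRef {
    Specific(PeerId),
    Any,
}

/// Hypothetical stand-in for the peer-keyed Layer 2 state.
#[derive(Default)]
pub struct PeerTable {
    pub overlays: HashMap<PeerId, HashSet<String>>,
    pub order: Vec<PeerId>, // insertion order, used by PeerRef::Any
}

impl PeerTable {
    /// Record a peer and the op names it serves.
    pub fn attach(&mut self, peer: &str, ops: &[&str]) {
        let ops: HashSet<String> = ops.iter().map(|o| o.to_string()).collect();
        if self.overlays.insert(peer.to_string(), ops).is_none() {
            self.order.push(peer.to_string());
        }
    }

    /// Pick the peer that should serve `op`; None models NOT_FOUND.
    pub fn resolve(&self, peer: &PeerRef, op: &str) -> Option<&PeerId> {
        match peer {
            // Explicit routing: the named peer serves the op, or the call fails.
            PeerRef::Specific(id) => {
                let (key, ops) = self.overlays.get_key_value(id)?;
                if ops.contains(op) { Some(key) } else { None }
            }
            // First-match in insertion order.
            PeerRef::Any => self.order.iter().find(|id| {
                self.overlays.get(*id).map_or(false, |ops| ops.contains(op))
            }),
        }
    }
}
```

With worker-a attached before worker-b and both serving container/exec, `Any` resolves to worker-a, while `Specific("worker-a")` for an op that only worker-b serves yields `None` instead of falling through to worker-b.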
3. AccessControl-based peer authorization; retire remote_safe/trusted_peer
RemoteFilter, HandlerRegistration::remote_safe,
CallClient::trusted_peer, OperationRegistry::list_operations_peer_scoped,
and services_list_handler_peer_scoped are removed. Peer authorization
flows through the existing AccessControl::check against the peer's resolved
Identity:
- A remote peer's call arrives → dispatch_requested resolves the peer's Identity (already does, from the connection's TLS fingerprint or the auth_token payload) → OperationRegistry::invoke runs AccessControl::check(peer_identity).
- If the op's AccessControl is satisfied → dispatch (capabilities populated from the bundle, same as today).
- If not → FORBIDDEN (capabilities never populated: the security property ADR-028 wanted, achieved by the existing ACL, not a parallel gate).
- If the op is Visibility::Internal → NOT_FOUND before ACL (existing behavior). This is the "never callable from wire" case.
The three cases remote_safe was meant to handle map to existing mechanisms:
| remote_safe case | Replacement |
|---|---|
| Op callable by any peer (was remote_safe: true) | AccessControl::default(): no restrictions; implicitly "remote-safe" because it requires no privileged scope. |
| Op callable only by some peers | AccessControl { required_scopes: [...] }: only peers whose Identity.scopes satisfy the AND-gate may call. Per-peer differentiation via IdentityProvider config. |
| Op never callable from wire | Visibility::Internal: NOT_FOUND before ACL. Existing mechanism, unchanged. |
The op's AccessControl is the peer-authorization policy. There is no
separate exposure decision. If the peer's Identity satisfies the op's
AccessControl, the op dispatches and capabilities populate (same as for any
authorized caller). If not, FORBIDDEN before the handler — capabilities
never populate. The exposure decision and the authorization decision are the
same decision, made through one mechanism, not two.
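The single-mechanism decision can be sketched as one function that returns the outcome for a given op and peer. The types below are simplified assumptions, not the core crate's definitions: `Decision`, `authorize`, `Identity::with_scopes`, and the `Visibility::Public` variant are hypothetical names, and resources are left out.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Dispatch,  // ACL satisfied: handler runs, capabilities populate
    Forbidden, // ACL not satisfied: handler never runs
    NotFound,  // internal op: invisible from the wire
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Visibility {
    Public, // hypothetical name for "not Internal"
    Internal,
}

#[derive(Default)]
pub struct AccessControl {
    pub required_scopes: Vec<String>,
}

pub struct Identity {
    pub id: String,
    pub scopes: HashSet<String>,
}

impl Identity {
    pub fn with_scopes(id: &str, scopes: &[&str]) -> Self {
        Identity {
            id: id.to_string(),
            scopes: scopes.iter().map(|s| s.to_string()).collect(),
        }
    }
}

/// Visibility is checked first, then the AND-gate over required scopes.
pub fn authorize(visibility: Visibility, acl: &AccessControl, peer: &Identity) -> Decision {
    if visibility == Visibility::Internal {
        return Decision::NotFound;
    }
    if acl.required_scopes.iter().all(|s| peer.scopes.contains(s)) {
        Decision::Dispatch
    } else {
        Decision::Forbidden
    }
}
```

An empty `required_scopes` list (the `AccessControl::default()` row in the table) passes the AND-gate trivially, which is why such an op is callable by any peer that resolved to an identity.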
4. Peer-qualified reachability (ScopedPeerEnv)
ScopedOperationEnv { allowed: HashSet<String> } is extended with an optional
peer-pinned allowlist. Unqualified reachability (peer-agnostic composition —
"I want to call container/exec on whichever worker serves it") stays the
common case; peer-pinning is opt-in for the disambiguation case that replaces
FromCallConfig::namespace_prefix:
    pub struct ScopedPeerEnv {
        pub allowed_ops: HashSet<String>, // peer-agnostic: reachable via PeerRef::Any
        pub peer_pinned: HashSet<String>, // "peer-id/op-name": reachable only via PeerRef::Specific(that peer)
    }
Instead of prefixing the op name (the flat-namespace hack), you pin the
peer in the reachability set. The existing ScopedOperationEnv.allowed
becomes the allowed_ops field; peer-pinning is additive.
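A minimal sketch of how a reachability check over these two sets might look. The `permits` and `new` methods are hypothetical, and the sketch takes a strict reading that this ADR does not spell out: a `PeerRef::Specific` call is allowed only when the "peer-id/op-name" pair is pinned, even if the op also appears in `allowed_ops`.

```rust
use std::collections::HashSet;

pub type PeerId = String;

pub enum PeerRef {
    Specific(PeerId),
    Any,
}

pub struct ScopedPeerEnv {
    pub allowed_ops: HashSet<String>, // peer-agnostic
    pub peer_pinned: HashSet<String>, // "peer-id/op-name"
}

impl ScopedPeerEnv {
    /// Hypothetical constructor taking op names and (peer, op) pins.
    pub fn new(allowed: &[&str], pinned: &[(&str, &str)]) -> Self {
        ScopedPeerEnv {
            allowed_ops: allowed.iter().map(|op| op.to_string()).collect(),
            peer_pinned: pinned
                .iter()
                .map(|(peer, op)| format!("{}/{}", peer, op))
                .collect(),
        }
    }

    /// Assumed rule: Any consults allowed_ops, Specific consults peer_pinned.
    pub fn permits(&self, peer: &PeerRef, op: &str) -> bool {
        match peer {
            PeerRef::Any => self.allowed_ops.contains(op),
            PeerRef::Specific(id) => self.peer_pinned.contains(&format!("{}/{}", id, op)),
        }
    }
}
```

A handler scoped with `allowed_ops = {container/exec}` and a pin on `worker-b/gpu/train` can reach container/exec on whichever worker serves it, and gpu/train only on worker-b, without renaming either op.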
5. from_call peer-keyed registration; collision rule change
from_call registers into the specific peer's sub-overlay, not a flat
overlay. Cross-peer collision dissolves: same name on different peers is fine
(separate sub-overlays, no collision, no prefix needed). Same-peer collision
stays an error (a peer shouldn't expose two ops with the same name).
FromCallConfig::namespace_prefix becomes optional local-naming sugar for
the case where the importing node wants to expose a peer's ops under a
different name locally — a local-naming concern, not a disambiguation
concern. It defaults to None.
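The collision rule can be sketched as registration into a per-peer set of op names. `PeerOverlays` and `register` are hypothetical, and while `AdapterError::SamePeerCollision` is the variant named in this ADR's references, the fields shown on it are an assumption.

```rust
use std::collections::{HashMap, HashSet};

pub type PeerId = String;

#[derive(Debug, PartialEq, Eq)]
pub enum AdapterError {
    // Variant name from OQ-26; the fields are assumed for illustration.
    SamePeerCollision { peer: PeerId, op: String },
}

/// Hypothetical stand-in: op names registered per peer sub-overlay.
#[derive(Default)]
pub struct PeerOverlays {
    pub by_peer: HashMap<PeerId, HashSet<String>>,
}

impl PeerOverlays {
    /// Register `op` under `peer`. The same name on another peer is fine;
    /// the same name twice on one peer is an error.
    pub fn register(&mut self, peer: &str, op: &str) -> Result<(), AdapterError> {
        let ops = self.by_peer.entry(peer.to_string()).or_default();
        if !ops.insert(op.to_string()) {
            return Err(AdapterError::SamePeerCollision {
                peer: peer.to_string(),
                op: op.to_string(),
            });
        }
        Ok(())
    }
}
```

Registering container/exec for worker-a and then for worker-b succeeds both times; only a second container/exec on worker-a is rejected.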
6. services/list AccessControl-filtered; services/list-peers opt-in
services/list filters by AccessControl::check(calling_peer_identity) — the
calling peer sees only ops it is authorized to call. The
services_list_handler / services_list_handler_peer_scoped split collapses
to a single AccessControl-filtered handler. services/list-peers is the
opt-in for peer-attributed re-export listing (each peer's sub-overlay listed
with attribution, filtered by the calling peer's authorization).
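A sketch of the filtered listing, assuming a flattened view of each registered op. `OpEntry` and `list_for_peer` are hypothetical names, scopes are plain strings, and hiding internal ops from the listing is an assumption that follows from their NOT_FOUND treatment on invoke.

```rust
/// Hypothetical flattened view of a registered op.
pub struct OpEntry {
    pub name: String,
    pub required_scopes: Vec<String>,
    pub internal: bool, // stands in for Visibility::Internal
}

impl OpEntry {
    pub fn new(name: &str, required_scopes: &[&str], internal: bool) -> Self {
        OpEntry {
            name: name.to_string(),
            required_scopes: required_scopes.iter().map(|s| s.to_string()).collect(),
            internal,
        }
    }
}

/// Names of the ops the calling peer may invoke, in registration order.
pub fn list_for_peer<'a>(ops: &'a [OpEntry], peer_scopes: &[&str]) -> Vec<&'a str> {
    ops.iter()
        // Internal ops are never visible from the wire.
        .filter(|op| !op.internal)
        // AND-gate: every required scope must be held by the caller.
        .filter(|op| {
            op.required_scopes
                .iter()
                .all(|need| peer_scopes.iter().any(|have| *have == need.as_str()))
        })
        .map(|op| op.name.as_str())
        .collect()
}
```

Because the listing and the invoke path apply the same predicate, a peer never sees an op it would be refused, and an unrestricted op is listed to every peer.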
Consequences
Positive:
- The head→N-workers pattern works. A head with multiple worker connections routes invoke() to the right peer via PeerRef. This is the primary use case the previous model couldn't express.
- One authorization system, not two. Peer authorization flows through the existing AccessControl / Identity machinery, the same mechanism that gates every other call. No parallel remote_safe gate, no blanket-bypass trusted_peer flag. Per-peer differentiation is via IdentityProvider config (different peers get different scopes), which is a real authorization decision, not a boolean.
- Structural disconnect cleanup. When a peer disconnects, its sub-overlay drops (the PeerId key is removed from connections). No stale overlay, no explicit deregistration. An in-flight PeerRef::Specific(that_peer) gets NOT_FOUND, the correct failure mode.
- from_call collision dissolves across peers. Two workers exposing /container/exec coexist; the prefix is no longer the disambiguation mechanism.
- The OperationEnv trait gains a method with a default-impl, preserving back-compat. Existing impls (LocalOperationEnv, OverlayOperationEnv) work unchanged; PeerCompositeEnv overrides with real peer routing.
- The peer-keyed overlay model extends naturally to multi-hop federation (a chain of PeerRef::Specific routing decisions) without redesign. Petgraph is not needed for v1 (one-hop, shallow); it pays off if multi-hop path-finding becomes real (OQ-32).
Negative:
- CompositeOperationEnv → PeerCompositeEnv is a migration. Existing call sites that construct CompositeOperationEnv::new(base, Some(conn), session) migrate to PeerCompositeEnv::new(base).with_session(session).attach_peer(peer_id, conn). The singular-connection case (one peer) is the degenerate case (connections with one entry).
- OperationEnv trait gains a method. The default-impl preserves back-compat, but it's a trait surface change; downstream impls (alknet-http, alknet-agent) gain the method with the default delegation.
- services/list semantics change: the filter is AccessControl-based, not remote_safe-based. An op with AccessControl::default() (no restrictions) is now listed to any peer. This is correct (it's implicitly callable by any authenticated peer), but operators who relied on remote_safe: false to hide ops from peers must instead set required_scopes or Visibility::Internal.
- ADR-028 is superseded. The remote_safe field, trusted_peer flag, RemoteFilter, list_operations_peer_scoped, and services_list_handler_peer_scoped are removed. Code that references them (the CallClient, Dispatcher, HandlerRegistration, discovery.rs) changes. This is the cost of fixing a one-way-door miss: the previous model shipped and was reviewed before the structural gap was caught.
- PeerId is a logical identifier, not Identity.id (the fingerprint or API-key prefix). Coupling PeerId to the crypto material would break every in-flight PeerRef::Specific and every ACL entry referencing that peer on key rotation. v1 uses a connection-assigned UUID; a configured node name is the future shape. See OQ-33 for the full decision and the key-rotation/ACL rationale.
Assumptions
- PeerId is a logical identifier, not Identity.id. v1 source is a connection-assigned UUID (v4): stable for the connection's lifetime, changes on reconnect. This is a no-storage workaround: the core crates are deliberately DB-free (smaller, fewer deps), which works for local-only state but not for cross-node peer identity that wants to persist across restarts and key rotations. An in-flight PeerRef::Specific(stale_uuid) gets NOT_FOUND on reconnect, the correct failure mode (the peer is gone); re-from_call produces a fresh PeerRef. The real solution (a persistent peer registry that maps a stable logical name to current crypto material, surviving key rotation) is tracked as OQ-34, not a v1 blocker. The one-way door: PeerId is logical, not crypto; this determines the PeerCompositeEnv key type and PeerRef::Specific payload. See OQ-33.

  Superseded by ADR-030 on the PeerId source dimension. The one-way door (PeerId is logical, not crypto) is preserved. The v1 UUID source is replaced by Identity.id from PeerEntry.peer_id (stable across key rotation). The "no-storage workaround" framing is no longer accurate: the storage boundary is now config + in-memory adapter (ADR-030 + ADR-033), with persistence adapters additive. See ADR-030 and OQ-33 (resolved).
- PeerRef::Any = insertion-order first-match. Deterministic but order-dependent (worker A connects before worker B → Any routes to A until A disconnects). This is the simplest routing policy and is correct for the immediate use case (the head picks the first worker that serves the op). A richer RoutingPolicy (round-robin, least-loaded, affinity) is OQ-30; the PeerRef enum composes with it without breaking the signature.
- services/list defaults to "own ops only" (unchanged from today). Re-exported peer ops are not listed unless the calling peer invokes services/list-peers (the opt-in). The re-export policy (which peers' ops a given peer sees) is an AccessControl decision on the listing op.
- Capability exposure under PeerRef::Any. When a handler composes via Any and routing picks worker A, the handler's Capabilities propagate to worker A's call (same as today's from_call forwarding). This is correct: the handler declared the op in its scoped env, so it authorized the composition; the peer selection is a routing detail. If a handler needs per-peer capability scoping, it uses PeerRef::Specific and peer-pinned reachability.
- Multi-hop federation is out of scope for v1. Worker A does not transitively see worker B's ops through the head unless the head explicitly re-exports them. The peer-keyed overlay model extends to multi-hop without redesign (a chain of PeerRef::Specific decisions), but path-finding (which peer reaches which op transitively) is where petgraph would pay off (OQ-32, not designed).
References
- ADR-015: Privilege Model and Authority Context (the authority-switch pattern ADR-028 violated by reintroducing a blanket-bypass flag)
- ADR-017: Call Protocol Client and Adapter Contract (amended: CallClient no longer has trusted_peer; the client/adapter spec updates)
- ADR-022: Handler Registration, Provenance, and Composition Authority (remote_safe field removed from the registration bundle)
- ADR-024: Operation Registry Layering (Layer 2 becomes peer-keyed at the composition-env aggregation level)
- ADR-028: Peer-Scoped Registry Filtering for CallClient Inbound Dispatch (superseded)
- OQ-25: dissolved (no remote_safe marking; AccessControl is the policy)
- OQ-26: resolved (AdapterError variants: SamePeerCollision replaces the flat Conflict variant; #[non_exhaustive])
- OQ-27: stays (re-import trigger, unchanged; the overlay is now peer-scoped)
- OQ-28: dissolved cross-peer (same name on different peers is fine); stays same-peer
- OQ-29: stays (TLS client-auth, orthogonal to the routing model)
- OQ-30: PeerRef::Any routing policy (new; round-robin/least-loaded)
- OQ-31: services/list-peers re-export semantics (new)
- OQ-32: Multi-hop federation (new; petgraph candidate)
- OQ-33: resolved. PeerId is a logical id (a connection-assigned UUID in v1), not Identity.id; decoupling from crypto material keeps the door open for key-rotation-safe ACLs
- OQ-34: persistent peer registry (new; the storage dimension OQ-33 surfaced; not a v1 blocker, tracked so the no-DB posture's limit is deliberate)
- Research: docs/research/alknet-call-peer-routing/findings.md
- Prior art: Ray.io actors (ActorHandle = PeerRef::Specific), Dapr service invocation (app-ID routing = PeerRef::Specific, access-control allowlist = AccessControl-based peer authorization)