From 99c6dd9483c102168c8a6ec10f3cdd5e8ba7660c Mon Sep 17 00:00:00 2001
From: "glm-5.2"
Date: Sat, 27 Jun 2026 06:34:35 +0000
Subject: [PATCH] docs(arch): resolve OQ-26 (AdapterError variants) + OQ-33 (PeerId = logical id); add OQ-34 (persistent peer registry)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

OQ-26 (resolved): AdapterError variants decided — DiscoveryFailed,
SchemaParse, Transport, Unauthorized, SamePeerCollision (replaces flat
Conflict per ADR-029 §5). #[non_exhaustive] for downstream extension.
Two-way door; the initial set is the code's return type.

OQ-33 (resolved): PeerId is a logical identifier, NOT Identity.id. The
research's v1 default (PeerId = fingerprint) is overridden: coupling
PeerId to crypto material breaks every in-flight PeerRef::Specific and
every ACL entry on key rotation. v1 source is a connection-assigned
UUID — a no-storage workaround that works for the immediate use case
(head→workers, reconnect produces fresh PeerRef, in-flight gets
NOT_FOUND which is correct). The one-way door: PeerId is logical, not
crypto — this determines PeerCompositeEnv key type and
PeerRef::Specific payload. The id source (UUID vs configured name vs
peer registry) is the two-way-door remainder.

OQ-34 (new): the storage dimension OQ-33 surfaced. The core crates are
deliberately DB-free (smaller, fewer deps, simpler testing) — this
served local-only state (vault, registry) well, but peer identity is
the first cross-node state that wants persistence. The real solution
(a persistent peer registry mapping stable logical name → current
crypto material, surviving key rotation) is not a v1 blocker (UUID
works), but tracked so the no-DB posture's limit is deliberate, not
accidental. The storage boundary (core gets a PeerRegistry trait vs
stays storage-free) is the one-way door; the backend choice is two-way.
Key-rotation/ACL note: decoupling PeerId from crypto keeps the door open for ACL entries that persist across key rotation — when the peer registry is built, ACLs key on the logical name and key rotation becomes vault-only with no remote-side ACL update. --- docs/architecture/README.md | 4 +- docs/architecture/crates/call/README.md | 4 +- .../crates/call/client-and-adapters.md | 15 +- .../decisions/029-peer-graph-routing-model.md | 41 +++-- docs/architecture/open-questions.md | 141 ++++++++++++++++-- 5 files changed, 167 insertions(+), 38 deletions(-) diff --git a/docs/architecture/README.md b/docs/architecture/README.md index 96b73a0..7c74acb 100644 --- a/docs/architecture/README.md +++ b/docs/architecture/README.md @@ -100,13 +100,15 @@ See [open-questions.md](open-questions.md) for the full tracker. **Open (two-way-door remainders from alknet-call completion + peer-graph routing):** - **OQ-25**: ~~Remote-safe marking shape~~ — **dissolved by ADR-029** (no marking; peer authorization is `AccessControl::check(peer_identity)`) -- **OQ-26**: `OperationAdapter` error type — `import()` returns `Result<_, AdapterError>`; variants decided in implementation +- **OQ-26**: ~~`OperationAdapter` error type~~ — **resolved** (`AdapterError` variants: `DiscoveryFailed`, `SchemaParse`, `Transport`, `Unauthorized`, `SamePeerCollision`; `#[non_exhaustive]`) - **OQ-27**: `from_call` re-import trigger — v1 default auto-on-reconnect; explicit `refresh()` additive - **OQ-28**: `from_call` namespace collision — cross-peer **dissolved by ADR-029** (separate sub-overlays); same-peer stays error - **OQ-29**: `CallClient` TLS client-auth — v1 `with_no_client_auth()` + `AcceptAnyServerCertVerifier`; wiring RawKey client-auth is additive - **OQ-30**: `PeerRef::Any` routing policy — v1 insertion-order first-match; round-robin/least-loaded is future (ADR-029) - **OQ-31**: `services/list-peers` re-export semantics — v1 "own ops only"; `services/list-peers` is opt-in (ADR-029) - **OQ-32**: 
Multi-hop federation — v1 one-hop; peer-keyed model extends without redesign; petgraph candidate (ADR-029) +- **OQ-33**: ~~PeerId stability~~ — **resolved** (logical id, not `Identity.id`; v1 UUID, decoupled from crypto material for key-rotation-safe ACLs) +- **OQ-34**: Persistent peer registry — the storage dimension OQ-33 surfaced; not a v1 blocker (UUID works); tracked so the no-DB posture's limit is deliberate **Deferred (not active):** - **OQ-09**: WASM target boundaries — design constraint, not deliverable diff --git a/docs/architecture/crates/call/README.md b/docs/architecture/crates/call/README.md index 2546704..3cff9d9 100644 --- a/docs/architecture/crates/call/README.md +++ b/docs/architecture/crates/call/README.md @@ -51,13 +51,15 @@ Structured RPC over QUIC: operations, request/response, streaming subscriptions, | OQ-16 | Safe vault operations for call protocol exposure | resolved (ADR-014) | None exposed for now | | OQ-19 | Session-scoped operation registries | resolved | Agent-written operations overlaid on curated registry via `OperationEnv` trait layering. Protocol doesn't need changes; `OperationEnv` must remain a trait. Generalized by ADR-024 to cover connection-scoped overlays. 
| | OQ-25 | ~~Remote-safe marking shape~~ | **dissolved** (ADR-029) | `remote_safe`/`trusted_peer` retired; peer authorization is `AccessControl::check(peer_identity)` | -| OQ-26 | OperationAdapter error type (AdapterError variants) | open (two-way) | `import()` returns `Result<_, AdapterError>`; variants decided in implementation | +| OQ-26 | OperationAdapter error type (AdapterError variants) | **resolved** | `DiscoveryFailed`, `SchemaParse`, `Transport`, `Unauthorized`, `SamePeerCollision`; `#[non_exhaustive]` | | OQ-27 | from_call re-import trigger | open (two-way) | v1 default: auto-on-reconnect; explicit `refresh()` additive | | OQ-28 | from_call namespace collision | cross-peer **dissolved** (ADR-029) / same-peer stays | Cross-peer: separate sub-overlays, no collision. Same-peer: error. `namespace_prefix` is local-naming sugar | | OQ-29 | CallClient TLS client-auth and remote-identity verification | open (two-way) | v1 `with_no_client_auth()` + `AcceptAnyServerCertVerifier`; wiring RawKey client-auth is additive (orthogonal to ADR-029) | | OQ-30 | `PeerRef::Any` routing policy | open (two-way) | v1 insertion-order first-match; round-robin/least-loaded is future (ADR-029) | | OQ-31 | `services/list-peers` re-export semantics | open (two-way) | v1 "own ops only"; `services/list-peers` is opt-in (ADR-029) | | OQ-32 | Multi-hop federation | open | v1 one-hop; peer-keyed model extends without redesign; petgraph candidate (ADR-029) | +| OQ-33 | PeerId — crypto identity vs stable logical id | **resolved** | Logical id (v1: UUID v4), not `Identity.id`; decoupled from crypto for key-rotation-safe ACLs | +| OQ-34 | Persistent peer registry (cross-node state storage) | open | Not a v1 blocker (UUID works); the no-DB posture's limit, tracked for deliberate future decision | ## Key Design Principles diff --git a/docs/architecture/crates/call/client-and-adapters.md b/docs/architecture/crates/call/client-and-adapters.md index 366b71f..12d1a62 100644 --- 
a/docs/architecture/crates/call/client-and-adapters.md +++ b/docs/architecture/crates/call/client-and-adapters.md @@ -173,7 +173,7 @@ pub struct PeerCompositeEnv { pub connections: HashMap>, // Layer 2, peer-keyed connection_order: Vec, // insertion order for PeerRef::Any first-match } -pub type PeerId = String; // = Identity.id +pub type PeerId = String; // logical id (v1: UUID v4), NOT Identity.id — see OQ-33 ``` `OperationEnv` gains a peer-routing method with a `PeerRef` selector @@ -608,10 +608,9 @@ See [open-questions.md](../../open-questions.md) for full details. - **OQ-25** (dissolved by ADR-029): `remote_safe` marking shape — moot. `remote_safe`/`trusted_peer` are retired; peer authorization is `AccessControl::check(peer_identity)`. No marking to shape. -- **OQ-26** (open, two-way): `AdapterError` enum variants (DC-4). The - *presence* of an error type is recorded here; the variants are - implementation-detail. A `SamePeerCollision` variant may replace the flat - `Conflict` variant (ADR-029 §5). +- **OQ-26** (resolved): `AdapterError` variants — `DiscoveryFailed`, + `SchemaParse`, `Transport`, `Unauthorized`, `SamePeerCollision` + (replaces flat `Conflict`). `#[non_exhaustive]`. - **OQ-27** (open, two-way): `from_call` re-import trigger — auto-on-reconnect (v1 default, recorded here) vs explicit `CallConnection::refresh()`. v1 is auto-on-reconnect; the explicit path is additive. The overlay is now @@ -632,6 +631,12 @@ See [open-questions.md](../../open-questions.md) for full details. - **OQ-32** (open): Multi-hop federation — v1 is one-hop; the peer-keyed overlay model extends to multi-hop without redesign; petgraph is the candidate if path-finding becomes real (ADR-029 §3.7). +- **OQ-33** (resolved): `PeerId` is a logical id (connection-assigned UUID), + not `Identity.id` — decoupling from crypto material keeps the door open for + key-rotation-safe ACLs. See OQ-33 in open-questions.md. 
+- **OQ-34** (open): Persistent peer registry — the storage dimension OQ-33 + surfaced; not a v1 blocker (UUID works), tracked so the no-DB posture's + limit is deliberate. See OQ-34 in open-questions.md. ## References diff --git a/docs/architecture/decisions/029-peer-graph-routing-model.md b/docs/architecture/decisions/029-peer-graph-routing-model.md index 9dfcd03..80182da 100644 --- a/docs/architecture/decisions/029-peer-graph-routing-model.md +++ b/docs/architecture/decisions/029-peer-graph-routing-model.md @@ -79,7 +79,7 @@ pub enum PeerRef { Specific(PeerId), // route to this peer; NOT_FOUND if it doesn't serve the op Any, // first peer (insertion order) that serves it } -pub type PeerId = String; // = Identity.id +pub type PeerId = String; // logical id, NOT Identity.id — see OQ-33 async fn invoke_peer(&self, peer: &PeerRef, namespace: &str, operation: &str, input: Value, parent: &OperationContext, policy: AbortPolicy) -> ResponseEnvelope { @@ -221,22 +221,27 @@ with attribution, filtered by the calling peer's authorization). (the `CallClient`, `Dispatcher`, `HandlerRegistration`, `discovery.rs`) changes. This is the cost of fixing a one-way-door miss — the previous model shipped and was reviewed before the structural gap was caught. -- `PeerId = Identity.id` (the fingerprint) is not stable across key rotation. - A peer that rotates its TLS key gets a new `PeerId`; in-flight - `PeerRef::Specific(old_id)` gets `NOT_FOUND` after reconnect. For the - immediate use case (head→workers where the operator controls key rotation), - this is acceptable. A stable logical node name decoupled from cryptographic - identity is the cleaner long-term shape (assumption 1). +- `PeerId` is a logical identifier, **not** `Identity.id` (the fingerprint or + API-key prefix). Coupling `PeerId` to the crypto material would break every + in-flight `PeerRef::Specific` and every ACL entry referencing that peer on + key rotation. 
v1 uses a connection-assigned UUID; a configured node name is + the future shape. See OQ-33 for the full decision and the key-rotation/ACL + rationale. ## Assumptions -1. **`PeerId = Identity.id` (the fingerprint).** Reconnects with a rotated key - change the `PeerId`; the peer-keyed overlay drops the old `PeerId`'s - sub-overlay and creates a new one. An in-flight `PeerRef::Specific(old_id)` - gets `NOT_FOUND`. This is acceptable for v1 (operator-controlled key - rotation in the head→workers pattern). A stable logical node name separate - from the cryptographic identity is a future question; the peer-keyed overlay - model accommodates it by changing what `PeerId` aliases, not by redesign. +1. **`PeerId` is a logical identifier, not `Identity.id`.** v1 source is a + connection-assigned UUID (v4) — stable for the connection's lifetime, + changes on reconnect. This is a no-storage workaround: the core crates are + deliberately DB-free (smaller, fewer deps), which works for local-only + state but not for cross-node peer identity that wants to persist across + restarts and key rotations. An in-flight `PeerRef::Specific(stale_uuid)` + gets `NOT_FOUND` on reconnect — the correct failure mode (the peer is + gone); re-`from_call` produces a fresh `PeerRef`. The real solution (a + persistent peer registry that maps a stable logical name to current crypto + material, surviving key rotation) is tracked as OQ-34, not a v1 blocker. + The one-way door: `PeerId` is logical, not crypto — this determines the + `PeerCompositeEnv` key type and `PeerRef::Specific` payload. See OQ-33. 2. **`PeerRef::Any` = insertion-order first-match.** Deterministic but order-dependent (worker A connects before worker B → `Any` routes to A @@ -278,8 +283,8 @@ with attribution, filtered by the calling peer's authorization). 
- ADR-028: Peer-Scoped Registry Filtering for CallClient Inbound Dispatch (superseded) - OQ-25: dissolved (no `remote_safe` marking — `AccessControl` is the policy) -- OQ-26: stays (`AdapterError` — a `SamePeerCollision` variant may replace - the flat `Conflict` variant) +- OQ-26: resolved (`AdapterError` variants — `SamePeerCollision` replaces + the flat `Conflict` variant; `#[non_exhaustive]`) - OQ-27: stays (re-import trigger — unchanged; the overlay is now peer-scoped) - OQ-28: dissolved cross-peer (same name on different peers is fine); stays same-peer @@ -287,6 +292,10 @@ with attribution, filtered by the calling peer's authorization). - OQ-30: `PeerRef::Any` routing policy (new — round-robin/least-loaded) - OQ-31: `services/list-peers` re-export semantics (new) - OQ-32: Multi-hop federation (new — petgraph candidate) +- OQ-33: resolved — `PeerId` is a logical id (v1: UUID v4), not `Identity.id`; + decoupling from crypto material keeps the door open for key-rotation-safe ACLs +- OQ-34: persistent peer registry (new — the storage dimension OQ-33 surfaced; + not a v1 blocker, tracked so the no-DB posture's limit is deliberate) - Research: `docs/research/alknet-call-peer-routing/findings.md` - Prior art: Ray.io actors (`ActorHandle` = `PeerRef::Specific`), Dapr service invocation (app-ID routing = `PeerRef::Specific`, access-control allowlist = diff --git a/docs/architecture/open-questions.md b/docs/architecture/open-questions.md index 7360530..066bf8b 100644 --- a/docs/architecture/open-questions.md +++ b/docs/architecture/open-questions.md @@ -349,22 +349,26 @@ revisited during implementation without a new ADR. 
### OQ-26: OperationAdapter Error Type (AdapterError Variants) -- **Origin**: [client-and-adapters.md](crates/call/client-and-adapters.md), ADR-017 §5 -- **Status**: open +- **Origin**: [client-and-adapters.md](crates/call/client-and-adapters.md), ADR-017 §5, [ADR-029](decisions/029-peer-graph-routing-model.md) §5 +- **Status**: **resolved** (2026-06-27) - **Door type**: Two-way - **Priority**: medium -- **Resolution**: ADR-017 §5 showed `async fn import(&self) -> - Vec` with no error type. The trait returns - `Result, AdapterError>` where `AdapterError` is a - crate-level enum. The *presence* of an error type is recorded in - [client-and-adapters.md](crates/call/client-and-adapters.md); the exact - variants are the two-way-door remainder. The failure modes real - implementations hit: discovery transport failure (`from_call` remote - unreachable), schema parse failure (`from_openapi`, `from_jsonschema`), - unauthorized (HTTP 401 for `from_openapi`, `from_mcp`). Likely variants: - `DiscoveryFailed`, `SchemaParse`, `Transport`, `Unauthorized`. Decided - during implementation; recorded here, not in a full ADR. -- **Cross-references**: ADR-017, [client-and-adapters.md](crates/call/client-and-adapters.md) +- **Resolution**: The `AdapterError` enum is `#[non_exhaustive]` + + `thiserror::Error`, with these v1 variants: + - `DiscoveryFailed { message: String }` — `from_call` remote unreachable / `services/list` failed + - `SchemaParse { message: String }` — `from_openapi` / `from_jsonschema` couldn't parse the spec + - `Transport { message: String }` — underlying transport error (QUIC for `from_call`, HTTP for `from_openapi`/`from_mcp`) + - `Unauthorized { message: String }` — HTTP 401 for `from_openapi`/`from_mcp`, auth rejected for `from_call` + - `SamePeerCollision { message: String }` — namespace collision *within a single peer* (ADR-029 §5: cross-peer collision dissolves; same-peer collision stays an error). 
Replaces the flat `Conflict` variant from the pre-ADR-029 implementation. + + `#[non_exhaustive]` lets `alknet-http`'s adapters extend without breaking + match arms. The variant payloads are `String` messages — kept simple and + `Send + Sync` by construction. This matches the shipped implementation + (`crates/alknet-call/src/client/mod.rs`) except `Conflict` → + `SamePeerCollision` (the ADR-029 migration renames it). Two-way door: + adding variants later is non-breaking; renaming a variant is a match-arm + update but not an architectural change. +- **Cross-references**: ADR-017, ADR-029, [client-and-adapters.md](crates/call/client-and-adapters.md) ### OQ-27: from_call Re-Import Trigger @@ -485,4 +489,111 @@ revisited during implementation without a new ADR. suffices. Whether multi-hop federation becomes a real use case is a future decision; the peer-keyed model does not foreclose it. Not designed; tracked here so the v1 model's extendability is recorded. -- **Cross-references**: ADR-029, [client-and-adapters.md](crates/call/client-and-adapters.md) \ No newline at end of file +- **Cross-references**: ADR-029, [client-and-adapters.md](crates/call/client-and-adapters.md) + +### OQ-33: PeerId — Cryptographic Identity vs Stable Logical Identifier + +- **Origin**: [ADR-029](decisions/029-peer-graph-routing-model.md) Assumption 1, `docs/research/alknet-call-peer-routing/findings.md` §6.1 +- **Status**: **resolved** (2026-06-27) +- **Door type**: One-way (composition semantics), two-way (id source) +- **Priority**: high +- **Resolution**: `PeerId` is a **logical identifier, decoupled from the + cryptographic identity**. It is *not* `Identity.id` (the TLS fingerprint or + API-key prefix) — those change on key rotation, which would break every + in-flight `PeerRef::Specific` and every ACL entry referencing that peer. + + **v1 source**: connection-assigned UUID (v4) at `connect()`/`accept()` time. + Stable for the connection's lifetime; changes on reconnect. 
This is a + **no-storage workaround** — the project has deliberately avoided a DB + backend for the core crates (smaller, fewer deps, simpler testing), which + has served the local-only crates (vault, registry) well. But peer identity + is the first *cross-node* state that wants persistence: what we actually + want is a persistent mapping from a logical peer identity to its current + cryptographic material, updated on key rotation, surviving restarts. + Without a DB, the UUID is the least-bad ephemeral option — the failure + mode (in-flight `PeerRef::Specific` gets `NOT_FOUND` on reconnect) is + acceptable for v1, and the re-`from_call` produces a fresh `PeerRef`. + + **The real solution (future, tracked as OQ-34):** a persistent peer + registry — a mapping from a stable logical peer identity (configured node + name or registered identity) to its current cryptographic material, + persisted across restarts and key rotations. This is what makes the + ACL-stability concern below work correctly: the ACL entry keys on the + logical name, the peer registry tracks the current crypto identity for + that name, and key rotation becomes a vault-only operation with no ACL + update on the remote side. The no-DB posture of the core crates means + this registry lives outside the core — likely in a service crate or an + assembly-layer store — not in alknet-call itself. See OQ-34. + + **Key-rotation / ACL note (context for the future, not a v1 decision):** + if `PeerId` were the fingerprint, rotating a node's TLS key would change + its `PeerId`, invalidating every ACL entry that references that peer. The + vault makes local key rotation easy (derive a new key, re-encrypt, + ADR-021); the problem is the *remote* side's ACL — the hub's + `authorized_fingerprints` / `AccessControl` entries that reference the old + fingerprint. 
Decoupling `PeerId` from the crypto material means the ACL + entry *can* persist across key rotation — but only if there's a store that + maps the logical name to the new crypto identity after rotation. That + store is OQ-34. The v1 decision (logical id, not crypto; UUID source) + keeps the door open for it without requiring it now. + + **The one-way door:** `PeerId` is a logical id, not `Identity.id`. This + determines the `PeerCompositeEnv` key type, the `PeerRef::Specific` + payload type, and the `ScopedPeerEnv.peer_pinned` entry shape. Reversing + it (switching to `Identity.id`) would break the peer-keyed overlay, the + routing selector, and the reachability set simultaneously. The *source* of + the logical id (UUID now, peer registry later) is the two-way-door + remainder — switching from UUID to a persistent registry changes the + id-generation path, not the composition model. +- **Cross-references**: ADR-009, ADR-014, ADR-015, ADR-017, ADR-021, ADR-027, + ADR-029, OQ-34, [client-and-adapters.md](crates/call/client-and-adapters.md), + [operation-registry.md](crates/call/operation-registry.md), + [auth.md](crates/core/auth.md) + +### OQ-34: Persistent Peer Registry (Cross-Node State Storage) + +- **Origin**: OQ-33 (the storage dimension it surfaced), the no-DB posture of ADR-008/018/025 +- **Status**: open +- **Door type**: One-way (storage boundary), two-way (backend choice) +- **Priority**: medium (not a v1 blocker — UUID works for v1; becomes real + when key rotation across nodes or peer-attribution persistence matters) +- **Resolution**: The core crates (alknet-core, alknet-call, alknet-vault) + are deliberately storage-free — no DB, no persistence layer, in-memory + state only. This has kept the core small and testable, and it works for + local-only state (vault key rotation is version-indexed paths, no DB + needed, ADR-021). 
**Peer identity is the first cross-node state that + wants persistence**: a stable logical peer identity mapped to its current + cryptographic material, surviving restarts and key rotations. The v1 + workaround (OQ-33: connection-assigned UUID) is ephemeral — it works for + the immediate use case (head→workers, operator-controlled, reconnects + produce a fresh UUID) but doesn't support ACL entries that persist across + key rotation, because there's nowhere to store "worker-a's current crypto + identity is X." + + **What this OQ tracks (not designed, not a v1 decision):** + - Whether a persistent peer registry belongs in a service crate (e.g., an + `alknet-registry` or `alknet-peer-store`), in the assembly layer (a + SQLite file the binary owns), or as a new alknet-core abstraction + (a `PeerRegistry` trait with no built-in impl, like `IdentityProvider`). + - Whether the no-DB posture extends to "core has a trait, service has the + impl" (the `IdentityProvider` pattern) or stays "core is storage-free, + persistence is entirely outside the crate graph." + - The backend choice (SQLite, a key-value store, a config file) is the + two-way-door remainder; the *storage boundary* (does core know about + persistence at all?) is the one-way door. + + **Why this is a one-way door on the storage boundary, not a two-way door:** + if core gains a `PeerRegistry` trait, downstream crates depend on it and + the trait shape becomes a contract. If core stays storage-free, the + registry lives in a service crate and core never knows about persistence. + Reversing either direction breaks downstream consumers. The decision + should be made when a concrete use case (key rotation across nodes, + durable peer attribution, multi-hop federation with OQ-32) forces it — + not before. 
+ + **Not a v1 blocker.** The UUID works for v1; this OQ exists so the + no-DB posture's limit is tracked and the decision is made deliberately + when it's needed, not accidentally when someone bolts a SQLite file onto + the assembly layer and it becomes load-bearing. +- **Cross-references**: ADR-008, ADR-018, ADR-021, ADR-025, ADR-029, OQ-33, + [auth.md](crates/core/auth.md), [config.md](crates/core/config.md) \ No newline at end of file
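
Reviewer sketch (not part of the patch): to make the OQ-33/OQ-34 key-rotation argument concrete, the following std-only Rust shows how an ACL keyed on a logical `PeerId` could survive a key rotation once some registry maps the logical name to the current fingerprint. Everything here is an assumption for illustration: the `PeerRegistry` trait shape, `InMemoryPeerRegistry`, `is_authorized`, `demo_rotation`, and the `fp-old`/`fp-new` strings are hypothetical, and OQ-34 explicitly leaves the real storage boundary undecided.

```rust
use std::collections::{HashMap, HashSet};

/// Logical peer identifier (OQ-33): a name or UUID, never the key fingerprint.
pub type PeerId = String;

/// Hypothetical storage boundary. OQ-34 has not decided whether core owns
/// a trait like this or stays storage-free.
pub trait PeerRegistry {
    /// Current fingerprint registered for a logical peer, if any.
    fn current_fingerprint(&self, peer: &str) -> Option<&str>;
    /// Record a key rotation: same logical peer, new crypto material.
    fn rotate(&mut self, peer: &str, new_fingerprint: &str);
}

/// In-memory stand-in for whatever backend is eventually chosen.
#[derive(Default)]
pub struct InMemoryPeerRegistry {
    fingerprints: HashMap<PeerId, String>,
}

impl PeerRegistry for InMemoryPeerRegistry {
    fn current_fingerprint(&self, peer: &str) -> Option<&str> {
        self.fingerprints.get(peer).map(String::as_str)
    }

    fn rotate(&mut self, peer: &str, new_fingerprint: &str) {
        self.fingerprints
            .insert(peer.to_string(), new_fingerprint.to_string());
    }
}

/// The ACL holds logical ids only; the registry answers "which key is
/// current for this peer", so ACL entries never mention fingerprints.
pub fn is_authorized(
    acl: &HashSet<PeerId>,
    registry: &dyn PeerRegistry,
    peer: &str,
    presented_fingerprint: &str,
) -> bool {
    acl.contains(peer) && registry.current_fingerprint(peer) == Some(presented_fingerprint)
}

/// Walks one rotation and reports (old key accepted before rotation,
/// new key accepted after rotation, old key accepted after rotation).
pub fn demo_rotation() -> (bool, bool, bool) {
    let mut registry = InMemoryPeerRegistry::default();
    registry.rotate("worker-a", "fp-old");

    let mut acl: HashSet<PeerId> = HashSet::new();
    acl.insert("worker-a".to_string());

    let before = is_authorized(&acl, &registry, "worker-a", "fp-old");
    registry.rotate("worker-a", "fp-new"); // the ACL itself is not touched
    let new_key_ok = is_authorized(&acl, &registry, "worker-a", "fp-new");
    let old_key_ok = is_authorized(&acl, &registry, "worker-a", "fp-old");
    (before, new_key_ok, old_key_ok)
}
```

In this sketch the rotation updates only the registry entry, which is the property the Key-rotation/ACL note asks for. With the v1 connection-assigned UUID there is no such mapping to update, which is why the patch scopes ACL persistence across rotation out of v1.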