Files
alknet/docs/reviews/001-pre-implementation-architecture-sanity-check.md
glm-5.2 6a7f8f91ad docs(architecture): resolve S1 — abort policy on OperationContext, not wire
ADR-016 Decision 6 specifies that the abort policy (abort-dependents vs
continue-running) is set on OperationContext and propagated through
OperationEnv::invoke() — the composing handler decides the child's
policy, not the wire caller. The call.requested payload does not carry
an abort policy field. This resolves the TBD that was masquerading as a
two-way door: two of the three options ADR-016 floated (wire payload,
per-operation declaration) were inconsistent with the ADR's own
assumptions.

Also marks review #001 as resolved — all 5 critical, 4 warning, and 4
suggestion findings are now addressed.
2026-06-21 10:34:12 +00:00

25 KiB
Raw Blame History

status, last_updated, reviewed_documents, reviewer
status last_updated reviewed_documents reviewer
resolved 2026-06-20
overview.md
README.md
open-questions.md
crates/core/README.md
crates/core/core-types.md
crates/core/endpoint.md
crates/core/auth.md
crates/core/config.md
crates/call/README.md
crates/call/call-protocol.md
crates/call/operation-registry.md
crates/vault/README.md
crates/vault/mnemonic-derivation.md
crates/vault/encryption.md
crates/vault/service.md
crates/vault/protocol.md
decisions/001 through 021 (all 21 ADRs)
architect

Architecture Gap Review #001

Purpose

This document captures gaps, ambiguities, and inconsistencies found in the alknet architecture specifications (core, call, vault crates) before decomposition into implementation tasks. Each finding is categorized by severity and structured as ProblemSuggestion (or Open Question where no solution is yet known).

This is a final sanity check before implementation. The vault crate carries over source from the POC with documented drift (OsRng, unwrap(), CURRENT_KEY_VERSION=1, spawn return bug) flagged for correction during implementation sync; that drift is out of scope for this review. The focus is cross-boundary ambiguities and inconsistencies that would cause divergent implementations across crates.

Severity Definitions

Severity Meaning
Critical Must resolve before implementation — would cause divergent implementations or security issues
Warning Should resolve — missing specifications that could cause confusion or rework
Suggestion Consider — improvements for clarity or completeness

Critical Findings

C1. handler_identity Has No Registration Path

Documents: ADR-015 (lines 79, 141, 251), operation-registry.md (lines 81, 104, 122, 126, 185), call-protocol.md (lines 265, 279, 288)

Problem: ADR-015 and operation-registry.md are emphatic and repeated that handler_identity is "set at registration by the assembly layer":

ACL check runs against the composing handler's handler_identity (set at registration) — operation-registry.md L81

handler_identity: The identity of the handler processing this call. Set at registration by the assembly layer. — operation-registry.md L122

Handler identity is set at registration by the assembly layer. — ADR-015 Assumption 2

But the registration API shown in operation-registry.md has no place for it:

  • OperationSpec (L3341) has fields: name, namespace, op_type, visibility, input_schema, output_schema, access_control. No handler_identity.
  • OperationRegistry::register(spec, handler) (L140) takes (OperationSpec, Handler).
  • OperationRegistryBuilder::with(spec, handler) (L148153, L259266) takes (OperationSpec, Handler).

So the spec says handler_identity is "set at registration" but the registration API doesn't accept it. Tracing how it would reach OperationContext at call time reveals the gap: build_root_context (call-protocol.md L265286) sets handler_identity: None for wire-originated calls — correct for the root (ACL runs against caller identity). OperationEnv::invoke() propagates parent.handler_identity.clone() to children (L185). But the parent's handler_identity was None (root), so every internal call gets handler_identity: None. Under the shown API, ADR-015's "ACL runs against handler_identity for internal calls" is checking against None — which is the privilege-escalation gap ADR-015 was written to close.

The handler's identity must enter the system somewhere. The natural place is registration: the assembly layer associates an Identity (scopes) with each handler, and the dispatch path attaches it to the OperationContext for calls dispatched to that handler (so it propagates via parent.handler_identity when that handler composes children). The spec shows no such mechanism.

The TypeScript reference (/workspace/@alkdev/operations/src/registry.ts) has the same shape problem at a smaller scale — OperationContext.trusted (types.ts L27) is set by buildEnv() (env.ts L37), not by register(). The TypeScript registry never carried a handler identity; it relied on trusted being set by the env builder. The Rust rewrite replaces trusted with handler_identity (ADR-015) but doesn't show the equivalent of buildEnv's wiring point.

Suggestion: Decide and document where handler_identity is supplied at registration. Likely a third parameter to register/with or part of a registration bundle (see C4). The dispatch path needs to look up the handler's registered identity by operation name and attach it to the OperationContext before invoking the handler, so that handler's children inherit it via parent.handler_identity. This should be its own ADR (ADR-022: Handler Registration and Dispatch Context) before any call-crate implementation begins — see C4 for the synthesis.


C2. Scoped Composition Env Has No Registration/Construction Path

Documents: ADR-015 (lines 150169, 256260), operation-registry.md (lines 126, 168, 174, 307), call-protocol.md (line 285)

Problem: Same shape as C1. ADR-015 and operation-registry.md say the scoped composition env is "set at registration by the assembly layer":

The OperationEnv given to a handler is scoped — it can only invoke a declared set of operations, set at registration by the assembly layer — operation-registry.md L307

Static scoping at registration: the assembly layer declares which operations a handler may compose — ADR-015 L159

The scoped env is declared at registration (static) — ADR-015 Assumption 3

But:

  • register(spec, handler) / with(spec, handler) take no scoped-env declaration. OperationSpec has no field for it.
  • The only OperationEnv implementation shown is LocalOperationEnv { registry: Arc<OperationRegistry> } (operation-registry.md L168) — the full registry, not a scoped one.
  • LocalOperationEnv::invoke (L174) does self.registry.invoke(&name, ...) against the unfiltered registry. There's no scoping layer shown.
  • build_root_context sets env: self.env.clone() (call-protocol.md L285) — one env cloned to every root context. That's a single global env, not per-handler scoping.

For scoping to work, each handler must receive a different scoped env, but the dispatch path shown gives every root context the same self.env. ADR-015 explicitly calls the scoped env one of the three independent controls that are "all needed" (L177198), and says without it "the handler can call anything in the registry" — the parameterized-dispatch attack surface. The spec locks the requirement but not the mechanism, and the mechanism is non-obvious (it changes the registration API and the per-handler context construction).

The TypeScript reference (/workspace/@alkdev/operations/src/env.ts) implements scoping via buildEnv({ registry, context, allowedNamespaces }) — the env builder takes an allowedNamespaces filter and returns a pre-filtered env (L1446). The Rust equivalent needs the same: a per-handler env built from a per-handler scope declaration. The declaration point (registration) and the construction point (dispatch) both need specifying.

Suggestion: Specify how a scoped env is declared at registration and how it reaches the handler's OperationContext.env. Likely the same bundle that resolves C1 — see C4.


C3. Capabilities Lives in Two Places and the Two Models Aren't Reconciled

Documents: ADR-014 (lines 8283, 114124), operation-registry.md (lines 105, 253256, 274297), call-protocol.md (line 274)

Problem: ADR-014 and operation-registry.md show two different models for how a handler gets outbound credentials:

  • Model A — construction-time capture in the handler closure. operation-registry.md L253256: agent_chat_handler(Capabilities::new().with_api_key("google", google_api_key)). The handler is constructed with capabilities baked in. ADR-014 L82: "injected at handler construction (the common case: a static decrypted API key held for the handler's lifetime)."

  • Model B — per-request on OperationContext.capabilities, propagated through composition. operation-registry.md L105: OperationContext { pub capabilities: Capabilities }. L186: capabilities: parent.capabilities.clone() in invoke(). ADR-014 L83: "or scoped per-request for internal-only flows." call-protocol.md L274: build_root_context(..., capabilities: Capabilities) takes a capabilities parameter.

These two models don't connect. If the handler closure captured the capabilities at construction (Model A), then OperationContext.capabilities (Model B) is either redundant or must be populated from the closure — but the closure isn't passed the context, the closure receives the context. There's a circular dependency: the handler needs capabilities to do its work, the capabilities are on the context the handler receives, but the context is built by the CallAdapter which doesn't have the handler's construction-time capabilities (the CallAdapter struct at call-protocol.md L3538 has no capabilities field).

build_root_context's comment (call-protocol.md L274) says "the CallAdapter's own capabilities (if any)" — hedged with "if any," suggesting even the spec author wasn't sure. For composition, invoke() clones parent.capabilities — so children inherit the parent's capabilities. But if the parent's context.capabilities was empty (because the parent handler used its captured capabilities, not the context's), children get empty capabilities too. An agent handler composing /fs/readFile with an outbound LLM call would lose the LLM API key at the composition boundary.

Suggestion: Pick one of:

  • (a) Capabilities are only per-request on OperationContext, populated by the dispatch path from a per-handler capabilities table held by the registry/adapter (the construction-time "baking" just populates that table).
  • (b) Capabilities are captured in the handler closure and OperationContext.capabilities is removed, with composition passing capabilities explicitly as an invoke() parameter.
  • (c) Keep both but document the bridge (e.g., the handler closure captures an Arc<Capabilities> and the registry also holds a clone that the CallAdapter uses to populate build_root_context).

Option (a) is closest to the shape that resolves C1 and C2 cleanly — see C4.


C4. C1C3 Are One Tangle: Registration Bundle Is Missing

Documents: operation-registry.md (lines 130153, 168194), call-protocol.md (lines 265288)

Problem: C1, C2, C3 are the same underlying gap: the registration/dispatch API surface in operation-registry.md doesn't carry the three things ADR-014 and ADR-015 say are "set at registration" — handler identity, scoped env, and capabilities. All three are load-bearing for the security model, all three are locked in ADRs, and none are reflected in the register(spec, handler) signature or the OperationSpec struct.

The three controls in ADR-015 (visibility, handler identity, scoped env) plus the capability injection in ADR-014 all enter the system at the same boundary: the assembly layer hands the registry a (spec, handler) pair plus the handler's runtime context material. The natural shape is a registration bundle that carries all three:

// Illustrative — the exact shape is the decision to make.
pub struct HandlerRegistration {
    pub spec: OperationSpec,
    pub handler: Handler,
    pub identity: Identity,               // C1: handler identity for ACL
    pub scoped_env: ScopedOperationEnv,  // C2: reachable operations
    pub capabilities: Capabilities,      // C3: outbound credentials
}

And the dispatch path (build_root_context + OperationEnv::invoke) reads from this bundle: the CallAdapter looks up the handler's registration by operation name, attaches registration.identity to OperationContext.handler_identity (so children inherit it), uses registration.scoped_env as OperationContext.env (so the handler can only reach declared ops), and populates OperationContext.capabilities from registration.capabilities (so composition propagates them via parent.capabilities.clone()).

This resolves all three cleanly and matches the prior art: the TypeScript buildEnv({ registry, context, allowedNamespaces }) already bundles the scoping decision with the env construction; the Rust version generalizes this to a per-handler registration rather than a per-call build.

Suggestion: Write ADR-022 (Handler Registration and Dispatch Context) that:

  1. Defines the registration bundle shape.
  2. Updates OperationRegistry::register / OperationRegistryBuilder::with signatures.
  3. Shows how build_root_context and OperationEnv::invoke() read from the bundle.
  4. Reconciles Model A and Model B for capabilities (recommend option (a): the bundle carries capabilities, the closure captures nothing, the dispatch path populates OperationContext.capabilities from the bundle).
  5. Updates operation-registry.md and call-protocol.md to match.

This is the single highest-value piece of work before implementation. It closes the gap between ADR-014/015's locked security model and the API surface in one place, before any implementer has to guess.


C5. Operation Registry Has No Error Schema Concept

Documents: operation-registry.md (lines 3341, OperationSpec), call-protocol.md (lines 127143, call.error payload), ADR-017 (lines 113124, from_call mirrors remote spec)

Problem: The OperationSpec has input_schema and output_schema but no error_schemas. The call.error payload (call-protocol.md L128134) carries a code (string enum: NOT_FOUND, FORBIDDEN, INVALID_INPUT, INTERNAL, TIMEOUT) and a message, but there's no way for an operation to declare the specific errors it can return beyond these five infrastructure codes.

This matters for two reasons:

  1. OpenAPI/HTTP adapter fidelity. OpenAPI specs naturally include error information (response status codes with schemas). The from_openapi adapter (ADR-017 L113124) imports operations and mirrors "the remote operation's name, namespace, type, schemas, and access control" — but with no error schema field, error responses from the OpenAPI source are dropped on import. to_openapi has nowhere to project error information to. The same gap applies to from_mcp/to_mcp. An OpenAPI operation that returns 404: { schema: NotFoundError } and 422: { schema: ValidationError } cannot be faithfully represented in alknet's OperationSpec.

  2. Client ergonomics and type safety. A client calling /fs/readFile can't know from the schema that it might return FILE_NOT_FOUND vs PERMISSION_DENIED vs INVALID_PATH. The five infrastructure codes cover protocol-level failures; they don't cover operation-level domain errors. A caller has to parse message strings — the exact anti-pattern that typed RPC is meant to avoid.

The TypeScript reference (/workspace/@alkdev/operations/src/types.ts L3847, L94, L112) defines ErrorDefinitionSchema and an optional errorSchemas?: ErrorDefinition[] on OperationSpec. Each error definition has code, description, schema, and optional httpStatus. The mapError() function (error.ts L2551) matches thrown errors against the declared error schemas by code prefix. This is a proven pattern; the Rust implementation should carry it forward. The translator agent likely omitted it because it's Optional in the TS schema (so dropping it doesn't break the happy path) and because error schemas are semantically different from input/output schemas (an operation returns one output but could return any of several errors). That's a reasonable judgment call for a first translation pass, but it leaves a real gap for adapters and clients.

Without error schemas, from_openapi and to_openapi are lossy on the error axis — an OpenAPI operation's error contract is silently dropped. This undermines the "discoverable, type-safe RPC" goal in operation-registry.md L1824. It's also a cross-boundary issue: the call protocol, the operation registry, and the adapters all need to agree on the error representation, and right now only the call protocol has one (the five infrastructure codes).

Suggestion: Add an optional error_schemas: Vec<ErrorDefinition> (or similar) to OperationSpec, where ErrorDefinition carries a code, optional http_status (for adapter projection), and a JSON Schema for the error detail payload. Extend the call.error payload to optionally carry an error detail matching the declared schema. Document that the five infrastructure codes (NOT_FOUND etc.) are protocol-level and distinct from operation-level error codes. This should be its own ADR (ADR-023: Operation Error Schemas) — the error representation is a one-way door (once the wire format carries error codes, changing them breaks clients), and it must be decided before implementing from_openapi/to_openapi, otherwise the adapter contract in ADR-017 can't be faithfully implemented.


Warning Findings

W1. ADR-020 Contains a Stale Example Path That Contradicts ADR-021

Documents: ADR-020 (line 70)

Problem: ADR-020 L70 says "different derivation paths (m/74'/2'/0'/0' for v1, m/74'/2'/1'/0' for a future v2)". This puts the version index at the third level (2'/1'/0'). ADR-021 (which resolves the actual rotation scheme) and all the vault specs put the version index at the last level: v2 = m/74'/2'/0'/0', v3 = m/74'/2'/0'/1' (mnemonic-derivation.md L226, encryption.md L160164, ADR-021 L6873). ADR-020's example is a pre-ADR-021 leftover and directly contradicts the locked decision. An implementer reading ADR-020 in isolation would derive at the wrong path.

Suggestion: Update ADR-020 L70's example to m/74'/2'/0'/0' for v2, m/74'/2'/0'/1' for v3, or remove the illustrative example and point to ADR-021 for the scheme.


W2. EncryptionError::KeyVersionMismatch Is Stale After OQ-22 Resolved

Documents: encryption.md (lines 197, 205210)

Problem: EncryptionError::KeyVersionMismatch { expected, actual } is defined with comment "reserved for future rotation (OQ-22)". encryption.md L205 says it's "defined but unused in v2." But OQ-22 is now resolved by ADR-021, and ADR-021's rotation mechanism (version-indexed paths + decrypt derives at the version's path + rotate re-encrypts) never emits KeyVersionMismatch — there's no version-mismatch to detect, because decrypt always derives the key the blob says to use. So after ADR-021, KeyVersionMismatch isn't "reserved for future rotation" (rotation is implemented, without it) — it's effectively dead. The "reserved for future (OQ-22)" comment is stale and could mislead an implementer into thinking they need to wire it up for rotation.

Suggestion: Update the encryption.md comment to state that ADR-021 implements rotation without this variant, and clarify whether it's retained for a different future use or should be removed.


W3. from_mcp Is Missing from Adapter Lists in operation-registry.md and ADR-014

Documents: operation-registry.md (lines 59, 301), ADR-014 (line 105)

Problem: ADR-017 L160165 lists four import adapters: FromOpenAPI, FromMCP, FromCall, FromJsonSchema. But:

  • operation-registry.md L59 says "from_openapi and from_jsonschema adapters register operations as Internal by default" — omits from_mcp and from_call.
  • operation-registry.md L301 says "The from_openapi, from_jsonschema, and from_call adapter patterns" — omits from_mcp.
  • ADR-014 L105 says "The from_openapi and from_jsonschema adapter patterns" — omits from_mcp and from_call.

from_mcp is a real import adapter (in ADR-017's trait list and ADR-013's adapter contract). Its absence from the prose in operation-registry.md/ADR-014 is likely an oversight, but it could make an implementer wonder whether from_mcp is real or whether it follows the Internal-by-default rule.

Suggestion: Standardize on listing all four import adapters (or say "all import adapters" / "all OperationAdapter implementations") when stating the Internal-by-default rule, so the rule clearly applies to from_mcp too.


W4. derive_encryption_key_for_version (ADR-021) Not in service.md Method List

Documents: service.md (lines 119127, 161179), ADR-021 (lines 104122)

Problem: ADR-021 L104122 introduces derive_encryption_key_for_version(version) and shows decrypt using it. But service.md's "Derive Methods" section (L119127) only shows derive_encryption_key(path), and its encrypt/decrypt (L161, L172) reference the version-indexed path scheme in prose but don't show the version-aware method. An implementer whose primary reference is service.md wouldn't see derive_encryption_key_for_version and might implement decrypt as "always derive at PATHS::ENCRYPTION" (the pre-ADR-021 behavior, which ADR-021 explicitly calls out as a drift bug to fix). The ADR is authoritative, but the spec doc is what the implementer reads first.

Suggestion: Add derive_encryption_key_for_version to service.md's method list and show decrypt calling it, mirroring ADR-021. Clarify whether derive_encryption_key(path) (the path-based method) is still public API or subsumed by the version-based one.


Suggestions

S1. Abort Policy Field Absent from call.requested Payload (RESOLVED)

Documents: call-protocol.md (lines 112118, call.requested payload), ADR-016 (lines 8689, 177181)

Resolution: ADR-016 Decision 6 (added during this review's resolution pass) specifies that the abort policy is set on OperationContext and propagated through OperationEnv::invoke() — not on the wire payload, not on OperationSpec. The composing handler decides the child's policy. The call.requested payload explicitly does not carry an abort policy field. call-protocol.md updated to state this. ADR-016 Assumption 5 updated to be definitive. No longer a TBD.


S2. build_root_context "CallAdapter's Own Capabilities (If Any)" Hedge (RESOLVED)

Documents: call-protocol.md (line 274)

Resolution: Resolved by ADR-022. The hedge is gone — build_root_context now reads registration.capabilities.clone() from the HandlerRegistration bundle looked up by operation name. The comment is definitive.


S3. Example Style: Arc::new(handler) Inline vs. Pre-wrapped (RESOLVED)

Documents: operation-registry.md (lines 149, 264)

Suggestion: L149 uses .with(spec, Arc::new(services_list_handler)) while L264 uses .with(spec, agent_handler) where agent_handler was already Arc::new'd at L253. Both are valid; pick one style in the examples to avoid a moment of "wait, does with take an Arc or a Handler?" confusion. Trivial.

Resolution: Resolved by ADR-022. The registration API changed to with(HandlerRegistration) / with_local(...) / with_leaf(...), so the inline Arc::new vs pre-wrapped ambiguity no longer exists — the builder methods take the handler as part of the registration bundle.


S4. last_updated Dates Scattered; README Index Older Than Sub-docs

Documents: frontmatter across all docs

Suggestion: The README index date (2026-06-19) was older than several sub-documents (core/README 06-21, call/operation-registry 06-22, etc.). If frontmatter dates are meant to mean something, bump README/overview when sub-docs change. Not worth a cycle on its own.

Resolution: README and overview dates bumped during the ADR-022/023 resolution pass. Ongoing date maintenance is cosmetic and can be addressed opportunistically.


Summary Statistics

Severity Count Status
Critical 5 All resolved (C1C4 via ADR-022, C5 via ADR-023)
Warning 4 All resolved (W1W4 fixed in spec edits)
Suggestion 4 All resolved (S1 via ADR-016 Decision 6, S2 via ADR-022, S3 via ADR-022, S4 during ADR-022/023 pass)

All findings resolved. Resolution summary:

  1. C1C4 (registration bundle tangle) → ADR-022 (Handler Registration, Provenance, and Composition Authority). Registration bundle carries provenance, composition authority, scoped env, capabilities. Dispatch path reads from bundle.
  2. C5 (error schemas) → ADR-023 (Operation Error Schemas). error_schemas on OperationSpec, details on call.error, adapter fidelity for from_openapi/to_openapi.
  3. W1W4 (spec drift/missing methods) → Fixed in spec edits during the ADR-022/023 resolution pass.
  4. S1S4 → S1 resolved by ADR-016 Decision 6 (abort policy on OperationContext, not wire payload). S2, S3 resolved by ADR-022 (API surface change eliminated the hedges/ambiguities). S4 resolved during the ADR-022/023 pass (dates bumped).

Review #001 is complete. All 5 critical, 4 warning, and 4 suggestion findings resolved.