docs(architecture): security constraints from security review

Address security review findings by adding explicit constraints to specs and implementation specialist role: Architectural constraints (spec updates): - metadata does not propagate through OperationEnv::invoke() — fresh HashMap for nested calls, closes the back-door leak channel where a handler that puts a secret in metadata would leak it to children and across from_call to remote nodes (ADR-014) - Config reload must be authenticated/local-only — malicious reload = root-equivalent privilege grant (config.md) - from_call trust is transitive — scoped env bounds reachability, not what the remote op does (operation-registry.md) - Token entropy ≥128 bits — prefix is lookup aid not secret, offline hash verification requires high-entropy tokens (auth.md) Implementation constraints (auth.md security constraints section + role spec): - OsRng for cryptographic nonces (AES-GCM IV reuse is catastrophic) - CachedKey derives Zeroize/ZeroizeOnDrop (no secrets in freed heap) - No unwrap()/expect() outside tests (poisoned lock recovery, not crash) - Implementation specialist role spec updated with all four constraints
2026-06-19 06:55:54 +00:00
parent c0a322ac29
commit 400c60e7f4
4 changed files with 53 additions and 10 deletions
--- a/.opencode/agents/implementation-specialist.md
+++ b/.opencode/agents/implementation-specialist.md
@@ -212,12 +212,24 @@ Read `AGENTS.md` at project root for full details. Key rules:
 1. **No comments in code** — Per project convention.
 2. **Error handling** — Use `anyhow::Result` for application code, `thiserror` for
   library error types. Never panic in library code.
-3. **Feature flags** — Transports are feature-gated (`tls`, `iroh`, `acme`). Base
+3. **No `unwrap()` or `expect()` outside tests** — These are debug signals that
+   something wasn't clear. If you reach for `unwrap()`, it means the error
+   handling path wasn't specified — stop and think about what should actually
+   happen on that error. For poisoned locks, use
+   `unwrap_or_else(|e| e.into_inner())` or explicit error propagation. A panic
+   in one operation must not cascade to other operations.
+4. **Cryptographic nonces use `OsRng`** — AES-GCM IVs and any other cryptographic
+   nonces must use `OsRng` (or equivalent CSPRNG), never `rand::random()`. IV
+   reuse under the same key is catastrophic for GCM.
+5. **Secret material is zeroized on drop** — Any type holding derived keys,
+   decrypted credentials, or other secret material must derive `Zeroize` and
+   `ZeroizeOnDrop`. Secrets must not linger in freed heap memory.
+6. **Feature flags** — Transports are feature-gated (`tls`, `iroh`, `acme`). Base
   crate should compile lean.
-4. **Async runtime** — `tokio` is the async runtime. All I/O is async.
-5. **Naming conventions** — Rust standard: `snake_case` for functions/variables/
+7. **Async runtime** — `tokio` is the async runtime. All I/O is async.
+8. **Naming conventions** — Rust standard: `snake_case` for functions/variables/
   modules, `PascalCase` for types/traits, `SCREAMING_SNAKE_CASE` for constants.
-6. **Module structure** — One module per component under `src/`. Re-export via
+9. **Module structure** — One module per component under `src/`. Re-export via
   `mod.rs` or `lib.rs` as appropriate.

 ## Key Principles
--- a/docs/architecture/crates/call/operation-registry.md
+++ b/docs/architecture/crates/call/operation-registry.md
@@ -1,6 +1,6 @@
 ---
 status: draft
-last_updated: 2026-06-21
+last_updated: 2026-06-22
 ---

 # Operation Registry
@@ -114,7 +114,7 @@ pub struct OperationContext {
 - `identity`: The authenticated caller (from `IdentityProvider`) — inbound auth (who is calling me). For external calls, this is who sent the `call.requested`. For internal calls, this is the parent handler's `handler_identity` (propagated through `OperationEnv::invoke()`)
 - `handler_identity`: The identity of the handler processing this call. Set at registration by the assembly layer. For internal calls (`internal: true`), the ACL check runs against this identity (ADR-015)
 - `capabilities`: Outbound credentials the handler may use (decrypted API keys, scoped vault access) — see [Capability Injection](#capability-injection) below
- `metadata`: Additional context (connection info, tracing IDs). **Must not hold secret material** — see ADR-014
+- `metadata`: Request-scoped context (tracing IDs, connection info). **Must not hold secret material** — see ADR-014. **Does not propagate through `OperationEnv::invoke()`** — nested calls get fresh metadata. The tracing link between parent and child is `parent_request_id`, not metadata propagation. Anything a handler needs to pass to a child goes in the call `input`.
 - `env`: The operation environment for composing calls to other operations. Scoped — the handler can only invoke a declared set of operations (ADR-015)
 - `internal`: When `true`, this call originated from composition (a handler calling another operation via `OperationEnv`), not from a wire request. This switches the authority context: ACL runs against `handler_identity`, not `identity`. The `internal` field uses module-private construction — handlers construct `OperationContext` through `OperationEnv::invoke()` which sets `internal: true`, or through the `CallAdapter` dispatch path which sets `internal: false`. The field is not `pub` for writes; only `pub fn is_internal(&self) -> bool` is exposed for reads. See ADR-015.

@@ -151,11 +151,26 @@ The CLI binary (or assembly layer) constructs the registry and passes it to the

 ```rust
 #[async_trait]
-pub trait OperationEnv: Send + Sync {
-    async fn invoke(&self, namespace: &str, operation: &str, input: Value, parent: &OperationContext) -> ResponseEnvelope;
+impl OperationEnv for LocalOperationEnv {
+    async fn invoke(&self, namespace: &str, operation: &str, input: Value, parent: &OperationContext) -> ResponseEnvelope {
+        let name = format!("{namespace}/{operation}");
+        let context = OperationContext {
+            request_id: format!("env-{name}"),
+            parent_request_id: Some(parent.request_id.clone()),
+            identity: parent.handler_identity.clone(),   // Parent's handler identity becomes the caller
+            handler_identity: parent.handler_identity.clone(), // Inherit handler authority for ACL
+            capabilities: parent.capabilities.clone(),   // Inherit caller's capabilities
+            metadata: HashMap::new(),                     // Fresh — does NOT propagate parent metadata
+            env: self.clone(),
+            internal: true,                              // Nested calls use handler authority
+        };
+        self.registry.invoke(&name, input, context).await
+    }
 }
 ```

+**Metadata does not propagate through composition.** Nested calls get fresh metadata (`HashMap::new()`), not the parent's metadata bag. This is a security constraint (ADR-014): `metadata: HashMap<String, Value>` accepts any `serde_json::Value`, including secret material. If metadata propagated through `env.invoke()`, a handler that accidentally placed a secret in metadata would leak it to every child operation — and if a child is a `from_call` operation (ADR-017), the metadata would cross the wire to the remote node. The tracing link between parent and child is `parent_request_id`, not metadata propagation. Anything a handler needs to pass to a child goes in the call `input`, not in ambient context.
+
 `OperationEnv` is the universal composition mechanism. A handler calls `context.env.invoke("fs", "readFile", input, &context)` and gets a `ResponseEnvelope` back — regardless of whether the operation runs locally, via an irpc service, or on a remote node.

 The `parent` parameter propagates the calling context: the nested call gets `parent_request_id: Some(parent.request_id)`, inherits `parent.identity`, and is marked `internal: true`.
@@ -294,6 +309,8 @@ The `Capabilities` type holds non-serializable, zeroized secret material. It doe

 **`from_call` imports remote operations.** The `from_call` adapter (ADR-017) discovers operations on a remote call protocol endpoint via `services/list` and `services/schema`, then registers them with handlers that forward calls over the QUIC connection. This makes cross-node composition transparent — a handler calling `env.invoke("worker", "exec", ...)` doesn't know whether the operation is local or remote. Connection direction (who opened the QUIC connection) is independent of call direction (who calls whom) — both sides can call each other once connected.

+**`from_call` trust is transitive.** A `from_call`-imported operation executes the remote node's code, not yours. The scoped env (ADR-015) bounds *which* operations are reachable, but not *what* they do. A compromised remote node can do anything its operations are declared to do (and anything its handler bugs allow). This is inherent to remote composition — same as trusting any RPC endpoint — but it must be explicit in the threat model. `from_call` means "I trust the remote node as much as my own handlers." The scoping protects the caller from reaching arbitrary ops; it does not protect against what the reached op does.
+
 **Scoped composition env.** The `OperationEnv` given to a handler is scoped — it can only invoke a declared set of operations, set at registration by the assembly layer. This bounds the parameterized-dispatch attack surface: a handler (or an LLM picking tools, or a quickjs sandbox) can only reach declared operations, not the entire registry. The scoped env is the reachability control; the handler identity is the authority control. Both are needed for least privilege. See ADR-015.

 ## Constraints
@@ -307,6 +324,7 @@ The `Capabilities` type holds non-serializable, zeroized secret material. It doe
 - **The composition env is scoped.** A handler can only invoke operations declared in its scoped env. This bounds parameterized-dispatch attack surface. See ADR-015.
 - **No vault operations are registered in the call protocol.** The vault is assembly-layer only (ADR-008, ADR-014). Handlers receive secret material through `OperationContext.capabilities`, not by calling vault operations over the wire.
 - **The call protocol carries no secret material.** Secret material (private keys, API keys, mnemonics, decrypted credentials) must not appear in `call.requested` payloads, `call.responded` payloads, or `OperationContext.metadata`. See ADR-014.
+- **Metadata does not propagate through composition.** `OperationEnv::invoke()` constructs fresh metadata for nested calls (`HashMap::new()`), not the parent's metadata. This prevents a handler that accidentally places a secret in metadata from leaking it to child operations — and if a child is a `from_call` operation (ADR-017), across the wire to a remote node. The tracing link is `parent_request_id`, not metadata propagation. See ADR-014.

 ## Design Decisions

--- a/docs/architecture/crates/core/auth.md
+++ b/docs/architecture/crates/core/auth.md
@@ -250,3 +250,14 @@ The endpoint's `AlknetEndpoint` also holds `Arc<dyn IdentityProvider>` for endpo
 ## Open Questions

 None. All auth-related open questions are resolved.
+
+## Security Constraints
+
+These are security-critical implementation requirements, not architectural decisions (the architecture is locked by the ADRs above). They are documented here so implementation agents don't miss them.
+
+- **Token entropy**: generated `alk_` tokens must have ≥128 bits of entropy. The prefix (first 8 chars) is for O(1) lookup and is not secret — it appears in logs by design. SHA-256 of the full token allows offline verification; this is safe only if the full token is high-entropy. The prefix alone must not be sufficient to authenticate.
+- **Config reload must be authenticated**: a reload that adds an authorized fingerprint or API key grants access immediately (see [config.md](config.md)). The reload trigger must be local-only (SIGHUP, file watch) or an admin-scoped call protocol operation. A malicious reload is equivalent to root-level privilege grant.
+- **Connection-level identity is for observability only**: `Connection::set_identity` stores the handler-resolved identity for logging/audit. Per-request identity (`OperationContext.identity`) takes precedence for ACL. See OQ-11.
+- **Cryptographic nonces use OsRng**: AES-GCM IVs and any other cryptographic nonces must use `OsRng` (or equivalent CSPRNG), not `rand::random()`. IV reuse under the same key is catastrophic for GCM (authenticity breaks, two-time-pad on plaintext). The vault implementation (`crates/alknet-vault/src/encryption.rs`) must use `OsRng` for IV generation.
+- **Derived keys are zeroized on drop**: cached derived keys (`CachedKey`) must derive `Zeroize` and `ZeroizeOnDrop`. When the cache evicts an entry (LRU) or the process exits without explicit `lock()`, derived private keys must not linger in freed heap memory. The cache must clear on drop, not just on explicit `lock()`.
+- **No `unwrap()` or `expect()` outside tests**: poisoned lock recovery uses `unwrap_or_else(|e| e.into_inner())` or explicit error propagation. A panic in one vault operation must not brick the vault for all other operations.
--- a/docs/architecture/crates/core/config.md
+++ b/docs/architecture/crates/core/config.md
@@ -1,6 +1,6 @@
 ---
 status: draft
-last_updated: 2026-06-16
+last_updated: 2026-06-21
 ---

 # Configuration
@@ -197,6 +197,8 @@ impl ConfigReloadHandle {

 The CLI binary creates a `ConfigReloadHandle` and passes it to a config watcher (file watcher, SIGHUP handler, or call protocol operation) that calls `reload()` when config changes are detected.

+**Config reload is a privilege-escalation path.** `ConfigIdentityProvider` reads from `ArcSwap<DynamicConfig>`, so a reload that adds an authorized fingerprint or API key grants access immediately. A malicious reload is equivalent to root-level privilege grant. The reload trigger **must be authenticated/local-only**: SIGHUP (local signal), local file watch, or an admin call protocol operation with the same auth treatment as any other mutation (requires `admin` scope, ADR-015). The implementation must not ship a reload endpoint with no auth "for convenience."
+
 ## ConfigError

 ```rust