docs(architecture): add ADR-015 privilege model and authority context, resolve OQ-18

ADR-015 locks the call protocol's security model:
- internal flag switches authority context to handler identity, not skip ACL
- Operations have External/Internal visibility (Internal returns NOT_FOUND from wire, excluded from services/list)
- OperationContext carries both identity (caller/principal) and handler_identity (handler/agent)
- Scoped composition env bounds reachability (handler can only invoke declared operations)
- Three controls together: visibility (wire boundary) + handler identity (authority) + scoped env (reachability) = least privilege

Spec updates:
- OperationSpec gains Visibility field (External/Internal)
- OperationContext gains handler_identity field
- AccessControl section: ACL runs against caller identity for external, handler identity for internal
- LocalOperationEnv propagates handler_identity
- services/list only returns External operations
- Adapter-registered operations are Internal by default
- OQ-18 resolved, ADR-015 referenced across all call crate specs
2026-06-18 08:55:34 +00:00
parent b4aadc6b93
commit 6285779c30
7 changed files with 345 additions and 51 deletions
--- a/docs/architecture/decisions/015-privilege-model-and-authority-context.md
+++ b/docs/architecture/decisions/015-privilege-model-and-authority-context.md
@@ -0,0 +1,290 @@
+# ADR-015: Privilege Model and Authority Context
+
+## Status
+
+Accepted
+
+## Context
+
+The call protocol allows handlers to compose other operations through
+`OperationEnv::invoke()`. This creates a call tree: a parent request spawns
+children, which may spawn their own children. The `parent_request_id` field
+records this tree.
+
+The previous design had a `trusted: bool` flag on `OperationContext`. When a
+handler invoked another operation through `OperationEnv`, the nested call was
+marked `trusted: true` and **all ACL checks were skipped**. The intent was to
+avoid double-checking: if `/agent/chat` is allowed and it internally calls
+`/auth/verify`, the auth check is "trusted" because the caller already passed
+ACL on `/agent/chat`.
+
+This is a privilege escalation vector. Two concrete attacks:
+
+**Buggy handler**: a handler accidentally calls an operation it shouldn't. With
+`trusted: true`, ACL is skipped entirely. A handler with `read` scope that
+accidentally calls an operation requiring `admin` succeeds — the caller's `read`
+scope effectively triggered an `admin` operation.
+
+**Parameterized dispatch**: a handler takes caller input that determines which
+internal operation to call. This is the core agent use case — an LLM picks which
+tool to invoke based on the user's prompt. With `trusted: true`, the LLM (and
+therefore the user) can invoke any registered operation without ACL checks,
+regardless of the caller's scopes. A caller with `chat` scope can invoke
+operations requiring `admin` by choosing the right tool name.
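
To make the escalation concrete, here is a minimal sketch of the old `trusted` behavior as the paragraphs above describe it. The `Registry` type, the `old_acl_allows` function, and the operation and scope names are hypothetical stand-ins, not the project's actual code.

```rust
use std::collections::HashMap;

/// Hypothetical registry: maps an operation name to the scope it requires.
pub struct Registry {
    pub required_scope: HashMap<&'static str, &'static str>,
}

/// Sketch of the OLD model: a nested call marked `trusted` bypasses ACL.
pub fn old_acl_allows(
    reg: &Registry,
    op: &str,
    caller_scopes: &[&str],
    trusted: bool,
) -> bool {
    if trusted {
        // Nested call: the check is skipped entirely, whatever `op` requires.
        return true;
    }
    match reg.required_scope.get(op) {
        Some(required) => caller_scopes.contains(required),
        // Unknown operations are never allowed.
        None => false,
    }
}
```

With a caller holding only `chat`, a direct call to an `admin` operation is rejected, but the same call routed through a handler (so `trusted` is `true`) is allowed. If the operation name comes from caller input, the caller effectively chooses what runs.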
+
+The call protocol is a general-purpose cross-boundary RPC mechanism. Every
+consumer — NAPI adapter, Python adapter, agent service, future services —
+inherits whatever privilege model the protocol defines. The privilege boundary
+between external and internal calls, and the authority context switch for
+composition, are core protocol semantics. This is not a feature of any single
+consumer; it is the protocol's security model.
+
+The agent service is a useful test case because it exercises every edge case
+(parameterized dispatch, deep composition, dynamic operations, role-based
+escalation), but the decision belongs to the call protocol.
+
+## Mental Models
+
+Three analogies clarify the model:
+
+**Kernel/user mode**: external operations are syscalls — curated entry points
+where an unprivileged caller can enter the kernel. Internal operations are
+kernel functions — callable only from composition, not from userspace. The
+`internal` flag means "this call is in kernel mode." Kernel mode still has
+access controls: it runs under a different principal, not with no principal.
+
+**Domain/integration events**: external operations are integration events —
+they cross a boundary and are visible to external systems. Internal operations
+are domain events — they stay within the bounded context. `services/list` is
+the integration contract; it only exposes integration events.
+
+**Principal/agent (legal contracting)**: the caller is the principal; the
+handler is the agent. The principal delegates scoped authority to the agent.
+The agent acts under its own identity (for attribution) but with the principal's
+delegated authority (for scope). Liabilities flow upstream (traceable through
+`parent_request_id`); privileges flow downstream (the agent gets a subset of the
+principal's authority). Role-based escalation: a lower-privileged role can
+escalate through a chain of command (agent requests promotion, architect
+performs it), not through direct authority.
+
+## Decision
+
+### 1. The `internal` flag switches authority context, not skips ACL
+
+The `internal` flag on `OperationContext` marks calls that originated from
+composition (a handler calling another operation via `OperationEnv`), as opposed
+to external calls that arrived as `call.requested` from a wire client.
+
+When `internal: true`:
+- The ACL check runs against the **handler's identity** (set at registration by
+  the assembly layer), not the caller's identity and not as a blanket skip.
+- The handler's identity carries only the scopes its composition needs (least
+  privilege), not blanket root and not the caller's scopes.
+
+When `internal: false` (external call from the wire):
+- The ACL check runs against the **caller's identity** (from `AuthContext`,
+  resolved per-request).
+
+The `internal` flag is set by `OperationEnv`, not by callers. A handler cannot
+mark its own call as internal. The field uses module-private construction; only
+`pub fn is_internal(&self) -> bool` is exposed for reads.
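
The rule in this section can be sketched as follows, assuming a reduced `OperationContext` and a simplified `Identity`. The `call` module, `from_wire`, `internal_context`, and `acl_subject` are illustrative names, not part of the spec. Because `internal` is a private field, only code inside the module (standing in for `OperationEnv`) can produce an internal context.

```rust
mod call {
    /// Simplified identity; the real `Identity` type is defined elsewhere.
    #[derive(Debug, Clone, PartialEq)]
    pub struct Identity {
        pub id: String,
        pub scopes: Vec<String>,
    }

    /// Reduced context: only the fields relevant to authority selection.
    pub struct OperationContext {
        pub identity: Option<Identity>,
        pub handler_identity: Option<Identity>,
        internal: bool, // private: code outside `call` cannot set it
    }

    impl OperationContext {
        /// Context for a `call.requested` that arrived from the wire.
        pub fn from_wire(caller: Option<Identity>, handler: Option<Identity>) -> Self {
            Self { identity: caller, handler_identity: handler, internal: false }
        }

        /// Read-only view of the flag.
        pub fn is_internal(&self) -> bool {
            self.internal
        }

        /// The identity the ACL check runs against: the handler's for
        /// internal calls, the caller's for external calls.
        pub fn acl_subject(&self) -> Option<&Identity> {
            if self.internal {
                self.handler_identity.as_ref()
            } else {
                self.identity.as_ref()
            }
        }
    }

    /// Stand-in for `OperationEnv`: the only code that builds internal contexts.
    pub struct OperationEnv;

    impl OperationEnv {
        pub fn internal_context(
            &self,
            identity: Option<Identity>,
            handler_identity: Option<Identity>,
        ) -> OperationContext {
            OperationContext { identity, handler_identity, internal: true }
        }
    }
}
```

A handler outside `call` cannot write `OperationContext { internal: true, .. }`, because the struct literal needs access to the private field. The sketch deliberately leaves open which identities `OperationEnv` passes in.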
+
+### 2. Operations have External/Internal visibility
+
+`OperationSpec` has a `visibility: Visibility` field:
+
+```rust
+pub enum Visibility {
+    External,  // Callable from the wire (call.requested from a client)
+    Internal,  // Composition-only (env.invoke from a handler)
+}
+```
+
+The assembly layer declares visibility when registering operations.
+
+When a `call.requested` arrives from a wire client:
+- An `Internal` operation returns `call.error` with code `NOT_FOUND` (not
+  `FORBIDDEN`). This does not leak that the operation exists.
+- An `External` operation proceeds to ACL checking.
+
+`services/list` only returns `External` operations to remote callers. Internal
+operations are not part of the wire-facing API surface. A remote client cannot
+enumerate the internal call tree.
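
A minimal sketch of the wire-side behavior described above, assuming a flat slice as the registry. `resolve_from_wire`, `services_list`, and the reduced `OperationSpec` are hypothetical helpers for illustration. An `Internal` operation and an unregistered one take the same `NOT_FOUND` path, so a remote caller cannot tell them apart.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Visibility {
    External,
    Internal,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CallError {
    NotFound,
}

/// Reduced spec: only the fields needed for the visibility gate.
pub struct OperationSpec {
    pub name: String,
    pub visibility: Visibility,
}

/// Lookup for a `call.requested` from a wire client. Only `External`
/// operations resolve; everything else is `NotFound`. ACL checking
/// would follow on the `Ok` path.
pub fn resolve_from_wire<'a>(
    registry: &'a [OperationSpec],
    name: &str,
) -> Result<&'a OperationSpec, CallError> {
    registry
        .iter()
        .find(|spec| spec.name == name && spec.visibility == Visibility::External)
        .ok_or(CallError::NotFound)
}

/// What `services/list` would return to a remote caller.
pub fn services_list(registry: &[OperationSpec]) -> Vec<&str> {
    registry
        .iter()
        .filter(|spec| spec.visibility == Visibility::External)
        .map(|spec| spec.name.as_str())
        .collect()
}
```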
+
+### 3. Handler identity is carried on OperationContext
+
+`OperationContext` carries both the caller's identity (who invoked me) and the
+handler's identity (who am I acting as):
+
+```rust
+pub struct OperationContext {
+    pub request_id: String,
+    pub parent_request_id: Option<String>,
+    pub identity: Option<Identity>,            // Caller's identity (inbound)
+    pub handler_identity: Option<Identity>,    // Handler's identity (composition authority)
+    pub capabilities: Capabilities,
+    pub metadata: HashMap<String, Value>,
+    pub env: OperationEnv,
+    pub internal: bool,
+}
+```
+
+- `identity`: the authenticated caller (from `AuthContext`). For external calls,
+  this is who sent the `call.requested`. For internal calls, this is the
+  *parent handler's* identity (propagated through `OperationEnv::invoke()`).
+- `handler_identity`: the identity of the handler processing this call. Set at
+  registration by the assembly layer. For external calls, this is the handler's
+  own identity. For internal calls, the ACL check runs against this identity.
+
+The distinction is the principal/agent model: `identity` is the principal (who
+delegated), `handler_identity` is the agent (who is acting). Attribution traces
+through both — any action can be attributed to the handler that performed it and
+the caller that initiated the chain.
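
One literal reading of the two field descriptions above can be sketched as follows. `child_context` and its parameters are hypothetical, the struct is reduced to the identity and request-id fields, and the real propagation logic in `OperationEnv::invoke()` may differ.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
}

/// Reduced context: request ids and the two identity fields only.
#[derive(Debug, Clone)]
pub struct OperationContext {
    pub request_id: String,
    pub parent_request_id: Option<String>,
    pub identity: Option<Identity>,
    pub handler_identity: Option<Identity>,
}

/// Build the context for a nested call, following the field descriptions:
/// `identity` becomes the parent handler's identity, `handler_identity`
/// is the identity registered for the handler that processes the call.
pub fn child_context(
    parent: &OperationContext,
    request_id: &str,
    callee_identity: Identity,
) -> OperationContext {
    OperationContext {
        request_id: request_id.to_string(),
        // Links the child to its parent so the call tree can be traced.
        parent_request_id: Some(parent.request_id.clone()),
        identity: parent.handler_identity.clone(),
        handler_identity: Some(callee_identity),
    }
}
```

Following `parent_request_id` from any child back to the root recovers every handler identity in the chain and, at the root, the original caller.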
+
+### 4. Scoped composition env
+
+The `OperationEnv` given to a handler is scoped — it can only invoke a declared
+set of operations. This bounds the parameterized-dispatch attack surface: a
+caller (or an LLM) picking which operation to invoke picks from the declared
+set, not from the entire registry.
+
+Scoping happens at two levels:
+
+**Static scoping at registration**: the assembly layer declares which operations
+a handler may compose. The `OperationEnv` given to that handler is pre-filtered
+— `invoke("fs", "readFile", ...)` works, `invoke("admin", "deleteUser", ...)`
+returns `NOT_FOUND`. This is the reachability control.
+
+**Dynamic scoping at sandbox creation**: when a handler spawns a sandbox
+(quickjs), it passes a *further scoped* env to the sandbox — a subset of what
+the handler itself can reach. The handler might have `fs:read` and `bash:exec`,
+but it only gives the sandbox `fs:read` (not `bash:exec`), because the sandbox
+runs untrusted LLM-generated code. This is the "privileges flow downstream"
+principle: the principal delegates a subset.
+
+The specific API for declaring the scoped operation set (allowed-operations
+list, allowed-namespaces, or a trait-based filter) is a two-way door for
+implementation. The TypeScript `@alkdev/operations` `buildEnv()` used an
+`allowedNamespaces` filter; the Rust implementation may be finer-grained
+(operation-level, not just namespace-level) to be safe.
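
Since the concrete API is left open, the following is only one possible shape: an operation-level allow-list with a `narrow` method for the sandbox case. `ScopedEnv`, `new`, `narrow`, and the string returned by `invoke` are assumptions made for illustration. The property it shows is that narrowing is an intersection, so a derived env can never reach more than its parent.

```rust
use std::collections::BTreeSet;

#[derive(Debug, PartialEq, Eq)]
pub enum CallError {
    NotFound,
}

/// Hypothetical scoped env: the set of (namespace, operation) pairs
/// a handler is allowed to invoke.
#[derive(Debug, Clone)]
pub struct ScopedEnv {
    allowed: BTreeSet<(String, String)>,
}

impl ScopedEnv {
    /// Static scoping: declared once, at registration.
    pub fn new(ops: &[(&str, &str)]) -> Self {
        let allowed = ops
            .iter()
            .map(|(ns, op)| (ns.to_string(), op.to_string()))
            .collect();
        Self { allowed }
    }

    /// Dynamic scoping: the result is the intersection of the current set
    /// and the requested set, so it can shrink but never grow.
    pub fn narrow(&self, ops: &[(&str, &str)]) -> Self {
        let requested = Self::new(ops);
        let allowed = self
            .allowed
            .intersection(&requested.allowed)
            .cloned()
            .collect();
        Self { allowed }
    }

    /// Stand-in for `invoke`: returns the resolved name instead of
    /// dispatching to a real handler.
    pub fn invoke(&self, namespace: &str, operation: &str) -> Result<String, CallError> {
        let key = (namespace.to_string(), operation.to_string());
        if self.allowed.contains(&key) {
            Ok(format!("{}/{}", namespace, operation))
        } else {
            Err(CallError::NotFound)
        }
    }
}
```

A handler env built with `fs/readFile` and `bash/exec` can hand a sandbox `narrow(&[("fs", "readFile")])`. Asking `narrow` for an operation the handler itself lacks has no effect.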
+
+### 5. The three controls together
+
+The three controls are independent and all are needed:
+
+| Control | What it gates | Without it |
+|---------|--------------|-----------|
+| Operation visibility | Whether an operation is callable from the wire | Internal operations exposed to external callers |
+| Handler identity | What authority composition runs under | ACL skipped or caller's scopes propagated (escalation) |
+| Scoped composition env | What operations a handler can reach | Handler can call anything in the registry |
+
+- Visibility alone: internal operations are hidden from the wire, but
+  composition skips ACL (escalation through buggy handler).
+- Handler identity alone: ACL checks against handler scopes, but the handler can
+  reach any operation (parameterized dispatch unbounded).
+- Scoped env alone: handler can only reach declared operations, but ACL is
+  skipped (if a declared operation requires a scope the handler doesn't have, it
+  still runs).
+
+All three together: the handler can only reach declared operations (scoped env),
+those operations are ACL-checked against the handler's scoped identity (handler
+identity), and internal operations are never exposed to the wire (visibility).
+Principle of least privilege.
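
How the three controls combine can be sketched as a single authorization function. All types here (`Operation`, `Origin`, `authorize`) are simplified, hypothetical stand-ins, and the sketch assumes each operation requires exactly one scope.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum CallError {
    NotFound,
    Forbidden,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Visibility {
    External,
    Internal,
}

pub struct Operation {
    pub name: &'static str,
    pub visibility: Visibility,
    pub required_scope: &'static str,
}

pub struct Identity {
    pub scopes: Vec<&'static str>,
}

/// Where a call came from, and the authority it carries.
pub enum Origin<'a> {
    Wire { caller: &'a Identity },
    Composition { handler: &'a Identity, scoped_env: &'a [&'static str] },
}

pub fn authorize(op: &Operation, origin: &Origin<'_>) -> Result<(), CallError> {
    let subject = match origin {
        Origin::Wire { caller } => {
            // Control 1 (visibility): internal operations do not exist on the wire.
            if op.visibility == Visibility::Internal {
                return Err(CallError::NotFound);
            }
            *caller
        }
        Origin::Composition { handler, scoped_env } => {
            // Control 3 (scoped env): undeclared operations are unreachable.
            if !scoped_env.contains(&op.name) {
                return Err(CallError::NotFound);
            }
            *handler
        }
    };
    // Control 2 (authority): ACL runs against the selected identity.
    if subject.scopes.contains(&op.required_scope) {
        Ok(())
    } else {
        Err(CallError::Forbidden)
    }
}
```

Removing any one branch reproduces the corresponding "without it" failure from the table above.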
+
+## Consequences
+
+**Positive:**
+- No privilege escalation through composition. A handler can only compose
+  operations its own identity is authorized for, and only from its declared
+  scope.
+- Parameterized dispatch is safe. The agent/LLM tool selection case is bounded
+  by the scoped env — the LLM picks from the declared tool set, not from the
+  entire registry. The ACL check runs against the handler's identity, not the
+  caller's.
+- Buggy handlers can't accidentally escalate. A handler that tries to call an
+  operation outside its scoped env gets `NOT_FOUND`; one that calls an operation
+  its identity lacks scopes for gets `FORBIDDEN`.
+- Attribution is complete. Every call carries both the caller's identity (who
+  initiated the chain) and the handler's identity (who is acting). The
+  `parent_request_id` chain traces the full agency chain. This supports the
+  gitea-per-agent pattern where each agent (human or LLM) has its own account.
+- Session-scoped operations (OQ-19) are safe by construction. They're always
+  `Internal`, run under the handler's identity, through the scoped env, in a
+  locked-down sandbox. The self-improving workflow (agents writing tools) is
+  bounded.
+- Role-based escalation is explicit. An agent requesting promotion (session →
+  core) is a lower-privileged role asking a higher-privileged role (architect
+  with `promote` scope) to perform an action. The escalation goes through the
+  chain of command, not through direct authority.
+
+**Negative:**
+- `OperationContext` has two identity fields (`identity` and
+  `handler_identity`), which is more complex than a single identity. This is
+  necessary — the principal/agent distinction is real and both are needed for
+  attribution and ACL.
+- The assembly layer has more responsibility: it must declare each handler's
+  identity (scopes), its scoped composition env (which operations it may
+  compose), and operation visibility. This is expected — the assembly layer
+  assembles everything (ADR-008), and forcing explicit declaration of privilege
+  is a feature, not a bug.
+- Adding a new composition to a handler requires updating the assembly layer
+  (declare the new operation in the scoped env), not just the handler code.
+  This prevents accidental composition of unauthorized operations.
+- The scoped env API is not fully specified here. The one-way constraint
+  (scoped env exists, is declared at registration, can be further scoped at
+  runtime) is fixed; the concrete API is a two-way door for implementation.
+
+## Assumptions
+
+1. **Internal calls should run under a different authority than external calls,
+   not skip ACL entirely.** If internal calls should skip ACL (the old `trusted`
+   model), this entire ADR is wrong. The assumption is that the escalation
+   vectors (buggy handler, parameterized dispatch) are real and must be
+   prevented.
+
+2. **Handler identity is set at registration by the assembly layer.** The
+   assembly layer is the trust boundary (ADR-008, ADR-014). If the assembly
+   layer is compromised, all handler identities are compromised. This is the
+   same trust boundary as capabilities.
+
+3. **The scoped env is declared at registration (static) and can be further
+   scoped at runtime (dynamic, for sandbox creation).** The static scoping is
+   the reachability control; the dynamic scoping is the sandbox boundary. If a
+   use case requires fully dynamic scoping (handler discovers at call time what
+   it can compose), the model needs extension — but the assumption is that
+   composition reachability is knowable at registration time.
+
+4. **`services/list` hides internal operations.** If internal operations should
+   be discoverable by remote callers (e.g., for debugging), the visibility model
+   needs a third state. The assumption is that internal operations are
+   implementation details, not part of the external API surface.
+
+5. **Internal operations return `NOT_FOUND`, not `FORBIDDEN`.** This prevents
+   existence leakage. If a use case requires distinguishing "you can't call
+   this" from "this doesn't exist" (e.g., for debugging), the error model needs
+   refinement. The assumption is that not leaking internal operation existence
+   is more important than debuggability from the wire.
+
+6. **The handler identity is a full `Identity` (with scopes), not a special
+   principal type.** This reuses the existing `Identity` type and `IdentityProvider`
+   infrastructure (ADR-004). If handler identities need different resolution
+   semantics (e.g., not resolvable through `IdentityProvider`), a separate type
+   may be needed. The assumption is that the existing identity infrastructure
+   suffices.
+
+## References
+
+- ADR-004: Auth as shared core (`IdentityProvider`, `Identity`)
+- ADR-008: Vault integration (assembly layer is the trust boundary)
+- ADR-014: Secret material flow and capability injection (capabilities are
+  orthogonal — both are set at registration by the assembly layer)
+- OQ-15: Call protocol client and adapter contract (adapters produce scoped envs)
+- OQ-17: Abort cascade (the call tree is the agency chain — `parent_request_id`
+  traces principal → agent)
+- OQ-19: Session-scoped registries (session operations are always `Internal`)
+- [operation-registry.md](../crates/call/operation-registry.md)
+- [call-protocol.md](../crates/call/call-protocol.md)
+- TypeScript `@alkdev/operations` `buildEnv()` with `allowedNamespaces` — prior
+  art for scoped composition env
+- POC at `/workspace/toolEnv` — demonstrated the sandbox-to-registry bridge with
+  the full-registry exposure gap