docs(arch): set OQ-42 decision direction — repo/adapter storage + Option 2 integration

Two structural decisions for dynamic resource ownership (OQ-42), recorded in the OQ so ADR drafting starts from a clear position:

1. Storage side reuses the repo/adapter pattern (ADR-033): a fourth instance alongside IdentityProvider/IdentityStore/CredentialStore. Trait in alknet-core with an in-memory default adapter; persistence adapter separable. Sync read with ArcSwap + honker-NOTIFY cache invalidation, same shape as ConfigIdentityProvider (ADR-035). No new shape invented; no Phase 0 needed for the storage side.

2. Integration point is Option 2: AccessControl::check consults the ownership provider directly. Rejected Option 1 (augment identity with a per-request snapshot) because its purity was theatrical: the question 'can X exec into container C' was never purely a function of identity, it just looked that way because the resource set was static. Option 2 makes check's signature honest about what ACL checking is in the presence of dynamic resources. Cost is a check signature change (one-way door, every call site updates), which is implementation cost, not semantic cost, per the project's decision principle.

Refinement that makes Option 2 clean: OperationSpec gains resource_id_path (JSON pointer into the input, e.g. '$.containerId'). Fits naturally with the existing JSON-Schema-backed input_schema: the pointer is within an existing schema on the same spec. OperationSpec becomes fully self-describing for authorization: resource type, action, and which input field drives the resource lookup, all declared on the spec.

Four specifics remain open for the ADR: the no-specific-resource (list) case, teardown coupling, fleet representation (spoke resources on the hub), and composition interaction with dynamic ownership. These were surfaced by choosing Option 2 rather than by leaving the integration point undecided.
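A minimal sketch of the shape the two decisions imply, under stated assumptions. The trait name `ResourceOwnershipProvider`, its method names, the `InMemoryOwnership` adapter, and the free-standing `check` function are all hypothetical; the real `AccessControl::check` signature is for the ADR to fix. The sketch also assumes identities are plain strings, that owning a resource grants every action on it, and it uses a `RwLock` where a persistence adapter would use the `ArcSwap` reload pattern.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical ownership trait: sync reads for the dispatch hot path,
/// writes for spawn and teardown.
pub trait ResourceOwnershipProvider: Send + Sync {
    /// Read: does `identity` own `<resource_type>/<resource_id>`?
    fn owns(&self, identity: &str, resource_type: &str, resource_id: &str) -> bool;
    /// Write: record that `identity` spawned the resource.
    fn record(&self, identity: &str, resource_type: &str, resource_id: &str);
    /// Write: drop the ownership entry when the resource is torn down.
    fn revoke(&self, resource_type: &str, resource_id: &str);
}

/// In-memory default adapter (assumption: one owner per resource).
#[derive(Default)]
pub struct InMemoryOwnership {
    // (resource_type, resource_id) -> owning identity
    owners: RwLock<HashMap<(String, String), String>>,
}

impl ResourceOwnershipProvider for InMemoryOwnership {
    fn owns(&self, identity: &str, resource_type: &str, resource_id: &str) -> bool {
        let key = (resource_type.to_string(), resource_id.to_string());
        let owners = self.owners.read().unwrap();
        owners.get(&key).map(|owner| owner == identity).unwrap_or(false)
    }

    fn record(&self, identity: &str, resource_type: &str, resource_id: &str) {
        let key = (resource_type.to_string(), resource_id.to_string());
        self.owners.write().unwrap().insert(key, identity.to_string());
    }

    fn revoke(&self, resource_type: &str, resource_id: &str) {
        let key = (resource_type.to_string(), resource_id.to_string());
        self.owners.write().unwrap().remove(&key);
    }
}

/// Option 2 in miniature: the provider is an explicit argument, so the
/// signature shows that the answer depends on current ownership state.
pub fn check(
    provider: &dyn ResourceOwnershipProvider,
    identity: &str,
    resource_type: &str,
    resource_id: &str,
) -> bool {
    provider.owns(identity, resource_type, resource_id)
}
```

In this sketch, a handler would call `record` when it spawns a container and `revoke` on teardown; `check` then answers a single targeted lookup for the resource ID the dispatcher extracted from the input. The no-specific-resource (list) case is deliberately left out because its semantics are still open.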
2026-07-04 13:04:14 +00:00
parent e29672942c
commit f390550a06
2 changed files with 140 additions and 88 deletions
--- a/docs/architecture/open-questions.md
+++ b/docs/architecture/open-questions.md
@@ -963,107 +963,159 @@ is a feature extension, not an unmade architecture decision.
 - **Origin**: [alknet-docker POC summary](../../research/alknet-docker/poc-summary.md)
  §"Open Unknowns" #3 (container-as-resource identity model); generalized
  during the Phase 1 review pass triggered by that research finding.
- **Status**: open
- **Door type**: One-way (the `Identity.resources` → `AccessControl::check`
-  model in core/call), two-way (the mechanism for supplementing it)
+- **Status**: open — the two structural decisions below are made; the
+  remaining questions (listed under "Open for the ADR") are what the ADR
+  must settle before the dependent crate specs can be drafted.
+- **Door type**: One-way (the `AccessControl::check` signature change and
+  the `OperationSpec.resource_id_path` addition in core/call), two-way (the
+  ownership provider mechanism, per the established repo/adapter pattern)
 - **Priority**: high — blocks the alknet-docker, alknet-tty, opencode-runner
  wrapper, and `alknet-container` (fleet normalization) crate specs. None of
-  those specs can declare their `AccessControl` shapes until this is resolved,
-  because the model available to them determines what ACL declarations are
-  even expressible. Permitting "docker picks a per-crate default and the
-  others follow" is the door-type-as-deferral anti-pattern (ADR-009 §"What
-  this framework is NOT"): each crate bakes in an ACL shape and downstream
-  crates build on whatever default was picked, making the "cheap reversal"
-  expensive.
- **Resolution**: Not yet made. This OQ records the issue and its scope so the
-  architecture workflow can see it; the resolution requires an ADR (and,
-  given the one-way door, likely a Phase 0 research/POC pass first).
+  those specs can declare their `AccessControl` shapes until this is
+  resolved, because the model available to them determines what ACL
+  declarations are even expressible. Permitting "docker picks a per-crate
+  default and the others follow" is the door-type-as-deferral anti-pattern
+  (ADR-009 §"What this framework is NOT"): each crate bakes in an ACL shape
+  and downstream crates build on whatever default was picked, making the
+  "cheap reversal" expensive.

-  **The issue, generalized.** The alknet-docker POC research flagged that
-  containers are a natural resource for `AccessControl` (`resource_type:
-  "container"`, `resource_action: "exec"`), but containers are created at
-  runtime — the resource set is dynamic, and "who can exec into container C"
-  is a function of "who created C," not of a static `PeerEntry.resources`
-  entry an operator wrote. The POC agent's one-line summary — "the
-  `IdentityProvider` model in alknet-core is currently static (`PeerEntry`
-  set). Dynamic resource ownership needs a spec" — is accurate, and the
-  consequence it draws is the right one.
+- **Resolution (decided so far).**

-  The issue is broader than the docker case that surfaced it. **Every crate
-  that spawns a thing at runtime and exposes it over the call protocol hits
-  the same shape:**
+  **Decision 1 — storage side: reuse the repo/adapter pattern.** The
+  ownership store is a fourth instance of the established repo/adapter
+  pattern (ADR-033), alongside `IdentityProvider` (ADR-004), `IdentityStore`
+  (ADR-035), and `CredentialStore` (ADR-031). Concretely: a trait in
+  `alknet-core` (read method: "does identity X own resource R with action
+  A?" / "what resources of type T does X own?"; write method: "record X
+  spawned R", "revoke R on teardown"), with an in-memory default adapter in
+  core. The in-memory default carries the docker/runner cases with no
+  backend dependency — ownership is runtime state, meaningless across
+  restarts (a container ID from a previous process doesn't exist), so the
+  default case has no persistence requirement. A persistence adapter (e.g.
+  sqlite/honker-backed, for a hub that wants fleet ownership to survive
+  restarts) is separable and built when a concrete use case forces it, same
+  as `alknet-store-sqlite` for peer/credential persistence. The read stays
+  sync (called from `AccessControl::check` on the dispatch hot path, no
+  `.await`), with persistence adapters caching in memory and using honker
+  NOTIFY for invalidation — same `ArcSwap`-backed full-reload pattern as
+  `ConfigIdentityProvider` (ADR-035). No new shape invented on the storage
+  side; no Phase 0 needed for it.

-  | Crate (some prospective) | Runtime-spawned resource | Ownership derivation |
-  |--------------------------|--------------------------|----------------------|
-  | alknet-docker | container ID | who created the container |
-  | alknet-tty | a TTY session (often bound to a container) | who opened the session |
-  | opencode-runner wrapper (a runner crate that starts an opencode instance in a container and wraps its OpenAPI surface via `alknet-http`'s `from_openapi`) | an opencode instance | who started the instance |
-  | `alknet-container` (fleet normalization, per the POC summary's §6 boundary) | a normalized container across multiple hosts | transitively, who created the underlying container |
+  **Decision 2 — integration point: Option 2, `check` consults the
+  ownership provider directly.** `AccessControl::check` grows a parameter
+  for the ownership provider (or reads one carried on `OperationContext`),
+  and consults it for `resource_type`/`resource_action` checks against
+  runtime-spawned resources. The alternative considered and rejected —
+  Option 1, augmenting `Identity.resources` with a per-request snapshot
+  before calling `check` — preserves `check`'s purity by moving the
+  impurity one frame up the stack: the dispatcher would pull owned
+  resources into a per-request identity snapshot so `check` *looks*
+  unchanged while reading state that was never part of the static
+  identity. The purity was always theatrical (the question "can X exec
+  into container C" was never purely a function of identity; it just
+  looked that way because the resource set was static). Option 2 makes
+  `check`'s signature honest about what ACL checking *is* in the presence
+  of dynamic resources: a function of (ACL, Identity, current-ownership-
+  state). The impurity is real either way; Option 2 puts it in the
+  signature where it's visible, rather than hiding it in a per-request
+  snapshot pretending to be static identity. Option 3 (handler-level
+  ownership check, `AccessControl` gates only scope) was rejected because
+  it splits the ACL story — some resources statically checked, some
+  handler-checked — which is the kind of inconsistency that creates the
+  "figure out how it fits with what is there" cleanup this OQ exists to
+  prevent.

-  All share: (a) resources are runtime-spawned, not config-declared; (b)
-  ownership is derived from creation, not from a static ACL entry; (c) the
-  resource set churns continuously (instances start and stop constantly),
-  so any model requiring operator config edits per resource is a non-starter;
-  (d) the resource's lifecycle is bound to a process the call protocol
-  itself is managing, so ownership state and spawn/teardown must be coupled,
-  not two separate operator workflows.
+  The cost of Option 2 is a `check` signature change — a one-way door,
+  every call site and test updates. Per the project's decision principle
+  (implementation workload is a non-issue relative to semantic correctness
+  and long-term clarity; "path of least resistance compounded over many
+  decisions is strictly dominated"), this is implementation cost, not a
+  semantic cost, and does not bias the choice.

-  Solving this inside the docker crate spec would re-solve it identically in
-  each of those crates and they would diverge. The model has to live in
-  core/call, once.
+  **Refinement that makes Option 2 work cleanly: `OperationSpec` declares
+  where the resource ID lives in the input.** `OperationSpec` gains a
+  `resource_id_path: Option<String>` — a JSON pointer into the operation
+  input, e.g. `"$.containerId"` for `docker/container/exec`. The dispatcher
+  extracts the resource ID from the input using the spec-declared path,
+  passes it to `check`, and `check` asks the provider "does this identity
+  own `<resource_type>/<resource_id>` with action `<resource_action>`?" —
+  a single targeted lookup, not a whole-resource-set pull. The fit with
+  JSON Schema is load-bearing, not incidental: `OperationSpec.input_schema`
+  is already a JSON Schema, so `resource_id_path` is a pointer *within* an
+  existing schema on the same spec. The `OperationSpec` becomes fully
+  self-describing for authorization — what resource type, what action, and
+  *which input field* drives the resource lookup. No per-namespace
+  conventions, no handler-level knowledge, no "the dispatcher just knows."
+  The contract is on the spec, where it belongs.

-  **What the current model does, and why it doesn't fit.** Today
-  `Identity.resources: HashMap<String, Vec<String>>` is populated on two
-  paths, both static: from `PeerEntry.resources` (fingerprint/auth-token,
-  ADR-030) or from `CompositionAuthority.resources` (composition, ADR-015 /
-  ADR-022). `AccessControl::check`
-  (`crates/alknet-call/src/registry/spec.rs:59–108`) does a literal
-  `identity.resources.get(resource_type)` lookup and a string-equals match
-  on `resource_action`. There is no notion of "the resource set is dynamic"
-  in that function — it reads a map built when the identity was resolved.
-  `ConfigIdentityProvider` reads from `ArcSwap<DynamicConfig>` (hot-reloadable
-  config), and `IdentityStore` (ADR-035) adds async `put_peer` /
-  `update_peer` / `remove_peer` — but those are *administrative* mutations
-  (operator-grade peer-record management), not "peer X just spawned resource
-  Y at runtime; record that Y is owned by X." The resource ownership path is
-  a different concern from peer-record management, even if they end up
-  sharing storage.
+- **Open for the ADR.** The decisions above settle the storage shape and
+  the integration point. The ADR must still address:

-  **The one-way door.** Whether `Identity.resources` stays static-config-
-  sourced or gains a dynamic-ownership supplement (and what the trait shape
-  for that supplement is) determines what `AccessControl` declarations the
-  docker/tty/runner/fleet specs can make, what every `AccessControl::check`
-  call site reads, and what every consumer of `Identity` assumes. This is
-  core, not per-crate. The *mechanism* (a new `ResourceOwnershipProvider`
-  trait vs. extending `IdentityStore` vs. labels/ownership-tags on the
-  spawned resources themselves vs. a capability-style model) is two-way —
-  additive, reversible per the ADR-009 definition — but the model-level
-  decision is one-way.
+  1. **No-specific-resource operations (the `list` case).** Operations
+     with `resource_type` set but `resource_id_path` absent — e.g.
+     `docker/container/list`, which doesn't reference a specific container.
+     There the question is "does this peer have *any* resource of this
+     type?" rather than "does this peer own *this* resource?" Option 2
+     handles this naturally (check asks the provider "any resources of
+     type T for this identity?" when no specific ID is present), but the
+     exact semantics need to be pinned: does the ACL gate the whole call
+     (allow/deny), or does the handler filter the result to owned
+     containers (allow + filter)? The former is the scope-gating path; the
+     latter is the result-filtering path. They compose (scope-gate the
+     call, then filter the result), but the ADR should state which is the
+     default and how a spec declares which it wants. `list` is the case
+     that forces this; `exec`/`inspect`/`stop` against a specific
+     container are the clean case.

-  **Known consumers of the resolution.** Any spec that declares an
-  `AccessControl` with `resource_type`/`resource_action` against a
-  runtime-spawned resource set needs this resolved first. Today that's the
-  docker, tty, opencode-runner, and `alknet-container` specs; future
-  runner-shaped crates (any GPU-job runner, any "spawn a process and expose
-  it over the call protocol" crate) inherit the same requirement. The
-  resolution should make the model available in core/call so those specs
-  declare ACLs against it directly, rather than each crate inventing a
-  per-crate ownership layer.
+  2. **Teardown coupling.** The ownership store's write path (revoke on
+     teardown) must be coupled to the spawned resource's lifecycle, not
+     left to operator workflows. When a container dies or is removed, the
+     ownership entry must be revoked — otherwise the store accumulates
+     stale entries and an ACL check could reference a resource that no
+     longer exists. The coupling mechanism (the docker handler explicitly
+     calls revoke on container exit, vs. a background reaper, vs. TTL-based
+     expiry) is two-way-door mechanism work, but the ADR should state the
+     coupling requirement and the default mechanism.
+
+  3. **Fleet representation (spoke resources on the hub).** When a worker
+     spoke spawns a container and exposes it to the hub over the call
+     protocol, the hub's ownership store needs to represent "peer X owns
+     resource R" for routing/ACL on the hub side. Whether the spoke pushes
+     ownership records to the hub on spawn, the hub derives them from
+     `from_call`-discovered operations, or the spoke owns the ACL decision
+     and the hub forwards — is a real question with cross-node state
+     implications. The POC summary's §6 head-worker/machine-node model
+     frames the topology; this question is where that topology meets the
+     ownership model. Likely the most consequential of the four open
+     questions.
+
+  4. **Composition interaction.** ADR-015/022 populate
+     `CompositionAuthority.resources` for internal composition calls. With
+     dynamic ownership, an internal composition that targets a runtime-
+     spawned resource (a handler composing `docker/container/exec` against
+     a specific container) needs the composition authority to be checkable
+     against the ownership store too, not just against the static
+     `CompositionAuthority.resources` map. Whether `CompositionAuthority`
+     grows a dynamic-ownership path parallel to `Identity`, or composition
+     always runs under the caller's ownership, or some third option — needs
+     to be stated so the privilege model stays coherent with the ownership
+     model.
+
+  These are genuine open questions, not deferred decisions — the ADR must
+  answer them. They were surfaced by choosing Option 2 + `resource_id_path`
+  rather than by leaving the integration point undecided; recording them
+  here so the ADR drafting starts from a known set of specifics to work
+  out, not from a blank page.

-  **Phase 0 may be warranted.** Given the one-way door at the model level
-  and the number of plausible mechanisms (each with different tradeoffs
-  around consistency, teardown coupling, fleet-scale state, and how a
-  remote spoke's spawned resources are represented on the hub), a targeted
-  research/POC pass before the ADR is likely the right sequencing — but
-  that's a decision to make once the candidate mechanisms are enumerated,
-  not something this OQ pre-commits.
 - **Cross-references**: ADR-009 (door-type-as-deferral anti-pattern),
  ADR-015, ADR-022 (the static `CompositionAuthority.resources` model this
-  questions), ADR-030, ADR-035 (`IdentityStore` — administrative peer
-  mutations, a different concern from runtime resource ownership),
+  extends — see open question 4), ADR-030, ADR-033 (repo/adapter pattern —
+  reused for the ownership store), ADR-035 (`IdentityStore` —
+  administrative peer mutations, a different concern from runtime resource
+  ownership, but the sync-read + ArcSwap + honker-NOTIFY shape is reused),
  [auth.md](crates/core/auth.md) (`Identity.resources`, `AccessControl::check`
-  interaction),
+  interaction — both under edit by this decision),
  [operation-registry.md](crates/call/operation-registry.md) (`AccessControl`,
-  `OperationSpec`),
+  `OperationSpec` — `resource_id_path` addition),
  [alknet-docker POC summary](../../research/alknet-docker/poc-summary.md)
  §"Open Unknowns" #3