docs(arch): set OQ-42 decision direction — repo/adapter storage + Option 2 integration

Two structural decisions for dynamic resource ownership (OQ-42), recorded
in the OQ so ADR drafting starts from a clear position:

1. Storage side reuses the repo/adapter pattern (ADR-033) — a fourth
   instance alongside IdentityProvider/IdentityStore/CredentialStore.
   Trait in alknet-core with an in-memory default adapter; persistence
   adapter separable. Sync read with ArcSwap + honker-NOTIFY cache
   invalidation, same shape as ConfigIdentityProvider (ADR-035). No new
   shape invented; no Phase 0 needed for the storage side.

2. Integration point is Option 2 — AccessControl::check consults the
   ownership provider directly. Rejected Option 1 (augment identity with
   a per-request snapshot) because its purity was theatrical — the
   question 'can X exec into container C' was never purely a function of
   identity, it just looked that way because the resource set was static.
   Option 2 makes check's signature honest about what ACL checking is in
   the presence of dynamic resources. Cost is a check signature change
   (one-way door, every call site updates) — implementation cost, not
   semantic cost, per the project's decision principle.

Refinement that makes Option 2 clean: OperationSpec gains resource_id_path
(JSON pointer into the input, e.g. '$.containerId'). Fits naturally with
the existing JSON-Schema-backed input_schema — the pointer is within an
existing schema on the same spec. OperationSpec becomes fully
self-describing for authorization: resource type, action, and which input
field drives the resource lookup, all declared on the spec.

Four specifics remain open for the ADR: the no-specific-resource (list)
case, teardown coupling, fleet representation (spoke resources on the
hub), and composition interaction with dynamic ownership. These were
surfaced by choosing Option 2 rather than by leaving the integration point
undecided.
2026-07-04 13:04:14 +00:00
parent e29672942c
commit f390550a06
2 changed files with 140 additions and 88 deletions

@@ -963,107 +963,159 @@ is a feature extension, not an unmade architecture decision.
- **Origin**: [alknet-docker POC summary](../../research/alknet-docker/poc-summary.md)
§"Open Unknowns" #3 (container-as-resource identity model); generalized
during the Phase 1 review pass triggered by that research finding.
- **Status**: open
- **Door type**: One-way (the `Identity.resources` / `AccessControl::check`
model in core/call), two-way (the mechanism for supplementing it)
- **Status**: open — the two structural decisions below are made; the
remaining questions (listed under "Open for the ADR") are what the ADR
must settle before the dependent crate specs can be drafted.
- **Door type**: One-way (the `AccessControl::check` signature change and
the `OperationSpec.resource_id_path` addition in core/call), two-way (the
ownership provider mechanism, per the established repo/adapter pattern)
- **Priority**: high — blocks the alknet-docker, alknet-tty, opencode-runner
wrapper, and `alknet-container` (fleet normalization) crate specs. None of
those specs can declare their `AccessControl` shapes until this is resolved,
because the model available to them determines what ACL declarations are
even expressible. Permitting "docker picks a per-crate default and the
others follow" is the door-type-as-deferral anti-pattern (ADR-009 §"What
this framework is NOT"): each crate bakes in an ACL shape and downstream
crates build on whatever default was picked, making the "cheap reversal"
expensive.
- **Resolution**: Not yet made. This OQ records the issue and its scope so the
architecture workflow can see it; the resolution requires an ADR (and,
given the one-way door, likely a Phase 0 research/POC pass first).
those specs can declare their `AccessControl` shapes until this is
resolved, because the model available to them determines what ACL
declarations are even expressible. Permitting "docker picks a per-crate
default and the others follow" is the door-type-as-deferral anti-pattern
(ADR-009 §"What this framework is NOT"): each crate bakes in an ACL shape
and downstream crates build on whatever default was picked, making the
"cheap reversal" expensive.
**The issue, generalized.** The alknet-docker POC research flagged that
containers are a natural resource for `AccessControl` (`resource_type:
"container"`, `resource_action: "exec"`), but containers are created at
runtime — the resource set is dynamic, and "who can exec into container C"
is a function of "who created C," not of a static `PeerEntry.resources`
entry an operator wrote. The POC agent's one-line summary — "the
`IdentityProvider` model in alknet-core is currently static (`PeerEntry`
set). Dynamic resource ownership needs a spec" — is accurate, and the
consequence it draws is the right one.
- **Resolution (decided so far).**
The issue is broader than the docker case that surfaced it. **Every crate
that spawns a thing at runtime and exposes it over the call protocol hits
the same shape:**
**Decision 1 — storage side: reuse the repo/adapter pattern.** The
ownership store is a fourth instance of the established repo/adapter
pattern (ADR-033), alongside `IdentityProvider` (ADR-004), `IdentityStore`
(ADR-035), and `CredentialStore` (ADR-031). Concretely: a trait in
`alknet-core` (read methods: "does identity X own resource R with action
A?" and "what resources of type T does X own?"; write methods: "record X
spawned R" and "revoke R on teardown"), with an in-memory default adapter in
core. The in-memory default carries the docker/runner cases with no
backend dependency — ownership is runtime state, meaningless across
restarts (a container ID from a previous process doesn't exist), so the
default case has no persistence requirement. A persistence adapter (e.g.
sqlite/honker-backed, for a hub that wants fleet ownership to survive
restarts) is separable and built when a concrete use case forces it, same
as `alknet-store-sqlite` for peer/credential persistence. The read stays
sync (called from `AccessControl::check` on the dispatch hot path, no
`.await`), with persistence adapters caching in memory and using honker
NOTIFY for invalidation — same `ArcSwap`-backed full-reload pattern as
`ConfigIdentityProvider` (ADR-035). No new shape invented on the storage
side; no Phase 0 needed for it.
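
A minimal sketch of what the Decision 1 trait and its in-memory default adapter could look like. The trait name `ResourceOwnershipProvider` is taken from a candidate named elsewhere in this OQ; the method names, the `ResourceKey` type, and the use of `std::sync::RwLock` (rather than `ArcSwap`, which the text reserves for persistence adapters) are assumptions for illustration, not the decided API.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::RwLock;

/// Hypothetical key for one runtime-spawned resource, e.g. ("container", "c1").
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub struct ResourceKey {
    pub resource_type: String,
    pub resource_id: String,
}

/// Hypothetical method set. The OQ fixes only the shape: sync reads,
/// plus record-on-spawn and revoke-on-teardown writes.
pub trait ResourceOwnershipProvider: Send + Sync {
    /// "Does identity X own resource R with action A?"
    fn owns(&self, identity: &str, key: &ResourceKey, action: &str) -> bool;
    /// "What resources of type T does X own?" (sorted resource IDs)
    fn owned_of_type(&self, identity: &str, resource_type: &str) -> Vec<String>;
    /// "Record X spawned R."
    fn record(&self, identity: &str, key: ResourceKey, actions: &[&str]);
    /// "Revoke R on teardown."
    fn revoke(&self, key: &ResourceKey);
}

struct Entry {
    owner: String,
    actions: HashSet<String>,
}

/// In-memory default adapter: runtime state only, nothing survives a restart.
#[derive(Default)]
pub struct InMemoryOwnership {
    entries: RwLock<HashMap<ResourceKey, Entry>>,
}

impl ResourceOwnershipProvider for InMemoryOwnership {
    fn owns(&self, identity: &str, key: &ResourceKey, action: &str) -> bool {
        let entries = self.entries.read().unwrap();
        entries
            .get(key)
            .map_or(false, |e| e.owner == identity && e.actions.contains(action))
    }

    fn owned_of_type(&self, identity: &str, resource_type: &str) -> Vec<String> {
        let entries = self.entries.read().unwrap();
        let mut ids: Vec<String> = entries
            .iter()
            .filter(|(k, e)| k.resource_type == resource_type && e.owner == identity)
            .map(|(k, _)| k.resource_id.clone())
            .collect();
        ids.sort();
        ids
    }

    fn record(&self, identity: &str, key: ResourceKey, actions: &[&str]) {
        let entry = Entry {
            owner: identity.to_string(),
            actions: actions.iter().map(|a| a.to_string()).collect(),
        };
        self.entries.write().unwrap().insert(key, entry);
    }

    fn revoke(&self, key: &ResourceKey) {
        self.entries.write().unwrap().remove(key);
    }
}
```

Both reads are plain synchronous calls, which is the property the dispatch hot path needs. A persistence adapter would implement the same trait over its own cached state.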
| Crate (some prospective) | Runtime-spawned resource | Ownership derivation |
|--------------------------|--------------------------|----------------------|
| alknet-docker | container ID | who created the container |
| alknet-tty | a TTY session (often bound to a container) | who opened the session |
| opencode-runner wrapper (a runner crate that starts an opencode instance in a container and wraps its OpenAPI surface via `alknet-http`'s `from_openapi`) | an opencode instance | who started the instance |
| `alknet-container` (fleet normalization, per the POC summary's §6 boundary) | a normalized container across multiple hosts | transitively, who created the underlying container |
**Decision 2 — integration point: Option 2, `check` consults the
ownership provider directly.** `AccessControl::check` grows a parameter
for the ownership provider (or reads one carried on `OperationContext`),
and consults it for `resource_type`/`resource_action` checks against
runtime-spawned resources. The alternative considered and rejected —
Option 1, augmenting `Identity.resources` with a per-request snapshot
before calling `check` — preserves `check`'s purity by moving the
impurity one frame up the stack: the dispatcher would pull owned
resources into a per-request identity snapshot so `check` *looks*
unchanged while reading state that was never part of the static
identity. The purity was always theatrical (the question "can X exec
into container C" was never purely a function of identity; it just
looked that way because the resource set was static). Option 2 makes
`check`'s signature honest about what ACL checking *is* in the presence
of dynamic resources: a function of (ACL, Identity, current-ownership-
state). The impurity is real either way; Option 2 puts it in the
signature where it's visible, rather than hiding it in a per-request
snapshot pretending to be static identity. Option 3 (handler-level
ownership check, `AccessControl` gates only scope) was rejected because
it splits the ACL story — some resources statically checked, some
handler-checked — which is the kind of inconsistency that creates the
"figure out how it fits with what is there" cleanup this OQ exists to
prevent.
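
To make "a function of (ACL, Identity, current-ownership-state)" concrete, here is a hedged sketch of an Option 2 style `check`. The `Identity` and `AccessControl` structs are simplified stand-ins, `OwnershipRead` and `FixedOwnership` are hypothetical names, and the rule for combining the static map with dynamic ownership (either one suffices) is an assumption; the OQ does not state how the two combine.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified stand-in for the resolved identity and its static resource map
/// (resource type -> allowed actions).
pub struct Identity {
    pub id: String,
    pub resources: HashMap<String, Vec<String>>,
}

/// Simplified stand-in for the ACL declared on an operation.
pub struct AccessControl {
    pub resource_type: String,
    pub resource_action: String,
}

/// Hypothetical read-only view of current ownership state.
pub trait OwnershipRead {
    fn owns(&self, identity: &str, resource_type: &str, resource_id: &str, action: &str) -> bool;
}

/// Fixed set of (identity, type, id, action) tuples, for illustration only.
pub struct FixedOwnership(pub HashSet<(String, String, String, String)>);

impl OwnershipRead for FixedOwnership {
    fn owns(&self, identity: &str, resource_type: &str, resource_id: &str, action: &str) -> bool {
        self.0.contains(&(
            identity.to_string(),
            resource_type.to_string(),
            resource_id.to_string(),
            action.to_string(),
        ))
    }
}

impl AccessControl {
    /// The ownership state is an explicit parameter, so the dependency
    /// is visible in the signature instead of hidden in a snapshot.
    pub fn check(
        &self,
        identity: &Identity,
        ownership: &dyn OwnershipRead,
        resource_id: Option<&str>,
    ) -> bool {
        // Static path: literal map lookup plus string-equals on the action.
        let static_ok = identity
            .resources
            .get(&self.resource_type)
            .map_or(false, |actions| actions.iter().any(|a| a == &self.resource_action));
        if static_ok {
            return true;
        }
        // Dynamic path: one targeted lookup for the specific resource.
        match resource_id {
            Some(id) => ownership.owns(&identity.id, &self.resource_type, id, &self.resource_action),
            // No specific resource (the `list` case) is left open for the ADR;
            // denying here is a placeholder, not a decision.
            None => false,
        }
    }
}
```

Every call site has to supply the ownership view, which is the signature-change cost the text describes.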
All share: (a) resources are runtime-spawned, not config-declared; (b)
ownership is derived from creation, not from a static ACL entry; (c) the
resource set churns continuously (instances start and stop constantly),
so any model requiring operator config edits per resource is a non-starter;
(d) the resource's lifecycle is bound to a process the call protocol
itself is managing, so ownership state and spawn/teardown must be coupled,
not two separate operator workflows.
The cost of Option 2 is a `check` signature change — a one-way door,
every call site and test updates. Per the project's decision principle
(implementation workload is a non-issue relative to semantic correctness
and long-term clarity; "path of least resistance compounded over many
decisions is strictly dominated"), this is implementation cost, not a
semantic cost, and does not bias the choice.
Solving this inside the docker crate spec would re-solve it identically in
each of those crates and they would diverge. The model has to live in
core/call, once.
**Refinement that makes Option 2 work cleanly: `OperationSpec` declares
where the resource ID lives in the input.** `OperationSpec` gains a
`resource_id_path: Option<String>` — a JSON pointer into the operation
input, e.g. `"$.containerId"` for `docker/container/exec`. The dispatcher
extracts the resource ID from the input using the spec-declared path,
passes it to `check`, and `check` asks the provider "does this identity
own `<resource_type>/<resource_id>` with action `<resource_action>`?" —
a single targeted lookup, not a whole-resource-set pull. The fit with
JSON Schema is load-bearing, not incidental: `OperationSpec.input_schema`
is already a JSON Schema, so `resource_id_path` is a pointer *within* an
existing schema on the same spec. The `OperationSpec` becomes fully
self-describing for authorization — what resource type, what action, and
*which input field* drives the resource lookup. No per-namespace
conventions, no handler-level knowledge, no "the dispatcher just knows."
The contract is on the spec, where it belongs.
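
The dispatcher-side extraction can be sketched as follows. Two caveats: the example path `"$.containerId"` is written in JSONPath-style notation, whereas an RFC 6901 JSON Pointer would read `/containerId`, and the text does not settle which syntax is intended, so this sketch handles only the simple dotted `$.a.b` form. The `Json` enum is a minimal stand-in for a real JSON value type, and `extract_resource_id` and `resource_id_for` are hypothetical helper names.

```rust
use std::collections::BTreeMap;

/// Minimal stand-in for a JSON value; a real dispatcher would use a JSON library.
#[derive(Debug, Clone, PartialEq)]
pub enum Json {
    Str(String),
    Object(BTreeMap<String, Json>),
}

/// Simplified stand-in for the authorization-relevant fields of `OperationSpec`.
pub struct OperationSpec {
    pub resource_type: Option<String>,
    pub resource_action: Option<String>,
    pub resource_id_path: Option<String>,
}

/// Resolve a dotted path such as "$.containerId" or "$.target.id" to a string field.
/// Returns None if the path is malformed, a segment is missing, or the
/// final value is not a string.
pub fn extract_resource_id<'a>(input: &'a Json, path: &str) -> Option<&'a str> {
    let rest = path.strip_prefix("$.")?;
    let mut current = input;
    for segment in rest.split('.') {
        match current {
            Json::Object(fields) => current = fields.get(segment)?,
            _ => return None,
        }
    }
    match current {
        Json::Str(s) => Some(s.as_str()),
        _ => None,
    }
}

/// What the dispatcher would hand to `check`: the resource ID if the spec
/// declares a path, or None for operations with no specific resource.
pub fn resource_id_for<'a>(spec: &OperationSpec, input: &'a Json) -> Option<&'a str> {
    spec.resource_id_path
        .as_deref()
        .and_then(|path| extract_resource_id(input, path))
}
```

A spec with `resource_type` set but no `resource_id_path` yields `None` here, which is exactly the `list` case the ADR still has to pin down.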
**What the current model does, and why it doesn't fit.** Today
`Identity.resources: HashMap<String, Vec<String>>` is populated on two
paths, both static: from `PeerEntry.resources` (fingerprint/auth-token,
ADR-030) or from `CompositionAuthority.resources` (composition, ADR-015 /
ADR-022). `AccessControl::check`
(`crates/alknet-call/src/registry/spec.rs:59-108`) does a literal
`identity.resources.get(resource_type)` lookup and a string-equals match
on `resource_action`. There is no notion of "the resource set is dynamic"
in that function — it reads a map built when the identity was resolved.
`ConfigIdentityProvider` reads from `ArcSwap<DynamicConfig>` (hot-reloadable
config), and `IdentityStore` (ADR-035) adds async `put_peer` /
`update_peer` / `remove_peer` — but those are *administrative* mutations
(operator-grade peer-record management), not "peer X just spawned resource
Y at runtime; record that Y is owned by X." The resource ownership path is
a different concern from peer-record management, even if they end up
sharing storage.
- **Open for the ADR.** The decisions above settle the storage shape and
the integration point. The ADR must still address:
**The one-way door.** Whether `Identity.resources` stays static-config-
sourced or gains a dynamic-ownership supplement (and what the trait shape
for that supplement is) determines what `AccessControl` declarations the
docker/tty/runner/fleet specs can make, what every `AccessControl::check`
call site reads, and what every consumer of `Identity` assumes. This is
core, not per-crate. The *mechanism* (a new `ResourceOwnershipProvider`
trait vs. extending `IdentityStore` vs. labels/ownership-tags on the
spawned resources themselves vs. a capability-style model) is two-way —
additive, reversible per the ADR-009 definition — but the model-level
decision is one-way.
1. **No-specific-resource operations (the `list` case).** Operations
with `resource_type` set but `resource_id_path` absent — e.g.
`docker/container/list`, which doesn't reference a specific container.
There the question is "does this peer have *any* resource of this
type?" rather than "does this peer own *this* resource?" Option 2
handles this naturally (check asks the provider "any resources of
type T for this identity?" when no specific ID is present), but the
exact semantics need to be pinned: does the ACL gate the whole call
(allow/deny), or does the handler filter the result to owned
containers (allow + filter)? The former is the scope-gating path; the
latter is the result-filtering path. They compose (scope-gate the
call, then filter the result), but the ADR should state which is the
default and how a spec declares which it wants. `list` is the case
that forces this; `exec`/`inspect`/`stop` against a specific
container are the clean case.
**Known consumers of the resolution.** Any spec that declares an
`AccessControl` with `resource_type`/`resource_action` against a
runtime-spawned resource set needs this resolved first. Today that's the
docker, tty, opencode-runner, and `alknet-container` specs; future
runner-shaped crates (any GPU-job runner, any "spawn a process and expose
it over the call protocol" crate) inherit the same requirement. The
resolution should make the model available in core/call so those specs
declare ACLs against it directly, rather than each crate inventing a
per-crate ownership layer.
2. **Teardown coupling.** The ownership store's write path (revoke on
teardown) must be coupled to the spawned resource's lifecycle, not
left to operator workflows. When a container dies or is removed, the
ownership entry must be revoked — otherwise the store accumulates
stale entries and an ACL check could reference a resource that no
longer exists. The coupling mechanism (the docker handler explicitly
calls revoke on container exit, vs. a background reaper, vs. TTL-based
expiry) is two-way-door mechanism work, but the ADR should state the
coupling requirement and the default mechanism.
3. **Fleet representation (spoke resources on the hub).** When a worker
spoke spawns a container and exposes it to the hub over the call
protocol, the hub's ownership store needs to represent "peer X owns
resource R" for routing/ACL on the hub side. Whether the spoke pushes
ownership records to the hub on spawn, the hub derives them from
`from_call`-discovered operations, or the spoke owns the ACL decision
and the hub forwards is a real question with cross-node state
implications. The POC summary's §6 head-worker/machine-node model
frames the topology; this question is where that topology meets the
ownership model. Likely the most consequential of the four open
questions.
4. **Composition interaction.** ADR-015/022 populate
`CompositionAuthority.resources` for internal composition calls. With
dynamic ownership, an internal composition that targets a runtime-
spawned resource (a handler composing `docker/container/exec` against
a specific container) needs the composition authority to be checkable
against the ownership store too, not just against the static
`CompositionAuthority.resources` map. Whether `CompositionAuthority`
grows a dynamic-ownership path parallel to `Identity`, composition
always runs under the caller's ownership, or some third option applies
needs to be stated so the privilege model stays coherent with the
ownership model.
These are genuine open questions, not deferred decisions: the ADR must
answer them. They were surfaced by choosing Option 2 + `resource_id_path`
rather than by leaving the integration point undecided, and they are
recorded here so that ADR drafting starts from a known set of specifics
to work out, not from a blank page.
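
As one illustration of the teardown-coupling requirement in open question 2, the sketch below ties the ownership entry to the lifetime of a guard value, so revocation happens when the guard is dropped rather than through a separate operator step. This is a variant of the "handler explicitly calls revoke" mechanism, chosen only because it is compact to show; it is not a proposal for the default mechanism. `OwnershipStore` and `OwnedResource` are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical, simplified ownership store: resource ID -> owning identity.
#[derive(Default)]
pub struct OwnershipStore {
    owners: Mutex<HashMap<String, String>>,
}

impl OwnershipStore {
    pub fn record(&self, resource_id: &str, owner: &str) {
        self.owners
            .lock()
            .unwrap()
            .insert(resource_id.to_string(), owner.to_string());
    }

    pub fn revoke(&self, resource_id: &str) {
        self.owners.lock().unwrap().remove(resource_id);
    }

    pub fn owner_of(&self, resource_id: &str) -> Option<String> {
        self.owners.lock().unwrap().get(resource_id).cloned()
    }
}

/// Guard held by whatever manages the spawned resource (e.g. a container handle).
/// The ownership entry exists exactly as long as the guard does.
pub struct OwnedResource {
    store: Arc<OwnershipStore>,
    resource_id: String,
}

impl OwnedResource {
    /// Record ownership at spawn time and return the guard.
    pub fn spawn(store: Arc<OwnershipStore>, resource_id: &str, owner: &str) -> Self {
        store.record(resource_id, owner);
        OwnedResource {
            store,
            resource_id: resource_id.to_string(),
        }
    }
}

impl Drop for OwnedResource {
    /// Teardown and revocation are the same event: no stale entries remain.
    fn drop(&mut self) {
        self.store.revoke(&self.resource_id);
    }
}
```

A guard only covers teardown that the managing process observes. A resource that dies out-of-band would still need one of the other mechanisms listed above, such as a background reaper or TTL-based expiry.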
**Phase 0 may be warranted.** Given the one-way door at the model level
and the number of plausible mechanisms (each with different tradeoffs
around consistency, teardown coupling, fleet-scale state, and how a
remote spoke's spawned resources are represented on the hub), a targeted
research/POC pass before the ADR is likely the right sequencing — but
that's a decision to make once the candidate mechanisms are enumerated,
not something this OQ pre-commits.
- **Cross-references**: ADR-009 (door-type-as-deferral anti-pattern),
ADR-015, ADR-022 (the static `CompositionAuthority.resources` model this
questions), ADR-030, ADR-035 (`IdentityStore` — administrative peer
mutations, a different concern from runtime resource ownership),
extends — see open question 4), ADR-030, ADR-033 (repo/adapter pattern —
reused for the ownership store), ADR-035 (`IdentityStore`:
administrative peer mutations, a different concern from runtime resource
ownership, but the sync-read + ArcSwap + honker-NOTIFY shape is reused),
[auth.md](crates/core/auth.md) (`Identity.resources`, `AccessControl::check`
interaction),
interaction — both under edit by this decision),
[operation-registry.md](crates/call/operation-registry.md) (`AccessControl`,
`OperationSpec`),
`OperationSpec`; `resource_id_path` addition),
[alknet-docker POC summary](../../research/alknet-docker/poc-summary.md)
§"Open Unknowns" #3