From 79d8561bb4d3e267eb9fed1d810e1a3bd741fd8a Mon Sep 17 00:00:00 2001 From: "glm-5.2" Date: Thu, 25 Jun 2026 12:44:49 +0000 Subject: [PATCH] =?UTF-8?q?docs(research):=20alknet-call=20completion=20ga?= =?UTF-8?q?p=20analysis=20=E2=80=94=20CallClient=20+=20from=5Fcall=20+=20O?= =?UTF-8?q?perationAdapter?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Gap analysis for completing alknet-call: the server-side core (~5.7k lines, 159 tests) is implemented, but the client side (CallClient), the bilateral exchange mechanism (from_call), and the adapter contract (OperationAdapter trait) are specced in ADR-017 and unimplemented. Records: implementation state (verified against src/), 5 decisions needed (peer-scoped registry filtering as the load-bearing one), the settled adapter location map (trait + from_call + from_jsonschema in alknet-call; from_openapi/ from_mcp in alknet-http), the no-env-vars invariant (Capabilities → from_openapi handler → HTTP header), and the exchange-of-operations runner pattern with dispatch as the concrete downstream consumer. --- .../alknet-call-completion/gap-analysis.md | 410 ++++++++++++++++++ 1 file changed, 410 insertions(+) create mode 100644 docs/research/alknet-call-completion/gap-analysis.md diff --git a/docs/research/alknet-call-completion/gap-analysis.md b/docs/research/alknet-call-completion/gap-analysis.md new file mode 100644 index 0000000..03ec23c --- /dev/null +++ b/docs/research/alknet-call-completion/gap-analysis.md @@ -0,0 +1,410 @@ +--- +status: draft +last_updated: 2026-06-25 +--- + +# alknet-call Completion — Gap Analysis + +This document captures the gap between the existing alknet-call architecture +(ADRs 005/007/012/014/015/016/017/022/023/024, specs in +`docs/architecture/crates/call/`) and the current implementation +(`crates/alknet-call/src/`), the decisions needed before implementation can +proceed, and the downstream crates this completion unblocks. 
+ +Unlike the alknet-ssh phase-0 findings (a true exploration doc for a crate with +no existing architecture), this is a **gap analysis + decision record** for +completing existing architecture. The specs are largely settled; the work is +implementing what's specced and resolving a small number of decisions the specs +left as two-way doors or didn't address. + +## Implementation State (Verified) + +The call protocol's server-side core is implemented and tested (159 tests, +passing). What's missing is the **client side** and the **adapter contract**. + +### Implemented (~5,773 lines, 159 tests) + +| Component | File | Lines | Status | +|-----------|------|-------|--------| +| `CallAdapter` (ProtocolHandler for `alknet/call`) | `protocol/adapter.rs` | 1,051 | Done | +| `CallConnection` (Layer 2 overlay, call/subscribe/abort) | `protocol/connection.rs` | 780 | **Partial** — see below | +| Wire framing (`EventEnvelope`, `FrameFramedReader/Writer`) | `protocol/wire.rs` | 544 | Done | +| `PendingRequestMap` (ID-based correlation) | `protocol/pending.rs` | 584 | Done | +| Abort cascade | `protocol/abort.rs` | 393 | Done | +| `OperationRegistry`, `HandlerRegistration`, builder | `registry/registration.rs` | 734 | Done | +| `OperationSpec`, `AccessControl`, `Visibility`, `ErrorDefinition` | `registry/spec.rs` | 321 | Done | +| `OperationContext`, `ScopedOperationEnv`, `AbortPolicy` | `registry/context.rs` | 178 | Done | +| `OperationEnv` trait, `CompositeOperationEnv`, `LocalOperationEnv` | `registry/env.rs` | 598 | Done | +| Service discovery (`services/list`, `services/schema` specs + handlers) | `registry/discovery.rs` | 557 | Done | + +### Not implemented (specced in ADR-017, absent from `src/`) + +| Component | Spec location | Priority | Unblocks | +|-----------|--------------|----------|---------| +| **`CallClient`** (outbound connection opener) | ADR-017 §1 | **Critical** | Runner pattern, bilateral exchange, every downstream consumer | +| **`from_call`** adapter 
(discover + register remote ops) | ADR-017 §3 | **Critical** (depends on `CallClient`) | Bilateral registry exchange, container-service pattern | +| **`OperationAdapter` trait** | ADR-017 §5 | **Enabling** | alknet-http's `from_openapi`/`from_mcp` implementations | +| **`from_jsonschema`** (schema-only registration, no handler) | ADR-017 §5 | Medium | Type validation, composition graph construction without runtime | + +### Partially implemented + +**`CallConnection`** (`protocol/connection.rs:34`) exists and implements the +Layer 2 overlay (`register_imported`, `register_imported_all`, `overlay_env`), +the `call()` / `subscribe()` / `abort()` outbound-call API, and the +`OverlayOperationEnv` trait impl. It is constructed via +`CallConnection::new(connection: Connection)` — meaning it wraps a `Connection` +that was *already established* by the `CallAdapter`'s accept path. + +What's **missing** is the path that *opens* a connection and constructs a +`CallConnection` from the client side: `CallClient::connect(addr, credentials)`. +The `CallConnection` type itself is ready; the `CallClient` that produces it is +not. This confirms ADR-017's design: the dispatch loop is shared, and the +client is the connection-establishment half, not a parallel protocol +implementation. + +## Decisions Needed + +These are the points the specs either left as two-way doors or didn't address. +Each is tagged with door type per ADR-009. Resolving these is the prerequisite +for implementation. + +### DC-1: `CallClient` registry scope — share global vs peer-scoped subset +*(One-way door — security dimension; ADR-017 Consequences flags this)* + +ADR-017 §1 says `CallClient` "has its own operation registry to dispatch +incoming calls from the remote side." The Consequences section flags the +security dimension explicitly: "Sharing the global registry with a `CallClient` +exposes local capabilities to the remote peer... 
A peer-scoped subset must +filter by capability remote-safety, not just operation name." + +Three options: + +- **(a) Share the global registry** — the remote peer can call any `External` + operation. Simplest. But per ADR-017's Consequences, this exposes the local + node's `Capabilities` to the remote peer's calls: `OperationContext.capabilities` + is populated from the local `HandlerRegistration.capabilities`, so the local + node's API keys get used for the remote peer's call. This is a + capability-exposure decision, not just a dispatch decision. +- **(b) Peer-scoped subset** — the `CallClient` holds a filtered view of the + global registry, exposing only operations whose `Capabilities` are marked + remote-safe. Requires a "remote-safe" flag on `HandlerRegistration` or on + `Capabilities` entries (which don't exist today). +- **(c) Separate registry per `CallClient`** — the `CallClient` has its own + registry, populated explicitly at construction. Most restrictive, most + explicit, most boilerplate. + +**Recommendation**: **(b) peer-scoped subset** as the v1 default, with (a) as +an explicit opt-in for trusted peers. Rationale: the runner pattern (worker +connects to hub) and the dispatch pattern (hub connects to worker) both +involve semi-trusted peers where exposing all local capabilities is wrong by +default. The "remote-safe" marking is the new concept this introduces — likely +a `Visibility::External`-adjacent flag or a `Capabilities` entry annotation. +This needs an ADR (likely an amendment to ADR-017 or a new ADR-028) because it +adds a concept to the registration bundle. The exact shape is a two-way door; +the *existence* of the filtering is the one-way door. + +### DC-2: `from_call` re-import on reconnection +*(Two-way door — ADR-017 Assumption 4)* + +ADR-017 Assumption 4: "If the remote operation changes (new schema, renamed), +the imported spec is stale until re-import. 
The assumption is that re-import +happens on reconnection or is triggered explicitly. Hot-swapping imported +specs is a two-way door." + +The question: does `from_call` run automatically on every (re)connection, or +only on explicit trigger? Auto-re-import on reconnect is simpler for the +runner pattern (worker reconnects → hub re-discovers worker's ops +automatically). Explicit trigger is safer (no surprise registry mutations). + +**Recommendation**: **auto-re-import on connection establishment** for the +v1 default. The runner pattern is the primary use case, and runners +reconnecting is the common case — making it explicit adds friction without +clear benefit. The overlay is per-connection (Layer 2, ADR-024), so a +stale overlay dies with the connection; re-import on reconnect is naturally +scoped. Explicit re-import can be added later as a `CallConnection::refresh()` +method if needed. This is a two-way door — record the default, don't spend an +ADR. + +### DC-3: `from_call` namespace collision handling +*(Two-way door — ADR-017 §3 mentions `FromCallConfig` prefix)* + +ADR-017 §3: `FromCallConfig` includes "An optional namespace prefix (to avoid +collisions when importing from multiple remote nodes)." The question is +whether the prefix is mandatory (always applied) or optional (default no +prefix, collision = last-wins or error). + +**Recommendation**: **optional prefix, default no prefix, collision = error**. +A node importing from two remotes that both expose `/container/exec` without +prefixes should fail loudly rather than silently overwrite. The operator adds +prefixes when they know they're importing from multiple sources. This matches +the "default-deny, explicit-allow" posture. Two-way door, no ADR needed. + +### DC-4: `OperationAdapter` trait error type +*(Two-way door — ADR-017 §5 says "specific trait signatures... are two-way +doors")* + +ADR-017 §5 shows the trait as `async fn import(&self) -> Vec<HandlerRegistration>`, +with no error type. 
A real implementation needs to handle failures (HTTP fetch +fails for `from_openapi`, remote unreachable for `from_call`, schema parse +error for `from_jsonschema`). + +**Recommendation**: the trait returns `Result<Vec<HandlerRegistration>, +AdapterError>` where `AdapterError` is a crate-level enum +(`DiscoveryFailed`, `SchemaParse`, `Transport`, `Unauthorized`). The spec's +omission of the error type was an implementation-detail two-way door; the +implementation fills it in. Record in the spec amendment, not a full ADR. + +### DC-5: `from_jsonschema` vs `from_call` separation +*(Confirmed — not a decision, but recorded for clarity)* + +These are distinct, not collapsible: + +| | `from_jsonschema` | `from_call` | +|---|---|---| +| Schema source | Provided directly (caller fetches, passes in) | Discovered over wire (`services/list` + `services/schema`) | +| Handler at call time | None (schema-only, `FromJsonSchema` provenance) | Forwards over QUIC (`FromCall` provenance, leaf) | +| Use case | Type validation, discovery, composition graph construction | Actually invoking remote operations | + +`from_call` = schema import (the `from_jsonschema`-shaped step) + forwarding +handler attachment. Keeping them separate preserves the "schema-only, no +execution" use case (type checking, safe composition planning without runtime). +This is confirmed architecture, not a decision to make. 
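The DC-4 and DC-5 recommendations can be made concrete with a small, std-only sketch. Everything here is illustrative rather than the crate's API: `Registration` is a hypothetical stand-in for `HandlerRegistration`, `JsonSchemaAdapter` is a hypothetical schema-only adapter, the trait is written synchronously so the example needs no async runtime (ADR-017 §5 specifies an async trait), and the emptiness check is a placeholder for real schema parsing. Only the four `AdapterError` variant names and the two provenance names come from the text above.

```rust
use std::fmt;

/// Failure modes named in the DC-4 recommendation (payloads are assumptions).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AdapterError {
    DiscoveryFailed(String),
    SchemaParse(String),
    Transport(String),
    Unauthorized,
}

impl fmt::Display for AdapterError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AdapterError::DiscoveryFailed(m) => write!(f, "discovery failed: {}", m),
            AdapterError::SchemaParse(m) => write!(f, "schema parse error: {}", m),
            AdapterError::Transport(m) => write!(f, "transport error: {}", m),
            AdapterError::Unauthorized => write!(f, "unauthorized"),
        }
    }
}

impl std::error::Error for AdapterError {}

/// Provenance values from the DC-5 table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Provenance {
    FromJsonSchema,
    FromCall,
}

/// Hypothetical stand-in for `HandlerRegistration`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Registration {
    pub name: String,
    pub provenance: Provenance,
    /// False for schema-only entries, true when a forwarding handler is attached.
    pub has_handler: bool,
}

/// Synchronous stand-in for the async `OperationAdapter` contract.
pub trait OperationAdapter {
    fn import(&self) -> Result<Vec<Registration>, AdapterError>;
}

/// Schema-only adapter: the caller has already fetched (name, schema) pairs.
pub struct JsonSchemaAdapter {
    pub schemas: Vec<(String, String)>,
}

impl OperationAdapter for JsonSchemaAdapter {
    fn import(&self) -> Result<Vec<Registration>, AdapterError> {
        self.schemas
            .iter()
            .map(|(name, schema)| {
                // Placeholder validation standing in for real JSON Schema parsing.
                if schema.trim().is_empty() {
                    return Err(AdapterError::SchemaParse(format!(
                        "empty schema for {}",
                        name
                    )));
                }
                Ok(Registration {
                    name: name.clone(),
                    provenance: Provenance::FromJsonSchema,
                    has_handler: false,
                })
            })
            // Collecting into Result stops at the first error.
            .collect()
    }
}
```

Under these assumptions, a `from_call`-style adapter would implement the same trait but emit `Provenance::FromCall` entries with `has_handler: true`, mapping connection failures to `Transport` or `DiscoveryFailed`. Returning one crate-level error enum keeps callers from having to know which transport a given adapter uses.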
+ +## Adapter Location Map (Settled) + +The decomposition principle: **the adapter trait lives where the types live +(alknet-call); the adapter implementations live where their transport +dependencies live.** + +``` +alknet-call (lean — no HTTP client, no HTTP server) +├── OperationAdapter trait (the contract — async, per ADR-017 §5) +├── from_call (QUIC — discovers remote ops via call protocol) +├── from_jsonschema (pure parse — caller fetches the doc, passes it in) +└── CallClient (outbound connection opener — the #1 gap) + +alknet-http (owns HTTP server + HTTP client — separate crate, separate Phase 0) +├── ProtocolHandler for h2/http1.1/h3 (axum server — inbound HTTP) +├── from_openapi (parse OpenAPI doc + reqwest forwarding handler) +├── to_openapi (generate OpenAPI doc from local registry) +├── from_mcp (feature-gated) (import remote MCP tools over streamable HTTP — reqwest) +└── to_mcp (feature-gated) (expose local ops as MCP tools over streamable HTTP — axum) + +Not built: MCP stdio transport + — stdio = spawn arbitrary executable = built-in RCE ("download untrusted MCP servers") + — streamable HTTP is the only supported MCP transport in alknet + — recorded as an explicit security position, not a feature gap +``` + +**Why this works**: alknet-call never sees the HTTP client. The +`from_openapi`/`from_mcp` forwarding handlers are opaque `Arc<dyn Handler>` +from the registry's perspective — constructed by `alknet_http::from_openapi()` +at registration time, stored in `HandlerRegistration`, dispatched by the +`CallAdapter` which doesn't know reqwest is involved. alknet-call stays lean +(no reqwest, no axum); alknet-http owns both HTTP directions. + +**ADR-003 dependency note**: alknet-http implementing `from_openapi`/`from_mcp` +means alknet-http depends on alknet-call (for `OperationSpec`, `Handler`, +`HandlerRegistration`, `OperationAdapter`). 
ADR-003's rule is "no handler crate +depends on another handler crate" — but alknet-call is both a handler *and* the +protocol foundation that alknet-agent and alknet-napi already consume. alknet-http +depending on alknet-call is "HTTP uses the call protocol types," not "HTTP depends +on SSH." This is within the spirit of ADR-003 (alknet-call is protocol-foundation, +not a peer handler), but should be noted explicitly in the alknet-http spec +and possibly as a one-line amendment to ADR-003 clarifying that alknet-call is a +protocol-foundation crate. + +## The No-Env-Vars Invariant (Architectural Mechanism) + +This is the architectural fix for the env-var problem in downstream consumers +like aisdk (the Rust port of Vercel's AI SDK at `/workspace/aisdk/`, 75 +providers all reading `std::env::var("OPENAI_API_KEY")` in their `Default` +impls). The fix is **not** to modify aisdk — it's that the env-var path is +never taken because the assembly layer never calls `Default::default()`. + +The credential injection path: + +``` +vault (seed) + → assembly layer (derive + decrypt at startup, per ADR-014/019/025) + → Capabilities (non-serializable, zeroized, immutable — ADR-014) + → HandlerRegistration.capabilities (ADR-022, the registration bundle) + → OperationContext.capabilities (per-request, populated by dispatch + path from the bundle — ADR-022 §6) + → from_openapi handler reads context.capabilities.get("openai") + → injects into HTTP Authorization header + → reqwest request goes out with vault-derived credential +``` + +The `from_openapi`/`from_mcp` forwarding handler (living in alknet-http) is the +**credential injection point**. It reads from `context.capabilities`, not from +`std::env::var`. 
aisdk's `Default` impls reading env vars are simply never +called — the assembly layer constructs providers with vault-derived +credentials through the builder API, or the provider's HTTP calls are routed +through `from_openapi` operations that carry the credential in `Capabilities`. + +**This must be a spec-level invariant in alknet-call, not a runtime convention.** +The dispatch path (`build_root_context` and `OperationEnv::invoke()` per +ADR-022 §5) populates `OperationContext.capabilities` from the registration +bundle. The invariant is: *no handler reads outbound credentials from any +source other than `OperationContext.capabilities`.* This is already the +architectural intent of ADR-014; the completion work should make it an explicit, +documented invariant that the `from_openapi`/`from_mcp` handler implementations +(in alknet-http) are verified against. + +## The "Exchange of Operations" Pattern (Runner / Container Service) + +This is the canonical downstream pattern alknet-call completion unblocks, made +explicit here so Phase 1 specs can reference it. Concrete example: the +container service at `/workspace/@alkdev/dispatch` (axum + russh SSH client for +"reverse git runner" over Docker/vast.ai) gets rewritten as a call-protocol +service. + +### Bilateral exchange + +``` +Container service (runs on a vast.ai/docker instance): + Defines Local ops: /container/exec, /container/list, /container/logs... 
+ (real handlers — calls bollard or vast.ai API) + Connects to hub as a CallClient (outbound connection — runner pattern) + +Hub (central server): + Runs CallAdapter (server) on alknet/call (already implemented) + When the container service connects: + hub runs from_call → discovers /container/* via services/list + services/schema + registers them as FromCall provenance (leaf, forwarding handlers) in the + connection's Layer 2 overlay (ADR-024) + Now the hub (or anything connected to the hub) can call /container/exec + The from_call handler forwards over the connection back to the container service + +Bilateral: the container service ALSO runs from_call against the hub, + discovers the hub's External ops, and can call them. + Connection direction (container → hub) is independent of call direction + (both can call each other) per ADR-017 §2. +``` + +### What this requires + +1. **`CallClient`** — the container service uses it to open the outbound + connection to the hub. This is the #1 gap. +2. **`from_call`** — both sides run it to populate their Layer 2 overlays with + the other side's `External` ops. This is the #2 gap. +3. **`OperationAdapter` trait** — `from_call` implements it. This is the #3 gap + (enabling, not blocking — `from_call` can be built as a free function before + the trait exists, but the trait is needed for alknet-http's adapters). + +### Why the container service doesn't need alknet-ssh + +The current dispatch uses SSH (`channel_open_direct_tcpip`) as the transport +for the "connect back to hub" pattern. Under the call protocol, the container +service is a `CallClient` that dials the hub's `alknet/call` ALPN directly over +QUIC — no SSH in the loop. SSH port forwarding becomes the *transitional* +mechanism for targets that can't run a call-protocol client (the alknet-ssh +phase-0 findings document this transition). Once the container service runs a +`CallClient`, SSH is out of the path entirely. 
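The bilateral exchange in this section, together with the DC-3 default (optional prefix, collision = error), can be sketched with an in-memory model. This is a hedged illustration and not the alknet-call API: `Node`, `Route`, `import_from`, and `exchange` are hypothetical names, `list_services` stands in for the `services/list` + `services/schema` discovery calls, and the `overlay` map stands in for a connection's Layer 2 overlay. No networking is involved.

```rust
use std::collections::{BTreeSet, HashMap};

/// Where a call to an operation name ends up.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Route {
    Local,
    /// Forwarded to `peer` under its original remote name `op`.
    Forward { peer: String, op: String },
    Unknown,
}

/// Hypothetical model of one side of a call-protocol connection.
pub struct Node {
    pub name: String,
    /// Operations this node exposes to peers (its `External` ops).
    pub external_ops: BTreeSet<String>,
    /// Imported name -> (peer name, remote op name); stand-in for the Layer 2 overlay.
    pub overlay: HashMap<String, (String, String)>,
}

impl Node {
    pub fn new(name: &str, ops: &[&str]) -> Node {
        Node {
            name: name.to_string(),
            external_ops: ops.iter().map(|o| o.to_string()).collect(),
            overlay: HashMap::new(),
        }
    }

    /// Stand-in for discovery: only locally defined ops are advertised.
    pub fn list_services(&self) -> Vec<String> {
        self.external_ops.iter().cloned().collect()
    }

    /// Stand-in for `from_call`: import the peer's ops, failing loudly on collision.
    pub fn import_from(&mut self, peer: &Node, prefix: Option<&str>) -> Result<usize, String> {
        let entries: Vec<(String, String)> = peer
            .list_services()
            .into_iter()
            .map(|op| (format!("{}{}", prefix.unwrap_or(""), op), op))
            .collect();
        // Validate everything first so a failed import leaves the overlay untouched.
        for (local_name, _) in &entries {
            if self.external_ops.contains(local_name) || self.overlay.contains_key(local_name) {
                return Err(format!("collision on {}", local_name));
            }
        }
        let count = entries.len();
        for (local_name, remote_op) in entries {
            self.overlay.insert(local_name, (peer.name.clone(), remote_op));
        }
        Ok(count)
    }

    pub fn route(&self, op: &str) -> Route {
        if self.external_ops.contains(op) {
            Route::Local
        } else if let Some((peer, remote_op)) = self.overlay.get(op) {
            Route::Forward { peer: peer.clone(), op: remote_op.clone() }
        } else {
            Route::Unknown
        }
    }
}

/// One connection, two imports: who dialed does not decide who may call whom.
pub fn exchange(dialer: &mut Node, listener: &mut Node) -> Result<(), String> {
    listener.import_from(dialer, None)?;
    dialer.import_from(listener, None)?;
    Ok(())
}
```

In this model a container node that dials the hub ends up callable from the hub (`/container/exec` routes as `Forward` to the container), while the container can in turn route the hub's ops. A second container exposing the same names is rejected unless the operator supplies a prefix such as `/c2`.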
+ +This is the "dev runner" pattern: a call-protocol client that connects back to +a hub and exposes the core dev tools (bash, fs, etc.) as operations. The other +tools (web search, etc.) plug into the call protocol as additional operations. +The agent service (alknet-agent, downstream) is the consumer that orchestrates +these via `env.invoke()`. + +## Implementation Priority Order + +Based on the gap analysis and the downstream unblock chain: + +1. **`CallClient`** (critical) — outbound connection opener. Without it, no + runner, no container service, no bilateral exchange. Reuses the existing + `CallConnection` (which is already implemented) for the dispatch loop; adds + only the connection-establishment + credential-handling half. This is the + single highest-value piece of work in the entire alknet-call completion. + +2. **`from_call`** (critical, depends on `CallClient`) — discovers remote ops + via `services/list` + `services/schema`, constructs `HandlerRegistration` + bundles with `FromCall` provenance, registers them in the connection's + Layer 2 overlay via `CallConnection::register_imported_all()`. The + discovery mechanism (`services/list` / `services/schema` specs + handlers) + is already implemented in `registry/discovery.rs`; `from_call` is the + client-side consumer of that discovery API. + +3. **`OperationAdapter` trait** (enabling) — the async trait + (`async fn import(&self) -> Result<Vec<HandlerRegistration>, AdapterError>`) + that `from_call`, `from_openapi`, `from_mcp`, `from_jsonschema` all + implement. Needed before alknet-http's adapter implementations can be built. + Small, standalone, unblocks alknet-http Phase 1. + +4. **`from_jsonschema`** (medium, standalone) — schema-only registration, no + handler. Useful for validation/discovery without execution. Distinct from + `from_call` (no forwarding behavior). Small. + +5. **DC-1 resolution** (peer-scoped registry filtering) — the security + dimension of `CallClient`'s registry. 
Can be addressed in parallel with #1 + (it's a filtering layer on the registry the `CallClient` exposes, not a + blocker for the connection-establishment work). Needs an ADR. + +## What This Completion Unblocks + +| Downstream crate | What it needs from alknet-call | Status without completion | +|-------------------|-------------------------------|--------------------------| +| alknet-http | `OperationAdapter` trait (to implement `from_openapi`/`from_mcp`) | Blocked — can't define HTTP-backed adapters without the trait | +| alknet-ssh | Stable alknet-call types (no adapter dependency) | Not blocked — ssh depends on alknet-core, not alknet-call's adapters. Can proceed in parallel. | +| alknet-agent | `CallClient` (tool dispatch), `from_call` (remote tool import), `OperationAdapter` (provider adapters) | Blocked on `CallClient` + `from_call` | +| Container service (dispatch rewrite) | `CallClient` + `from_call` | Blocked — this is the primary consumer | +| Runner pattern (dev runner, opencode runner) | `CallClient` + `from_call` | Blocked — the runner IS a `CallClient` | +| alknet-napi | `CallClient` (Node.js calls remote ops) | Blocked — NAPI projects `CallClient` to JS | + +## Open Questions to Carry into Phase 1 + +- **OQ-CALL-01 (peer-scoped registry filtering shape)**: the exact mechanism + for marking `Capabilities` entries or `HandlerRegistration`s as remote-safe + (DC-1). Needs an ADR. The *existence* of filtering is one-way; the shape is + two-way. +- **OQ-CALL-02 (`OperationAdapter` error type)**: `AdapterError` enum shape + (DC-4). Two-way door; record in spec amendment. +- **OQ-CALL-03 (`from_call` re-import trigger)**: auto-on-reconnect vs + explicit (DC-2). Two-way door; recommend auto-on-reconnect as default. +- **OQ-CALL-04 (namespace collision behavior)**: error on collision (DC-3). + Two-way door; recommend error as default. + +## Next Steps + +1. 
**Resolve DC-1** (peer-scoped registry filtering) — this is the one decision + that needs an ADR before `CallClient` can be implemented correctly. The + others (DC-2, DC-3, DC-4) are two-way-door defaults that can be set in the + spec amendment and revisited during implementation. +2. **Amend the call spec** (`call-protocol.md`, `operation-registry.md`) to + capture: the `CallClient` gap, the adapter location map, the no-env-vars + invariant, the exchange-of-operations pattern, and the DC-2/3/4 defaults. +3. **Implement `CallClient`** — the highest-value piece. Reuses `CallConnection` + for the dispatch loop; adds connection establishment + credentials. +4. **Implement `from_call`** — consumes the already-implemented + `services/list` + `services/schema` discovery API. +5. **Implement `OperationAdapter` trait** — small, unblocks alknet-http. +6. **Implement `from_jsonschema`** — small, standalone. + +## References + +- `docs/architecture/decisions/017-call-protocol-client-and-adapter-contract.md` — + the client/adapter contract (specced, partially unimplemented) +- `docs/architecture/decisions/022-handler-registration-provenance-and-composition-authority.md` — + registration bundle, provenance, composition authority +- `docs/architecture/decisions/024-operation-registry-layering.md` — + Layer 0/1/2 overlay model +- `docs/architecture/decisions/014-secret-material-flow-and-capability-injection.md` — + the no-env-vars invariant's foundation +- `docs/architecture/crates/call/call-protocol.md` — `CallConnection`, Layer 2 + overlay, `compose_root_env` +- `docs/architecture/crates/call/operation-registry.md` — adapter provenance, + `Capabilities` injection +- `crates/alknet-call/src/` — implementation (verified state above) +- `/workspace/@alkdev/operations/` — TypeScript prior art (`from_openapi.ts`, + `from_mcp.ts`, `from_schema.ts`, `scanner.ts`) +- `/workspace/@alkdev/dispatch/` — concrete downstream consumer (container + service / "reverse git runner") this 
completion unblocks +- `/workspace/aisdk/` — downstream consumer (Rust port of Vercel AI SDK); the + no-env-vars invariant makes its `std::env::var` reads unreachable +- `/workspace/rust-sdk/` — MCP Rust SDK (rmcp); streamable HTTP transport for + alknet-http's `from_mcp`/`to_mcp` (separate crate, separate Phase 0) +- `docs/research/alknet-ssh/phase-0-findings.md` — alknet-ssh Phase 0; confirms + ssh depends on alknet-core not alknet-call's adapters, so it proceeds in + parallel with this completion \ No newline at end of file