
glm-5.2 79d8561bb4 docs(research): alknet-call completion gap analysis — CallClient + from_call + OperationAdapter

Gap analysis for completing alknet-call: the server-side core (~5.7k lines,
159 tests) is implemented, but the client side (CallClient), the bilateral
exchange mechanism (from_call), and the adapter contract (OperationAdapter
trait) are specced in ADR-017 and unimplemented.

Records: implementation state (verified against src/), 5 decisions needed
(peer-scoped registry filtering as the load-bearing one), the settled adapter
location map (trait + from_call + from_jsonschema in alknet-call; from_openapi/
from_mcp in alknet-http), the no-env-vars invariant (Capabilities → from_openapi
handler → HTTP header), and the exchange-of-operations runner pattern with
dispatch as the concrete downstream consumer.

2026-06-25 12:44:49 +00:00


status	last_updated
draft	2026-06-25

alknet-call Completion — Gap Analysis

This document captures the gap between the existing alknet-call architecture (ADRs 005/007/012/014/015/016/017/022/023/024, specs in docs/architecture/crates/call/) and the current implementation (crates/alknet-call/src/), the decisions needed before implementation can proceed, and the downstream crates this completion unblocks.

Unlike the alknet-ssh phase-0 findings (a true exploration doc for a crate with no existing architecture), this is a gap analysis + decision record for completing existing architecture. The specs are largely settled; the work is implementing what's specced and resolving a small number of decisions the specs left as two-way doors or didn't address.

Implementation State (Verified)

The call protocol's server-side core is implemented and tested (159 tests, passing). What's missing is the client side and the adapter contract.

Implemented (~5,773 lines, 159 tests)

| Component | File | Lines | Status |
| --- | --- | --- | --- |
| `CallAdapter` (ProtocolHandler for `alknet/call`) | `protocol/adapter.rs` | 1,051 | Done |
| `CallConnection` (Layer 2 overlay, call/subscribe/abort) | `protocol/connection.rs` | 780 | Partial (see below) |
| Wire framing (`EventEnvelope`, `FrameFramedReader/Writer`) | `protocol/wire.rs` | 544 | Done |
| `PendingRequestMap` (ID-based correlation) | `protocol/pending.rs` | 584 | Done |
| Abort cascade | `protocol/abort.rs` | 393 | Done |
| `OperationRegistry`, `HandlerRegistration`, builder | `registry/registration.rs` | 734 | Done |
| `OperationSpec`, `AccessControl`, `Visibility`, `ErrorDefinition` | `registry/spec.rs` | 321 | Done |
| `OperationContext`, `ScopedOperationEnv`, `AbortPolicy` | `registry/context.rs` | 178 | Done |
| `OperationEnv` trait, `CompositeOperationEnv`, `LocalOperationEnv` | `registry/env.rs` | 598 | Done |
| Service discovery (`services/list`, `services/schema` specs + handlers) | `registry/discovery.rs` | 557 | Done |

Not implemented (specced in ADR-017, absent from `src/`)

| Component | Spec location | Priority | Unblocks |
| --- | --- | --- | --- |
| `CallClient` (outbound connection opener) | ADR-017 §1 | Critical | Runner pattern, bilateral exchange, every downstream consumer |
| `from_call` adapter (discover + register remote ops) | ADR-017 §3 | Critical (depends on `CallClient`) | Bilateral registry exchange, container-service pattern |
| `OperationAdapter` trait | ADR-017 §5 | Enabling | alknet-http's `from_openapi`/`from_mcp` implementations |
| `from_jsonschema` (schema-only registration, no handler) | ADR-017 §5 | Medium | Type validation, composition graph construction without runtime |

Partially implemented

CallConnection (protocol/connection.rs:34) exists and implements the Layer 2 overlay (register_imported, register_imported_all, overlay_env), the call() / subscribe() / abort() outbound-call API, and the OverlayOperationEnv trait impl. It is constructed via CallConnection::new(connection: Connection) — meaning it wraps a Connection that was already established by the CallAdapter's accept path.

What's missing is the path that opens a connection and constructs a CallConnection from the client side: CallClient::connect(addr, credentials). The CallConnection type itself is ready; the CallClient that produces it is not. This confirms ADR-017's design: the dispatch loop is shared, and the client is the connection-establishment half, not a parallel protocol implementation.
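
The split described above (the client only establishes the connection, then hands it to the existing `CallConnection`) can be sketched as follows. This is a minimal, synchronous illustration under stated assumptions, not the crate's API: the `Dial` trait and `FakeDialer` are hypothetical, `Connection` and `CallConnection` are stand-ins for the real types, and the real `CallClient::connect` would be async and dial QUIC.

```rust
use std::collections::BTreeMap;

/// Stand-in for an established transport connection (QUIC in the real crate).
#[derive(Debug)]
pub struct Connection {
    pub peer: String,
}

/// Stand-in for the existing `CallConnection`: it wraps a `Connection`
/// and owns the per-connection (Layer 2) overlay.
#[derive(Debug)]
pub struct CallConnection {
    pub connection: Connection,
    pub overlay: BTreeMap<String, String>,
}

impl CallConnection {
    pub fn new(connection: Connection) -> Self {
        Self { connection, overlay: BTreeMap::new() }
    }
}

/// Hypothetical seam for opening a connection (an assumption of this sketch).
pub trait Dial {
    fn dial(&self, addr: &str, credentials: &str) -> Result<Connection, String>;
}

/// The client contributes connection establishment only; everything after
/// that is the same `CallConnection` the accept path already constructs.
pub struct CallClient<D: Dial> {
    dialer: D,
}

impl<D: Dial> CallClient<D> {
    pub fn new(dialer: D) -> Self {
        Self { dialer }
    }

    pub fn connect(&self, addr: &str, credentials: &str) -> Result<CallConnection, String> {
        let connection = self.dialer.dial(addr, credentials)?;
        Ok(CallConnection::new(connection))
    }
}

/// In-memory dialer for the example: rejects empty credentials.
pub struct FakeDialer;

impl Dial for FakeDialer {
    fn dial(&self, addr: &str, credentials: &str) -> Result<Connection, String> {
        if credentials.is_empty() {
            return Err("missing credentials".to_string());
        }
        Ok(Connection { peer: addr.to_string() })
    }
}
```

Calling `CallClient::new(FakeDialer).connect("hub.example:4433", "token")` yields a `CallConnection` with an empty overlay, which is the same state the accept path produces before any import runs.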

Decisions Needed

These are the points the specs either left as two-way doors or didn't address. Each is tagged with door type per ADR-009. Resolving these is the prerequisite for implementation.

DC-1: `CallClient` registry scope — share global vs peer-scoped subset

(One-way door — security dimension; ADR-017 Consequences flags this)

ADR-017 §1 says CallClient "has its own operation registry to dispatch incoming calls from the remote side." The Consequences section flags the security dimension explicitly: "Sharing the global registry with a CallClient exposes local capabilities to the remote peer... A peer-scoped subset must filter by capability remote-safety, not just operation name."

Three options:

(a) Share the global registry — the remote peer can call any External operation. Simplest. But per ADR-017's Consequences, this exposes the local node's Capabilities to the remote peer's calls: OperationContext.capabilities is populated from the local HandlerRegistration.capabilities, so the local node's API keys get used for the remote peer's call. This is a capability-exposure decision, not just a dispatch decision.
(b) Peer-scoped subset — the CallClient holds a filtered view of the global registry, exposing only operations whose Capabilities are marked remote-safe. Requires a "remote-safe" flag on HandlerRegistration or on Capabilities entries (which don't exist today).
(c) Separate registry per CallClient — the CallClient has its own registry, populated explicitly at construction. Most restrictive, most explicit, most boilerplate.

Recommendation: (b) peer-scoped subset as the v1 default, with (a) as an explicit opt-in for trusted peers. Rationale: the runner pattern (worker connects to hub) and the dispatch pattern (hub connects to worker) both involve semi-trusted peers where exposing all local capabilities is wrong by default. The "remote-safe" marking is the new concept this introduces — likely a Visibility::External-adjacent flag or a Capabilities entry annotation. This needs an ADR (likely an amendment to ADR-017 or a new ADR-028) because it adds a concept to the registration bundle. The exact shape is a two-way door; the existence of the filtering is the one-way door.
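
To make the recommendation concrete, here is a minimal sketch of a peer-scoped view with an explicit opt-in for trusted peers. Everything in it is an assumption for illustration: the `remote_safe` flag does not exist today, `PeerTrust` is a hypothetical name, and the real filter would also have to respect `Visibility` and operate on `HandlerRegistration` bundles rather than bare names.

```rust
use std::collections::BTreeMap;

/// How much of the local registry a given peer may see (hypothetical).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum PeerTrust {
    /// Option (b): only operations marked remote-safe.
    Scoped,
    /// Option (a): explicit opt-in, the whole registry.
    Trusted,
}

#[derive(Debug)]
pub struct Registration {
    pub name: String,
    /// Hypothetical DC-1 marker; its final shape is a two-way door.
    pub remote_safe: bool,
}

#[derive(Default)]
pub struct Registry {
    ops: BTreeMap<String, Registration>,
}

impl Registry {
    pub fn register(&mut self, name: &str, remote_safe: bool) {
        let reg = Registration { name: name.to_string(), remote_safe };
        self.ops.insert(name.to_string(), reg);
    }

    /// Names a remote peer is allowed to dispatch to, in sorted order.
    pub fn visible_to(&self, trust: PeerTrust) -> Vec<&str> {
        self.ops
            .values()
            .filter(|r| trust == PeerTrust::Trusted || r.remote_safe)
            .map(|r| r.name.as_str())
            .collect()
    }
}
```

With `/llm/complete` registered as not remote-safe and `/time/now` as remote-safe, a `Scoped` peer sees only `/time/now`, while a `Trusted` peer sees both. The filter is the one-way door; the flag's location is not.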

DC-2: `from_call` re-import on reconnection

(Two-way door — ADR-017 Assumption 4)

ADR-017 Assumption 4: "If the remote operation changes (new schema, renamed), the imported spec is stale until re-import. The assumption is that re-import happens on reconnection or is triggered explicitly. Hot-swapping imported specs is a two-way door."

The question: does from_call run automatically on every (re)connection, or only on explicit trigger? Auto-re-import on reconnect is simpler for the runner pattern (worker reconnects → hub re-discovers worker's ops automatically). Explicit trigger is safer (no surprise registry mutations).

Recommendation: auto-re-import on connection establishment for the v1 default. The runner pattern is the primary use case, and runners reconnecting is the common case — making it explicit adds friction without clear benefit. The overlay is per-connection (Layer 2, ADR-024), so a stale overlay dies with the connection; re-import on reconnect is naturally scoped. Explicit re-import can be added later as a CallConnection::refresh() method if needed. This is a two-way door — record the default, don't spend an ADR.
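
A small sketch of why re-import on establishment is naturally scoped: the overlay lives inside the connection value, so a reconnect starts from a fresh overlay. The types here are stand-ins, `Remote` is a hypothetical substitute for the `services/list` response, and `refresh` is the possible later addition mentioned above, not an existing method.

```rust
use std::collections::BTreeSet;

/// What the remote currently advertises (stand-in for `services/list`).
pub struct Remote {
    pub ops: Vec<String>,
}

/// Stand-in for `CallConnection`; the overlay is dropped with the connection.
pub struct CallConnection {
    overlay: BTreeSet<String>,
}

impl CallConnection {
    /// DC-2 default: establishing a connection runs the import once.
    pub fn establish(remote: &Remote) -> Self {
        let mut conn = Self { overlay: BTreeSet::new() };
        conn.refresh(remote);
        conn
    }

    /// Hypothetical explicit re-import; replaces the overlay wholesale.
    pub fn refresh(&mut self, remote: &Remote) {
        self.overlay = remote.ops.iter().cloned().collect();
    }

    /// Imported operation names, in sorted order.
    pub fn imported(&self) -> Vec<&str> {
        self.overlay.iter().map(String::as_str).collect()
    }
}
```

If the remote adds `/container/logs` between two connections, the second `establish` call sees it without any explicit trigger, and nothing from the first overlay survives the drop.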

DC-3: `from_call` namespace collision handling

(Two-way door — ADR-017 §3 mentions FromCallConfig prefix)

ADR-017 §3: FromCallConfig includes "An optional namespace prefix (to avoid collisions when importing from multiple remote nodes)." The question is whether the prefix is mandatory (always applied) or optional (default no prefix, collision = last-wins or error).

Recommendation: optional prefix, default no prefix, collision = error. A node importing from two remotes that both expose /container/exec without prefixes should fail loudly rather than silently overwrite. The operator adds prefixes when they know they're importing from multiple sources. This matches the "default-deny, explicit-allow" posture. Two-way door, no ADR needed.
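
The recommended default (optional prefix, collision is an error) can be sketched like this. The `Overlay` type, the `Collision` error, and the choice to concatenate the prefix directly onto the operation path are assumptions for illustration; ADR-017 only states that `FromCallConfig` carries an optional namespace prefix.

```rust
use std::collections::BTreeMap;

/// Returned when an import would overwrite an existing operation.
#[derive(Debug, PartialEq)]
pub struct Collision {
    pub name: String,
}

#[derive(Default)]
pub struct Overlay {
    /// Operation name mapped to the remote it was imported from.
    ops: BTreeMap<String, String>,
}

impl Overlay {
    pub fn import(
        &mut self,
        remote: &str,
        prefix: Option<&str>,
        names: &[&str],
    ) -> Result<(), Collision> {
        // Resolve final names first so a failed import changes nothing.
        let resolved: Vec<String> = names
            .iter()
            .map(|n| match prefix {
                Some(p) => format!("{p}{n}"),
                None => (*n).to_string(),
            })
            .collect();
        if let Some(name) = resolved.iter().find(|n| self.ops.contains_key(*n)) {
            return Err(Collision { name: name.clone() });
        }
        for name in resolved {
            self.ops.insert(name, remote.to_string());
        }
        Ok(())
    }

    pub fn owner(&self, name: &str) -> Option<&str> {
        self.ops.get(name).map(String::as_str)
    }
}
```

Importing `/container/exec` from two remotes without a prefix fails on the second import and leaves the first owner in place; giving the second remote a prefix such as `/b` lets both coexist.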

DC-4: `OperationAdapter` trait error type

(Two-way door — ADR-017 §5 says "specific trait signatures... are two-way doors")

ADR-017 §5 shows the trait as async fn import(&self) -> Vec<HandlerRegistration>, with no error type. A real implementation needs to handle failures (HTTP fetch fails for from_openapi, remote unreachable for from_call, schema parse error for from_jsonschema).

Recommendation: the trait returns Result<Vec<HandlerRegistration>, AdapterError> where AdapterError is a crate-level enum (DiscoveryFailed, SchemaParse, Transport, Unauthorized). The spec's omission of the error type was an implementation-detail two-way door; the implementation fills it in. Record in the spec amendment, not a full ADR.
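
One possible shape for the error type and the fallible trait is sketched below. The variant names come from the recommendation above; the `String` payloads, the `Display` messages, and the `LineListAdapter` example are assumptions. The trait is written synchronously so the sketch runs with the standard library alone, whereas ADR-017 §5 specifies an async `import`.

```rust
use std::error::Error;
use std::fmt;

#[derive(Debug, PartialEq)]
pub enum AdapterError {
    DiscoveryFailed(String),
    SchemaParse(String),
    Transport(String),
    Unauthorized,
}

impl fmt::Display for AdapterError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            AdapterError::DiscoveryFailed(why) => write!(f, "discovery failed: {why}"),
            AdapterError::SchemaParse(why) => write!(f, "schema parse error: {why}"),
            AdapterError::Transport(why) => write!(f, "transport error: {why}"),
            AdapterError::Unauthorized => write!(f, "unauthorized"),
        }
    }
}

impl Error for AdapterError {}

/// Stand-in for the real registration bundle.
#[derive(Debug, PartialEq)]
pub struct HandlerRegistration {
    pub name: String,
}

/// Synchronous simplification of the adapter contract.
pub trait OperationAdapter {
    fn import(&self) -> Result<Vec<HandlerRegistration>, AdapterError>;
}

/// Hypothetical adapter: every line of `doc` must be an operation path.
pub struct LineListAdapter {
    pub doc: String,
}

impl OperationAdapter for LineListAdapter {
    fn import(&self) -> Result<Vec<HandlerRegistration>, AdapterError> {
        self.doc
            .lines()
            .map(|line| {
                if line.starts_with('/') {
                    Ok(HandlerRegistration { name: line.to_string() })
                } else {
                    Err(AdapterError::SchemaParse(format!("not an operation path: {line}")))
                }
            })
            .collect()
    }
}
```

Collecting into `Result<Vec<_>, _>` stops at the first bad entry, so a partially valid document produces an error rather than a partial registration set.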

DC-5: `from_jsonschema` vs `from_call` separation

(Confirmed — not a decision, but recorded for clarity)

These are distinct, not collapsible:

|  | `from_jsonschema` | `from_call` |
| --- | --- | --- |
| Schema source | Provided directly (caller fetches, passes in) | Discovered over wire (`services/list` + `services/schema`) |
| Handler at call time | None (schema-only, `FromJsonSchema` provenance) | Forwards over QUIC (`FromCall` provenance, leaf) |
| Use case | Type validation, discovery, composition graph construction | Actually invoking remote operations |

from_call = schema import (the from_jsonschema-shaped step) + forwarding handler attachment. Keeping them separate preserves the "schema-only, no execution" use case (type checking, safe composition planning without runtime). This is confirmed architecture, not a decision to make.
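
The relationship "`from_call` is the schema-import step plus a forwarding handler" can be illustrated with a small sketch. The field layout, the `Forward` type alias, and the free-function signatures are assumptions; only the two provenance names come from the table above.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Provenance {
    FromJsonSchema,
    FromCall,
}

/// Hypothetical forwarding handler: takes an input payload, returns a reply.
pub type Forward = Box<dyn Fn(&str) -> String>;

pub struct HandlerRegistration {
    pub name: String,
    pub schema: String,
    pub provenance: Provenance,
    /// `None` means schema-only: the operation can be inspected, not invoked.
    pub handler: Option<Forward>,
}

/// Schema-only registration: no handler is attached.
pub fn from_jsonschema(name: &str, schema: &str) -> HandlerRegistration {
    HandlerRegistration {
        name: name.to_string(),
        schema: schema.to_string(),
        provenance: Provenance::FromJsonSchema,
        handler: None,
    }
}

/// Same schema import, then a forwarding handler is attached.
pub fn from_call(name: &str, schema: &str, forward: Forward) -> HandlerRegistration {
    let mut reg = from_jsonschema(name, schema);
    reg.provenance = Provenance::FromCall;
    reg.handler = Some(forward);
    reg
}

/// Returns `None` for schema-only registrations.
pub fn invoke(reg: &HandlerRegistration, input: &str) -> Option<String> {
    reg.handler.as_ref().map(|h| h(input))
}
```

A registration built by `from_jsonschema` carries the schema for validation and planning but returns `None` from `invoke`, which is the "no execution" guarantee the separation preserves.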

Adapter Location Map (Settled)

The decomposition principle: the adapter trait lives where the types live (alknet-call); the adapter implementations live where their transport dependencies live.

alknet-call (lean — no HTTP client, no HTTP server)
├── OperationAdapter trait          (the contract — async, per ADR-017 §5)
├── from_call                     (QUIC — discovers remote ops via call protocol)
├── from_jsonschema               (pure parse — caller fetches the doc, passes it in)
└── CallClient                    (outbound connection opener — the #1 gap)

alknet-http (owns HTTP server + HTTP client — separate crate, separate Phase 0)
├── ProtocolHandler for h2/http1.1/h3   (axum server — inbound HTTP)
├── from_openapi                   (parse OpenAPI doc + reqwest forwarding handler)
├── to_openapi                     (generate OpenAPI doc from local registry)
├── from_mcp  (feature-gated)       (import remote MCP tools over streamable HTTP — reqwest)
└── to_mcp    (feature-gated)       (expose local ops as MCP tools over streamable HTTP — axum)

Not built: MCP stdio transport
  — stdio = spawn arbitrary executable = built-in RCE ("download untrusted MCP servers")
  — streamable HTTP is the only supported MCP transport in alknet
  — recorded as an explicit security position, not a feature gap

Why this works: alknet-call never sees the HTTP client. The from_openapi/from_mcp forwarding handlers are opaque Arc<dyn Handler> from the registry's perspective — constructed by alknet_http::from_openapi() at registration time, stored in HandlerRegistration, dispatched by the CallAdapter which doesn't know reqwest is involved. alknet-call stays lean (no reqwest, no axum); alknet-http owns both HTTP directions.
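
The crate boundary can be sketched with two modules standing in for the two crates. This is an illustration under assumptions: the `Handler` trait here is synchronous and string-based, the registry is a plain map, and `HttpForwarder` only formats a description of the request where a real handler would send it with an HTTP client.

```rust
/// Stand-in for alknet-call: knows `Handler`, knows nothing about HTTP.
pub mod call {
    use std::collections::HashMap;
    use std::sync::Arc;

    pub trait Handler: Send + Sync {
        fn handle(&self, input: &str) -> String;
    }

    #[derive(Default)]
    pub struct Registry {
        handlers: HashMap<String, Arc<dyn Handler>>,
    }

    impl Registry {
        pub fn register(&mut self, name: &str, handler: Arc<dyn Handler>) {
            self.handlers.insert(name.to_string(), handler);
        }

        /// Dispatch sees only the trait object, never the concrete type.
        pub fn dispatch(&self, name: &str, input: &str) -> Option<String> {
            self.handlers.get(name).map(|h| h.handle(input))
        }
    }
}

/// Stand-in for alknet-http: depends on `call`, owns the HTTP details.
pub mod http {
    use super::call::Handler;
    use std::sync::Arc;

    struct HttpForwarder {
        url: String,
    }

    impl Handler for HttpForwarder {
        fn handle(&self, input: &str) -> String {
            // A real handler would issue the request; this one describes it.
            format!("POST {} body={}", self.url, input)
        }
    }

    /// Builds the opaque handler at registration time.
    pub fn from_openapi(url: &str) -> Arc<dyn Handler> {
        Arc::new(HttpForwarder { url: url.to_string() })
    }
}
```

Note that `HttpForwarder` is private to the `http` module: the `call` side cannot name it even if it wanted to, which mirrors the dependency direction described in the paragraph above.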

ADR-003 dependency note: alknet-http implementing from_openapi/from_mcp means alknet-http depends on alknet-call (for OperationSpec, Handler, HandlerRegistration, OperationAdapter). ADR-003's rule is "no handler crate depends on another handler crate" — but alknet-call is both a handler and the protocol foundation that alknet-agent and alknet-napi already consume. alknet-http depending on alknet-call is "HTTP uses the call protocol types," not "HTTP depends on SSH." This is within the spirit of ADR-003 (alknet-call is protocol-foundation, not a peer handler), but should be noted explicitly in the alknet-http spec and possibly as a one-line amendment to ADR-003 clarifying that alknet-call is a protocol-foundation crate.

The No-Env-Vars Invariant (Architectural Mechanism)

This is the architectural fix for the env-var problem in downstream consumers like aisdk (the Rust port of Vercel's AI SDK at /workspace/aisdk/, 75 providers all reading std::env::var("OPENAI_API_KEY") in their Default impls). The fix is not to modify aisdk — it's that the env-var path is never taken because the assembly layer never calls Default::default().

The credential injection path:

vault (seed)
  → assembly layer (derive + decrypt at startup, per ADR-014/019/025)
    → Capabilities (non-serializable, zeroized, immutable — ADR-014)
      → HandlerRegistration.capabilities (ADR-022, the registration bundle)
        → OperationContext.capabilities (per-request, populated by dispatch
          path from the bundle — ADR-022 §6)
          → from_openapi handler reads context.capabilities.get("openai")
            → injects into HTTP Authorization header
              → reqwest request goes out with vault-derived credential

The from_openapi/from_mcp forwarding handler (living in alknet-http) is the credential injection point. It reads from context.capabilities, not from std::env::var. aisdk's Default impls reading env vars are simply never called — the assembly layer constructs providers with vault-derived credentials through the builder API, or the provider's HTTP calls are routed through from_openapi operations that carry the credential in Capabilities.

This must be a spec-level invariant in alknet-call, not a runtime convention. The dispatch path (build_root_context and OperationEnv::invoke() per ADR-022 §5) populates OperationContext.capabilities from the registration bundle. The invariant is: no handler reads outbound credentials from any source other than OperationContext.capabilities. This is already the architectural intent of ADR-014; the completion work should make it an explicit, documented invariant that the from_openapi/from_mcp handler implementations (in alknet-http) are verified against.
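
As an illustration of the invariant, the sketch below shows a forwarding handler step that can only obtain a credential from the per-request context, and fails when the capability is absent instead of falling back to the environment. The type shapes, the `build_request` helper, the `MissingCapability` error, and the `Bearer` scheme are assumptions, not the crate's API.

```rust
use std::collections::HashMap;

/// Stand-in for `Capabilities`; deliberately has no `Debug` or `Clone`.
pub struct Capabilities {
    secrets: HashMap<String, String>,
}

impl Capabilities {
    pub fn new(entries: &[(&str, &str)]) -> Self {
        let secrets = entries
            .iter()
            .map(|(name, value)| (name.to_string(), value.to_string()))
            .collect();
        Self { secrets }
    }

    pub fn get(&self, name: &str) -> Option<&str> {
        self.secrets.get(name).map(String::as_str)
    }
}

/// Stand-in for the per-request context populated by the dispatch path.
pub struct OperationContext {
    pub capabilities: Capabilities,
}

#[derive(Debug, PartialEq)]
pub struct OutboundRequest {
    pub url: String,
    pub authorization: String,
}

#[derive(Debug, PartialEq)]
pub struct MissingCapability(pub String);

/// The only credential source is `ctx.capabilities`; there is no
/// `std::env::var` fallback, so a missing entry is a hard error.
pub fn build_request(
    ctx: &OperationContext,
    capability: &str,
    url: &str,
) -> Result<OutboundRequest, MissingCapability> {
    let key = ctx
        .capabilities
        .get(capability)
        .ok_or_else(|| MissingCapability(capability.to_string()))?;
    Ok(OutboundRequest {
        url: url.to_string(),
        authorization: format!("Bearer {key}"),
    })
}
```

Making the missing-capability case an error is what turns the invariant into something testable: a handler that silently read an environment variable would pass where this one fails.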

The "Exchange of Operations" Pattern (Runner / Container Service)

This is the canonical downstream pattern alknet-call completion unblocks, made explicit here so Phase 1 specs can reference it. Concrete example: the container service at /workspace/@alkdev/dispatch (axum + russh SSH client for "reverse git runner" over Docker/vast.ai) gets rewritten as a call-protocol service.

Bilateral exchange

Container service (runs on a vast.ai/docker instance):
  Defines Local ops: /container/exec, /container/list, /container/logs...
  (real handlers — calls bollard or vast.ai API)
  Connects to hub as a CallClient (outbound connection — runner pattern)

Hub (central server):
  Runs CallAdapter (server) on alknet/call (already implemented)
  When the container service connects:
    hub runs from_call → discovers /container/* via services/list + services/schema
    registers them as FromCall provenance (leaf, forwarding handlers) in the
    connection's Layer 2 overlay (ADR-024)
  Now the hub (or anything connected to the hub) can call /container/exec
  The from_call handler forwards over the connection back to the container service

Bilateral: the container service ALSO runs from_call against the hub,
  discovers the hub's External ops, and can call them.
  Connection direction (container → hub) is independent of call direction
  (both can call each other) per ADR-017 §2.
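
The exchange above can be modelled in a few lines. This is a toy model under assumptions: `Node` is a hypothetical type holding only operation names, `list_external` stands in for `services/list`, and `import_from` stands in for `from_call` without any forwarding handler or transport.

```rust
use std::collections::BTreeSet;

pub struct Node {
    /// Operations this node implements and exposes (External visibility).
    external: BTreeSet<String>,
    /// Operations imported from the peer (the Layer 2 overlay).
    overlay: BTreeSet<String>,
}

impl Node {
    pub fn new(external: &[&str]) -> Self {
        Self {
            external: external.iter().map(|op| op.to_string()).collect(),
            overlay: BTreeSet::new(),
        }
    }

    /// Stand-in for `services/list`: own operations only, never the overlay.
    pub fn list_external(&self) -> Vec<String> {
        self.external.iter().cloned().collect()
    }

    /// Stand-in for `from_call`: import the peer's external operations.
    pub fn import_from(&mut self, peer: &Node) {
        self.overlay.extend(peer.list_external());
    }

    pub fn imported(&self) -> Vec<&str> {
        self.overlay.iter().map(String::as_str).collect()
    }

    pub fn can_call(&self, op: &str) -> bool {
        self.external.contains(op) || self.overlay.contains(op)
    }
}

/// Who dialed does not matter: both sides import from each other.
pub fn exchange(dialer: &mut Node, acceptor: &mut Node) {
    acceptor.import_from(dialer);
    dialer.import_from(acceptor);
}
```

Because `list_external` never returns overlay entries, the hub does not re-export the container's operations back to the container during the exchange; each overlay contains only the other side's own operations.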

What this requires

1. CallClient: the container service uses it to open the outbound connection to the hub. This is the #1 gap.
2. from_call: both sides run it to populate their Layer 2 overlays with the other side's External ops. This is the #2 gap.
3. OperationAdapter trait: from_call implements it. This is the #3 gap (enabling, not blocking: from_call can be built as a free function before the trait exists, but the trait is needed for alknet-http's adapters).

Why the container service doesn't need alknet-ssh

The current dispatch uses SSH (channel_open_direct_tcpip) as the transport for the "connect back to hub" pattern. Under the call protocol, the container service is a CallClient that dials the hub's alknet/call ALPN directly over QUIC — no SSH in the loop. SSH port forwarding becomes the transitional mechanism for targets that can't run a call-protocol client (the alknet-ssh phase-0 findings document this transition). Once the container service runs a CallClient, SSH is out of the path entirely.

This is the "dev runner" pattern: a call-protocol client that connects back to a hub and exposes the core dev tools (bash, fs, etc.) as operations. The other tools (web search, etc.) plug into the call protocol as additional operations. The agent service (alknet-agent, downstream) is the consumer that orchestrates these via env.invoke().

Implementation Priority Order

Based on the gap analysis and the downstream unblock chain:

1. CallClient (critical): outbound connection opener. Without it, no runner, no container service, no bilateral exchange. Reuses the existing CallConnection (which is already implemented) for the dispatch loop; adds only the connection-establishment + credential-handling half. This is the single highest-value piece of work in the entire alknet-call completion.
2. from_call (critical, depends on CallClient): discovers remote ops via services/list + services/schema, constructs HandlerRegistration bundles with FromCall provenance, registers them in the connection's Layer 2 overlay via CallConnection::register_imported_all(). The discovery mechanism (services/list / services/schema specs + handlers) is already implemented in registry/discovery.rs; from_call is the client-side consumer of that discovery API.
3. OperationAdapter trait (enabling): the async trait (async fn import(&self) -> Result<Vec<HandlerRegistration>, AdapterError>) that from_call, from_openapi, from_mcp, from_jsonschema all implement. Needed before alknet-http's adapter implementations can be built. Small, standalone, unblocks alknet-http Phase 1.
4. from_jsonschema (medium, standalone): schema-only registration, no handler. Useful for validation/discovery without execution. Distinct from from_call (no forwarding behavior). Small.
5. DC-1 resolution (peer-scoped registry filtering): the security dimension of CallClient's registry. Can be addressed in parallel with #1 (it's a filtering layer on the registry the CallClient exposes, not a blocker for the connection-establishment work). Needs an ADR.

What This Completion Unblocks

| Downstream crate | What it needs from alknet-call | Status without completion |
| --- | --- | --- |
| alknet-http | `OperationAdapter` trait (to implement `from_openapi`/`from_mcp`) | Blocked: can't define HTTP-backed adapters without the trait |
| alknet-ssh | Stable alknet-call types (no adapter dependency) | Not blocked: ssh depends on alknet-core, not alknet-call's adapters. Can proceed in parallel. |
| alknet-agent | `CallClient` (tool dispatch), `from_call` (remote tool import), `OperationAdapter` (provider adapters) | Blocked on `CallClient` + `from_call` |
| Container service (dispatch rewrite) | `CallClient` + `from_call` | Blocked: this is the primary consumer |
| Runner pattern (dev runner, opencode runner) | `CallClient` + `from_call` | Blocked: the runner IS a `CallClient` |
| alknet-napi | `CallClient` (Node.js calls remote ops) | Blocked: NAPI projects `CallClient` to JS |

Open Questions to Carry into Phase 1

OQ-CALL-01 (peer-scoped registry filtering shape): the exact mechanism for marking Capabilities entries or HandlerRegistrations as remote-safe (DC-1). Needs an ADR. The existence of filtering is one-way; the shape is two-way.
OQ-CALL-02 (OperationAdapter error type): AdapterError enum shape (DC-4). Two-way door; record in spec amendment.
OQ-CALL-03 (from_call re-import trigger): auto-on-reconnect vs explicit (DC-2). Two-way door; recommend auto-on-reconnect as default.
OQ-CALL-04 (namespace collision behavior): error on collision (DC-3). Two-way door; recommend error as default.

Next Steps

1. Resolve DC-1 (peer-scoped registry filtering): this is the one decision that needs an ADR before CallClient can be implemented correctly. The others (DC-2, DC-3, DC-4) are two-way-door defaults that can be set in the spec amendment and revisited during implementation.
2. Amend the call spec (call-protocol.md, operation-registry.md) to capture: the CallClient gap, the adapter location map, the no-env-vars invariant, the exchange-of-operations pattern, and the DC-2/3/4 defaults.
3. Implement CallClient: the highest-value piece. Reuses CallConnection for the dispatch loop; adds connection establishment + credentials.
4. Implement from_call: consumes the already-implemented services/list + services/schema discovery API.
5. Implement OperationAdapter trait: small, unblocks alknet-http.
6. Implement from_jsonschema: small, standalone.

References

docs/architecture/decisions/017-call-protocol-client-and-adapter-contract.md — the client/adapter contract (specced, partially unimplemented)
docs/architecture/decisions/022-handler-registration-provenance-and-composition-authority.md — registration bundle, provenance, composition authority
docs/architecture/decisions/024-operation-registry-layering.md — Layer 0/1/2 overlay model
docs/architecture/decisions/014-secret-material-flow-and-capability-injection.md — the no-env-vars invariant's foundation
docs/architecture/crates/call/call-protocol.md — CallConnection, Layer 2 overlay, compose_root_env
docs/architecture/crates/call/operation-registry.md — adapter provenance, Capabilities injection
crates/alknet-call/src/ — implementation (verified state above)
/workspace/@alkdev/operations/ — TypeScript prior art (from_openapi.ts, from_mcp.ts, from_schema.ts, scanner.ts)
/workspace/@alkdev/dispatch/ — concrete downstream consumer (container service / "reverse git runner") this completion unblocks
/workspace/aisdk/ — downstream consumer (Rust port of Vercel AI SDK); the no-env-vars invariant makes its std::env::var reads unreachable
/workspace/rust-sdk/ — MCP Rust SDK (rmcp); streamable HTTP transport for alknet-http's from_mcp/to_mcp (separate crate, separate Phase 0)
docs/research/alknet-ssh/phase-0-findings.md — alknet-ssh Phase 0; confirms ssh depends on alknet-core not alknet-call's adapters, so it proceeds in parallel with this completion
