docs: write Phase 0 architecture foundation — ADRs 026-034, spec docs, and task updates

Phase 0a — ADRs (9 new):
- ADR-026: Transport/interface separation (three-layer model)
- ADR-027: Crate decomposition (core, secret, storage, flowgraph, napi, CLI)
- ADR-028: Auth as irpc service (AuthProtocol behind feature flag)
- ADR-029: Identity as core type (Identity + IdentityProvider in alknet-core)
- ADR-030: Static/dynamic config split (ArcSwap, ConfigReloadHandle)
- ADR-031: Forwarding policy (rule-based allow/deny, TransportKind-aware)
- ADR-032: Event boundary discipline (domain, irpc, call protocol boundaries)
- ADR-033: OperationEnv universal composition (three dispatch paths)
- ADR-034: Head/worker terminology (replace hub/spoke)

Phase 0b — New spec documents (7):
- identity.md, services.md, interface.md, configuration.md,
  storage.md, flowgraph.md, secret-service.md

Updated existing docs:
- auth.md: reference identity.md for canonical definitions, add AuthProtocol
- open-questions.md: resolve OQ-12, OQ-16, OQ-18, OQ-22, OQ-23-25
- README.md: add all new docs, ADRs 026-034

Marked 19 architecture tasks as completed.
2026-06-07 09:32:58 +00:00
parent 84f16d66e7
commit 19b3d3a078
38 changed files with 2750 additions and 101 deletions


@@ -0,0 +1,162 @@
# ADR-026: Transport/Interface Separation (Three-Layer Model)
## Status
Accepted
## Context
In the current architecture, SSH is deeply embedded in the server handler. The
`ServerHandler` owns auth, channel management, and proxy logic — all mixed
together. This makes it impossible to run the call protocol over any transport
that doesn't speak SSH, such as:
- **DNS** — encoding call protocol frames as DNS TXT queries/responses for
censorship resistance
- **Raw framing** — 4-byte length prefix + JSON `EventEnvelope` without SSH
wrapping, for local service mesh or browser-to-head direct communication
- **WebTransport** — running call protocol over QUIC streams (browsers can't do
SSH key exchange)
The DNS control channel concept from research (`core.md`) currently conflates
"DNS as a transport that moves bytes" with "SSH sessions over those bytes." But
SSH is not a transport — it's a protocol layer that sits *on top of* a
transport. Separating them enables the DNS control channel to carry call
protocol events directly, without wrapping SSH inside DNS queries.
The same separation enables raw framing (no SSH overhead) for trusted local
networks, and WebTransport direct call protocol for browser clients.
## Decision
**Establish a three-layer model:**
### Layer 1: Transport
Produces byte streams. A `Transport` still produces
`AsyncRead + AsyncWrite + Unpin + Send`. This layer is unchanged from ADR-001.
```rust
#[async_trait]
pub trait Transport: Send + Sync + 'static {
    type Stream: AsyncRead + AsyncWrite + Unpin + Send + 'static;
    async fn connect(&self) -> Result<Self::Stream>;
    fn describe(&self) -> String;
}
```
Transports: TCP, TLS, iroh, DNS (as byte carrier), WebTransport (future).
### Layer 2: Interface
Consumes a `Transport::Stream` and produces call protocol sessions. An
interface is what SSH currently does: wrap a byte stream in session semantics.
```rust
#[async_trait]
pub trait Interface: Send + Sync + 'static {
    type Session;
    async fn accept(stream: TransportStream, config: &InterfaceConfig) -> Result<Self::Session>;
}
```
Interfaces:
- **SSH interface** — wraps existing `ServerHandler` logic. SSH handshake, auth,
channel multiplexing. The call protocol runs over a reserved SSH channel
(`alknet-control:0`).
- **Raw framing interface** — 4-byte big-endian length prefix + JSON
`EventEnvelope`. No SSH overhead. Direct call protocol over the transport
stream.
- **DNS control channel** — a (DNS transport, raw framing interface) pair that
encodes/decodes `EventEnvelope` frames as DNS query/response pairs.
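The raw framing described above can be sketched with the standard library alone. This is a minimal, synchronous illustration that treats the payload as opaque bytes (in alknet it would be the JSON-serialized `EventEnvelope`). The function names `write_frame` and `read_frame` and the `MAX_FRAME_LEN` cap are assumptions made for this example; they are not part of the ADR.

```rust
use std::io::{self, Read, Write};

/// Upper bound on a frame body. An assumption for this sketch, not a value from the ADR.
const MAX_FRAME_LEN: usize = 1 << 20;

/// Writes one frame: a 4-byte big-endian length prefix followed by the payload.
pub fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> io::Result<()> {
    if payload.len() > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    let len = payload.len() as u32;
    w.write_all(&len.to_be_bytes())?;
    w.write_all(payload)?;
    Ok(())
}

/// Reads one frame and returns its payload bytes.
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut len_buf = [0u8; 4];
    r.read_exact(&mut len_buf)?;
    let len = u32::from_be_bytes(len_buf) as usize;
    // Reject oversized lengths before allocating the buffer.
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    Ok(payload)
}
```

Checking the length before allocating keeps a malformed or hostile peer from forcing a large allocation with a single 4-byte header.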
### Layer 3: Protocol
Carries semantics. Call protocol events, operation registry, service calls.
The protocol is agnostic to both the transport and the interface below it. It
receives `EventEnvelope` frames from whatever interface produced them.
### Connection Model
A **connection** is always a (Transport, Interface) pair. The valid combinations are enumerated:
| Transport | Interface | Use case |
|-----------|-----------|----------|
| TLS | SSH | Standard alknet tunnel |
| TCP | SSH | Plain SSH tunnel |
| iroh | SSH | P2P SSH tunnel |
| DNS | raw framing | DNS control channel |
| WebTransport | SSH | Browser SSH tunnel (future) |
| WebTransport | raw framing | Browser call protocol (future) |
| TCP | raw framing | Direct call protocol, local mesh |
**The DNS control channel carries call protocol frames directly — it does NOT
wrap SSH inside DNS.** This is explicit because the research originally
conflated "SSH tunneling over DNS" with "DNS as a transport for call protocol."
The (DNS, raw framing) pair sends `EventEnvelope` frames as DNS TXT
queries/responses — no SSH involved.
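One way to make the table checkable in code is a small validation function. The sketch below assumes the table is exhaustive, so any pair not listed is rejected. `TransportTag`, `InterfaceTag`, and `is_valid_connection` are hypothetical names for this example; the real `TransportKind` carries per-variant fields.

```rust
/// Transport tags only. Hypothetical simplification of `TransportKind`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransportTag {
    Tcp,
    Tls,
    Iroh,
    Dns,
    WebTransport,
}

/// Hypothetical tag for the Layer 2 interface; not a type defined by this ADR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InterfaceTag {
    Ssh,
    RawFraming,
}

/// Returns true only for the (Transport, Interface) pairs listed in the table.
pub fn is_valid_connection(transport: TransportTag, interface: InterfaceTag) -> bool {
    matches!(
        (transport, interface),
        (TransportTag::Tls, InterfaceTag::Ssh)
            | (TransportTag::Tcp, InterfaceTag::Ssh)
            | (TransportTag::Iroh, InterfaceTag::Ssh)
            | (TransportTag::Dns, InterfaceTag::RawFraming)
            | (TransportTag::WebTransport, InterfaceTag::Ssh)
            | (TransportTag::WebTransport, InterfaceTag::RawFraming)
            | (TransportTag::Tcp, InterfaceTag::RawFraming)
    )
}
```

With this shape, `(Dns, Ssh)` is rejected at configuration time, which matches the statement that the DNS control channel never wraps SSH.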
### `TransportKind` Enum
The `TransportKind` enum (currently `Tcp | Tls | Iroh`) gains `Dns` and
`WebTransport` variants. Initially these are tags only — no acceptor
implementation. The full DNS and WebTransport implementations are Phase 4 work
per the integration plan.
```rust
pub enum TransportKind {
    Tcp,
    Tls { server_name: Option<String> },
    Iroh { endpoint_id: String },
    Dns { domain: String },
    WebTransport { host: String },
}
```
### ServerHandler Refactor
The existing `ServerHandler` is refactored into `SshInterface`. The interface
abstraction means the server's accept loop becomes:
```rust
// Pseudocode
let (transport, interface) = listener_config;
let stream = transport.accept().await?;
let session = interface.accept(stream, &config).await?;
// session produces call protocol events
```
The call protocol handler is interface-agnostic — it receives `EventEnvelope`
frames from any interface. Auth, forwarding policy, and operation routing happen
at Layer 3, not inside the SSH handler.
## Consequences
- **Positive**: Enables DNS control channel without SSH wrapping. The (DNS,
raw framing) pair is a clean (Transport, Interface) combination.
- **Positive**: Enables raw framing for local service mesh. No SSH overhead for
trusted networks.
- **Positive**: SSH becomes pluggable. The same call protocol handler works with
any interface.
- **Positive**: `ServerHandler` is refactored into `SshInterface` — a smaller,
more focused component that only handles SSH session management.
- **Positive**: Future WebTransport and WebSocket interfaces are additive — they
implement the `Interface` trait without touching SSH code.
- **Negative**: This is the most invasive code change in Phase 1
(integration-plan, Phase 1.8). SSH auth, channel management, and proxy logic
are currently tangled in `ServerHandler`. Extracting them requires careful
refactoring to maintain existing behavior.
- **Negative**: The `Interface` trait is new and untested. The design must
accommodate both SSH's channel multiplexing and raw framing's single-stream
model through the same abstraction.
## References
- [research/core.md](../../research/core.md) — Transport layer, DNS transport section
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.8, three-layer model
- [transport.md](../transport.md) — Current Transport trait (unchanged at Layer 1)
- [server.md](../server.md) — Current ServerHandler (will become SshInterface)
- [ADR-001](001-pluggable-transport.md) — Transport trait produces stream (unchanged)
- [ADR-004](004-ssh-over-transport.md) — SSH runs over transport (reinforced by Layer 2)
- [ADR-024](024-bidirectional-call-protocol.md) — Bidirectional call protocol (Layer 3)


@@ -0,0 +1,150 @@
# ADR-027: Crate Decomposition
## Status
Accepted
## Context
alknet-core currently contains everything: transport, SSH, auth, config, the
call protocol handler, and the server accept loop. As the project grows to
include SQLite-backed identity, HD key derivation, and metagraph storage, core
would need to depend on rusqlite, bip39, petgraph, and other heavy dependencies
— unacceptable for a library crate that CLI users embed.
Different deployment topologies need different subsets:
- A minimal CLI tunnel only needs core, transport, and auth types
- A head node needs SQLite-backed identity and the secret service
- A flowgraph visualization tool only needs petgraph operations
Circular dependencies must be avoided. alknet-storage implements
alknet-core's `IdentityProvider` trait, so alknet-core cannot depend on
alknet-storage. alknet-storage references alknet-secret's `EncryptedData` wire
format, but not as a crate dependency.
## Decision
**Decompose the project into six crates with a strict acyclic dependency graph.**
### Crate Structure
1. **alknet-core** — Transport, SSH, call protocol, config, auth types, identity,
`OperationSpec`, `Interface` trait. The foundational crate whose types
everything else builds on (in some cases through type or trait compatibility rather than a crate dependency).
- *Depends on*: russh, tokio, irpc (feature-gated), serde, arc-swap
- *Does NOT depend on*: alknet-secret, alknet-storage, alknet-flowgraph
2. **alknet-secret** — BIP39 mnemonic generation, SLIP-0010 Ed25519 HD key
derivation, AES-256-GCM encryption, `SecretProtocol` irpc service.
- *Depends on*: bip39, ed25519-bip32 (or rust-bip32-ed25519), aes-gcm, sha2,
irpc
- *Does NOT depend on*: alknet-core, alknet-storage
3. **alknet-storage** — SQLite-backed metagraph, identity tables, ACL graph,
honker integration, `StorageProtocol` irpc service.
- *Depends on*: rusqlite (via honker), honker, petgraph, jsonschema, irpc
- *Does NOT depend on alknet-core* (but implements alknet-core's
`IdentityProvider` trait via the trait, not a crate dep)
- *Does NOT depend on alknet-secret* (but references `EncryptedData` type
format for wire compatibility)
4. **alknet-flowgraph** — `FlowGraph<N,E>` over petgraph, operation graph, call
graph, type compatibility checking.
- *Depends on*: petgraph, serde, jsonschema, thiserror
- *Does NOT depend on*: alknet-core, alknet-storage, alknet-secret
5. **alknet-napi** — Node.js native addon. Exposes alknet-core to Node.js.
- *Depends on*: alknet-core
- *Does NOT depend on*: alknet-secret, alknet-storage, alknet-flowgraph
6. **alknet** (CLI binary) — Assembles everything.
- *Depends on*: alknet-core, alknet-secret (feature), alknet-storage (feature),
alknet-flowgraph (feature), toml
### Dependency Graph
```
              alknet-secret
             /             \
            /               \
alknet-core ←──── ←── alknet-storage
    ↑       \               /
    │        alknet-flowgraph
alknet-napi
alknet (CLI binary — assembles everything)
```
### Narrow Interface Points
Three types serve as the narrow interface points between crates:
1. **`Identity`** — Defined in `alknet_core::auth`. Used by auth handler,
forwarding policy, and call protocol. alknet-storage implements
`IdentityProvider` to produce instances.
2. **`IdentityProvider`** — Trait defined in `alknet_core::auth`. Implemented by
`ConfigIdentityProvider` (in core) and `StorageIdentityProvider` (in
alknet-storage). The CLI/NAPI layer wires the concrete implementation.
3. **`OperationSpec`** — Defined in `alknet_core::call`. Used by the operation
registry and by alknet-flowgraph for type compatibility checking. The bridge
is serialization — flowgraph serializes to JSON, storage persists it.
### irpc Feature Flag
irpc is a feature flag in alknet-core. When disabled, auth and config go through
`IdentityProvider` and `ConfigReloadHandle` directly — no irpc overhead. Nodes
that only do SSH tunneling don't need the service layer.
In alknet-secret and alknet-storage, irpc is an independent dependency, not
feature-gated. These crates always define irpc service protocols because they
are used in production deployments where the service layer is active.
### alknet-storage's Relationship to alknet-core
alknet-storage does NOT depend on alknet-core as a crate. Instead:
- alknet-storage defines its own `IdentityProvider` impl that matches
alknet-core's trait signature. The trait is re-exported or defined locally
with `#[cfg(feature = "alknet-core")]` interop.
- In practice, the CLI binary crate depends on both and wires them together.
alknet-storage provides `StorageIdentityProvider`; alknet-core takes
`impl IdentityProvider`.
### alknet-storage's Relationship to alknet-secret
alknet-storage does NOT depend on alknet-secret as a crate. Instead:
- alknet-storage and alknet-secret share the `EncryptedData` wire format (key
version, salt, IV, ciphertext). This is a type-level compatibility, not a
crate dependency.
- alknet-secret encrypts; alknet-storage stores the encrypted blob in a
`SecretNode` in the metagraph. The bridge is serialization.
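To illustrate what a shared wire format without a shared crate could look like, here is a hedged sketch. The ADR names the fields (key version, salt, IV, ciphertext) but not their sizes or encoding, so the field widths and the byte layout below are assumptions made for this example only. Each crate could define an equivalent type independently and still interoperate as long as the layout matches.

```rust
use std::convert::TryInto;

/// Hypothetical shape of the shared record; sizes and layout are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EncryptedData {
    pub key_version: u32,
    pub salt: [u8; 16],
    pub iv: [u8; 12], // 96-bit nonce, the usual size for AES-GCM
    pub ciphertext: Vec<u8>,
}

impl EncryptedData {
    /// Fixed-size header: key version (4) + salt (16) + IV (12).
    const HEADER_LEN: usize = 4 + 16 + 12;

    /// Serializes the record as header fields followed by the ciphertext.
    pub fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(Self::HEADER_LEN + self.ciphertext.len());
        out.extend_from_slice(&self.key_version.to_be_bytes());
        out.extend_from_slice(&self.salt);
        out.extend_from_slice(&self.iv);
        out.extend_from_slice(&self.ciphertext);
        out
    }

    /// Parses a record; returns `None` if the input is shorter than the header.
    pub fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < Self::HEADER_LEN {
            return None;
        }
        let key_version = u32::from_be_bytes(bytes[0..4].try_into().ok()?);
        let salt: [u8; 16] = bytes[4..20].try_into().ok()?;
        let iv: [u8; 12] = bytes[20..32].try_into().ok()?;
        Some(Self { key_version, salt, iv, ciphertext: bytes[32..].to_vec() })
    }
}
```

A round-trip test like this, run in both crates against the same fixture bytes, is one way to catch drift in a format that the type system does not enforce.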
## Consequences
- **Positive**: Core is lean. No database, no crypto, no petgraph. CLI users
get a small binary.
- **Positive**: Services are pluggable. alknet-secret and alknet-storage can be
swapped for alternative implementations.
- **Positive**: No circular dependencies. The dependency graph is a DAG.
- **Positive**: Deployment topology determines which crates to include. A CLI
tunnel uses only alknet-core. A head node uses everything.
- **Positive**: irpc is feature-gated in core. Minimal deployments don't pay for
service layer overhead.
- **Negative**: `IdentityProvider` trait interop between alknet-core and
alknet-storage requires careful versioning. If the trait signature changes,
both crates must update.
- **Negative**: `EncryptedData` wire format compatibility between alknet-secret
and alknet-storage is implicit (not enforced by the type system). A shared
types crate could be extracted if needed, but adds another crate dependency.
## References
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 2, dependency graph
- [research/core.md](../../research/core.md) — alknet-core contents
- [research/services.md](../../research/services.md) — Service protocols
- [research/storage.md](../../research/storage.md) — alknet-storage contents
- [research/flow.md](../../research/flow.md) — alknet-flowgraph contents
- [ADR-029](029-identity-core-type.md) — Identity as core type (narrow interface point)


@@ -0,0 +1,146 @@
# ADR-028: Auth as irpc Service
## Status
Accepted
## Context
For head nodes serving many users, in-memory key lookup via `ArcSwap<DynamicConfig>`
doesn't scale. Loading all authorized keys into RAM and atomically swapping the
entire set on each reload works for small deployments. For production deployments
with hundreds or thousands of users, auth verification should query a database on
demand instead.
The current `ArcSwap<DynamicConfig>` approach works for CLI and single-node
setups. What's needed is an async boundary that allows auth verification to go
through a service — locally via channels for minimal deployments, or via irpc
for production deployments where auth runs on a separate process or node.
The critical design point: callers go through the `IdentityProvider` trait
(ADR-029). The irpc service is one way to satisfy the trait. Both paths produce
the same result — an `Identity` or rejection. The trait is the contract; the
service is an implementation path.
## Decision
**Auth verification is provided via an irpc service protocol, with
`IdentityProvider` as the interface contract and `ConfigIdentityProvider`
(ArcSwap-backed) as the default implementation.**
### IdentityProvider Trait (ADR-029) — The Contract
Callers depend on `IdentityProvider`, not on any concrete implementation:
```rust
pub trait IdentityProvider: Send + Sync + 'static {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
    fn resolve_from_token(&self, token: &AuthToken) -> Option<Identity>;
}
```
### ConfigIdentityProvider — Default Implementation
Reads the `auth` section of `ArcSwap<DynamicConfig>`. No database needed. Every
authorized key gets a default scope set. This is the default for CLI and
single-node deployments.
### AuthProtocol irpc Service — Behind Feature Flag
```rust
#[rpc_requests(message = AuthMessage)]
#[derive(Debug, Serialize, Deserialize)]
enum AuthProtocol {
    #[rpc(tx=oneshot::Sender<AuthResult>)]
    #[wrap(VerifyPubkey)]
    VerifyPubkey { fingerprint: String, key_data: Vec<u8> },
    #[rpc(tx=oneshot::Sender<AuthResult>)]
    #[wrap(VerifyToken)]
    VerifyToken { token_bytes: Vec<u8>, timestamp: u64 },
    #[rpc(tx=oneshot::Sender<()>)]
    #[wrap(ReloadKeys)]
    ReloadKeys,
    #[rpc(tx=oneshot::Sender<bool>)]
    #[wrap(CheckAccess)]
    CheckAccess { identity: Identity, operation: String },
}

// The result crosses the service boundary, so it must be serializable too.
#[derive(Debug, Serialize, Deserialize)]
enum AuthResult {
    Ok(Identity),
    Denied(String),
}
```
The `AuthProtocol` is behind the `irpc` feature flag in alknet-core. Nodes
that only do SSH tunneling don't need the service layer overhead. When the
feature is disabled, auth goes through `IdentityProvider` directly.
### AuthServiceImpl
Two implementations exist:
- **ConfigAuthService** — backed by `ConfigIdentityProvider` (ArcSwap path).
Wraps the trait in an irpc service for deployments that use the service layer
but don't have SQLite.
- **StorageAuthService** — backed by SQLite `peer_credentials` and `api_keys`
tables (in alknet-storage). Queries on demand. Can maintain an LRU cache for
hot fingerprints. This is the production implementation.
Both produce the same `AuthResult` — an `Identity` or a denial. Callers don't
know or care which backend is running.
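The delegation pattern can be sketched without irpc. The example below uses `std::sync::mpsc` channels and a thread as a stand-in for the irpc request/reply machinery, so it shows the shape of the idea only, not the irpc API. `spawn_auth_service`, `SingleKey`, and the reduced `Identity` and `AuthMessage` types are hypothetical and exist only for this illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Reduced `Identity` for this sketch (the `resources` field is omitted).
#[derive(Debug, Clone, PartialEq)]
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
}

#[derive(Debug, PartialEq)]
pub enum AuthResult {
    Ok(Identity),
    Denied(String),
}

pub trait IdentityProvider: Send + Sync + 'static {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
}

/// One request variant carrying its own reply channel, mirroring `VerifyPubkey`.
pub enum AuthMessage {
    VerifyPubkey { fingerprint: String, reply: mpsc::Sender<AuthResult> },
}

/// Spawns a service thread that answers requests by delegating to `provider`.
pub fn spawn_auth_service<P: IdentityProvider>(provider: P) -> mpsc::Sender<AuthMessage> {
    let (tx, rx) = mpsc::channel::<AuthMessage>();
    thread::spawn(move || {
        // The loop ends when every sender has been dropped.
        for msg in rx {
            match msg {
                AuthMessage::VerifyPubkey { fingerprint, reply } => {
                    let result = match provider.resolve_from_fingerprint(&fingerprint) {
                        Some(identity) => AuthResult::Ok(identity),
                        None => AuthResult::Denied(format!("unknown key {}", fingerprint)),
                    };
                    // The caller may have gone away; ignore a failed send.
                    let _ = reply.send(result);
                }
            }
        }
    });
    tx
}

/// Toy provider that accepts exactly one fingerprint.
pub struct SingleKey(pub String);

impl IdentityProvider for SingleKey {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity> {
        if self.0.as_str() == fingerprint {
            Some(Identity { id: fingerprint.to_string(), scopes: vec!["relay:connect".to_string()] })
        } else {
            None
        }
    }
}
```

Because the service only sees the trait, swapping `SingleKey` for a database-backed provider would not change the message types or the caller.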
### Integration with IdentityProvider
The irpc service and the trait compose. A caller goes through `IdentityProvider`,
which may internally delegate to the irpc service, or may satisfy the request
locally via `ConfigIdentityProvider`. The deployment topology determines the
path:
- **Minimal (CLI, single-node)**: `ConfigIdentityProvider` reads from
`ArcSwap<DynamicConfig>`. No irpc overhead.
- **Production with local auth**: `AuthServiceImpl` wraps
`StorageIdentityProvider` locally. The handler calls `IdentityProvider` which
routes to the local irpc service.
- **Distributed auth**: Handler on a worker node calls `IdentityProvider` which
routes to a remote auth irpc service over QUIC.
### ConfigService Integration
`AuthProtocol::ReloadKeys` triggers reload of the dynamic config's auth section.
For the `ConfigIdentityProvider` path, this is equivalent to
`ConfigReloadHandle::reload()`. For the `StorageIdentityProvider` path, this
refreshes the LRU cache. Both update atomically — ongoing connections are
unaffected, new connections pick up changes.
## Consequences
- **Positive**: Minimal deployments use `ArcSwap` without irpc overhead. No
database dependency for CLI users.
- **Positive**: Production deployments wire `StorageIdentityProvider` behind the
irpc service. Auth scales to thousands of users without loading all keys into
memory.
- **Positive**: The `IdentityProvider` trait is the only contract callers depend
on. This keeps alknet-core lean and testable.
- **Positive**: Feature flag (`irpc`) keeps core lean for deployments that don't
need the service layer.
- **Positive**: Both paths produce identical `Identity` results. Behavioral
parity is enforced by the shared `Identity` type.
- **Negative**: Two implementations must be kept in sync. `ConfigIdentityProvider`
and `StorageIdentityProvider` must produce the same `Identity` for the same
input. Integration tests should verify this.
- **Negative**: The `irpc` feature flag adds conditional compilation complexity.
The core must compile and work without it, and the service layer must work
with it enabled.
## References
- [research/services.md](../../research/services.md) — AuthService, AuthProtocol definition
- [auth.md](../auth.md) — IdentityProvider trait, Identity struct
- [research/configuration.md](../../research/configuration.md) — Auth service approach
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.4
- [ADR-029](029-identity-core-type.md) — Identity as core type
- [ADR-027](027-crate-decomposition.md) — Crate decomposition


@@ -0,0 +1,107 @@
# ADR-029: Identity as Core Type
## Status
Accepted
## Context
The `Identity` struct and `IdentityProvider` trait are needed by auth,
forwarding policy, and call protocol — three different subsystems in
alknet-core. Without placing them in core, these subsystems would each define
their own identity type, leading to duplication and conversion boilerplate.
The constraint: alknet-core must not depend on alknet-storage or any database.
The `IdentityProvider` trait must be in core so that the handler can resolve
identities without knowing whether the backing store is a config file or a
SQLite database. External crates provide implementations.
Earlier research defined `Identity` inconsistently: `{node_id, fingerprint,
scopes}` in services.md and `{id, scopes, resources}` in auth.md. The unified
model uses `{id, scopes, resources}` where `id` serves as both fingerprint (for
key-based auth from config) and account UUID (for database-backed auth).
## Decision
**`Identity` struct and `IdentityProvider` trait live in `alknet_core::auth`.**
### Identity Struct
```rust
pub struct Identity {
    pub id: String, // Fingerprint (config auth) or account UUID (database auth)
    pub scopes: Vec<String>, // e.g., ["relay:connect", "service:gitea:read"]
    pub resources: HashMap<String, Vec<String>>, // e.g., {"service": ["gitea", "registry"]}
}
```
The `id` field serves dual purpose: when using config-based authentication
(`ConfigIdentityProvider`), it holds the Ed25519 key fingerprint. When using
database-backed authentication (`StorageIdentityProvider`), it holds the account
UUID from the `accounts` table. This keeps the type simple while accommodating
both auth paths.
The `scopes` field provides authorization scope strings used by
`ForwardingPolicy` and `AccessControl` in `OperationSpec`. The `resources`
field provides resource-level authorization beyond what scopes offer (e.g., which
services this identity can access).
### IdentityProvider Trait
```rust
pub trait IdentityProvider: Send + Sync + 'static {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
    fn resolve_from_token(&self, token: &AuthToken) -> Option<Identity>;
}
```
The trait is the contract. Callers (auth handler, forwarding policy, call
protocol) depend on `IdentityProvider` — not on any concrete implementation.
### Default and Production Implementations
- **`ConfigIdentityProvider`** (in alknet-core) — reads the `auth` section of
`ArcSwap<DynamicConfig>`. Every authorized key gets a default scope set.
No database needed. This is the default for minimal deployments.
- **`StorageIdentityProvider`** (in alknet-storage) — backed by SQLite
`peer_credentials` and `api_keys` tables plus the ACL graph. Resolves
fingerprint → account → organization membership → effective scopes. This is
the production implementation for head nodes.
alknet-core never depends on alknet-storage. The trait relationship is:
alknet-core *defines* the trait, alknet-storage *implements* it. The CLI or
NAPI assembly layer wires the concrete implementation.
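A minimal sketch of the define-in-core, implement-elsewhere relationship follows. The trait and `Identity` mirror the definitions above; `StaticKeyProvider`, `authenticate`, and the tuple-struct form of `AuthToken` are hypothetical stand-ins, since the ADR does not specify the token's fields or the internals of `ConfigIdentityProvider`.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
    pub resources: HashMap<String, Vec<String>>,
}

/// Placeholder; the ADR does not define the token's fields.
pub struct AuthToken(pub Vec<u8>);

pub trait IdentityProvider: Send + Sync + 'static {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
    fn resolve_from_token(&self, token: &AuthToken) -> Option<Identity>;
}

/// Config-style provider: a fixed key set, each key given the same default scopes.
pub struct StaticKeyProvider {
    pub authorized: Vec<String>,
    pub default_scopes: Vec<String>,
}

impl IdentityProvider for StaticKeyProvider {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity> {
        if !self.authorized.iter().any(|k| k.as_str() == fingerprint) {
            return None;
        }
        Some(Identity {
            id: fingerprint.to_string(),
            scopes: self.default_scopes.clone(),
            resources: HashMap::new(),
        })
    }

    fn resolve_from_token(&self, _token: &AuthToken) -> Option<Identity> {
        None // token auth is out of scope for this sketch
    }
}

/// Caller side: depends only on the trait, as the handler in core would.
pub fn authenticate(provider: &dyn IdentityProvider, fingerprint: &str) -> Result<Identity, String> {
    provider
        .resolve_from_fingerprint(fingerprint)
        .ok_or_else(|| format!("key {} is not authorized", fingerprint))
}
```

The assembly layer would construct the concrete provider and hand it to the caller as `&dyn IdentityProvider` (or an `Arc<dyn IdentityProvider>`), so the caller never names the implementing crate.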
### Why Not in alknet-storage?
If `Identity` lived in alknet-storage, alknet-core would need to depend on
alknet-storage to use the type — creating a circular dependency (since
alknet-storage implements alknet-core's `IdentityProvider` trait). Placing the
type and trait in core breaks the cycle.
## Consequences
- **Positive**: alknet-core has no database dependency. Auth, forwarding, and
call protocol all use the same `Identity` type.
- **Positive**: alknet-storage implements the core trait. The CLI/NAPI layer
wires the concrete implementation. Deployment topology determines which impl
to use.
- **Positive**: The `id` field serves dual purpose (fingerprint or UUID),
avoiding separate types for config-based and database-based auth.
- **Positive**: `ForwardingPolicy` and `AccessControl` can reference scopes from
`Identity` without knowing where they came from.
- **Negative**: Two implementations of `IdentityProvider` exist — `Config` and
`Storage`. Both must produce identical `Identity` results for the same input.
Tests should verify behavioral parity.
- **Negative**: The trait abstraction adds a level of indirection for the
minimal (config-only) deployment path. The cost is negligible — the
`ConfigIdentityProvider` is a simple `ArcSwap` dereference.
## References
- [auth.md](../auth.md) — IdentityProvider trait, Identity struct, unified auth
- [research/services.md](../../research/services.md) — AuthService, Identity section
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.2
- [ADR-023](023-unified-auth-shared-key-material.md) — Unified auth with shared key material
- [ADR-028](028-auth-irpc-service.md) — Auth as irpc service
- [OQ-18](../open-questions.md) — IdentityProvider owns scopes


@@ -0,0 +1,159 @@
# ADR-030: Static/Dynamic Configuration Split
## Status
Accepted
## Context
Alknet's configuration is loaded once at startup and never changes. This causes
three specific failures:
1. **No hot reload of authentication credentials.** Adding or removing an
authorized key requires restarting the server process. In head/worker
deployments where keys are managed via a database, the process must be
restarted every time a key is added, revoked, or rotated. This is
operationally unacceptable.
2. **No port forwarding access control.** Any authenticated client can open a
`direct-tcpip` channel to any destination. There is no policy governing
which hosts, ports, or alknet control channels a client may access. A
compromised key grants unrestricted network access through the tunnel.
3. **No structured configuration beyond CLI flags.** ADR-011 chose
programmatic-first configuration for the alpha — correct at the time. But as
alknet moves toward publishable releases, operators need config files for
reproducible deployments, and the NAPI layer needs programmatic reload
capability that `ServeOptions` doesn't currently support.
Not all configuration should be reloadable. Transport-level settings (listen
address, TLS certificates, host key) require socket/TLS renegotiation to change
at runtime — effectively a restart. Auth and forwarding policy can change
atomically without disrupting existing connections.
## Decision
**Split configuration into `StaticConfig` and `DynamicConfig`.**
### StaticConfig
Immutable after startup. Constructed from `ServeOptions` (the builder pattern is
preserved). Contains everything that affects socket binding, TLS handshakes, or
SSH session negotiation:
- Transport mode, listen address
- TLS config (cert, key)
- iroh config (relay URL)
- Stealth mode flag
- Host key, host key algorithm
- Max auth attempts, max connections per IP
- Proxy config
Changing any of these requires a restart.
### DynamicConfig
Hot-reloadable at runtime via `ArcSwap<DynamicConfig>`. Contains everything
checked per-connection or per-channel:
- `AuthPolicy` — authorized keys, certificate authorities, token config
- `ForwardingPolicy` — allow/deny rules for channel targets (ADR-031)
- `RateLimitConfig` — rate limiting parameters
`ArcSwap` provides lock-free reads on the hot path: every `auth_publickey()` and
every `channel_open_direct_tcpip()` call loads the current `Arc` pointer, which
adds negligible overhead compared to the current approach. Writes are atomic:
`store()` swaps the pointer. Existing connections finish with their current
config; new connections get the new config.
### ConfigReloadHandle
```rust
pub struct ConfigReloadHandle {
    dynamic: Arc<ArcSwap<DynamicConfig>>,
}

impl ConfigReloadHandle {
    pub fn reload(&self, new_config: DynamicConfig) { ... }
}
```
The handle is obtained from `Server::run()` and passed to NAPI or the CLI.
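The snapshot semantics (existing connections keep their config, new connections see the new one) can be demonstrated with the standard library. The sketch below deliberately substitutes `RwLock<Arc<DynamicConfig>>` for `ArcSwap` so that it has no external dependencies; unlike `ArcSwap`, this stand-in takes a lock on reads. The `load` and `new` methods and the single-field `DynamicConfig` are assumptions for this example.

```rust
use std::sync::{Arc, RwLock};

/// Reduced config for this sketch; the real struct holds auth, forwarding, and rate limits.
#[derive(Debug, Clone, PartialEq)]
pub struct DynamicConfig {
    pub authorized_keys: Vec<String>,
}

/// Stand-in for `Arc<ArcSwap<DynamicConfig>>` using only the standard library.
#[derive(Clone)]
pub struct ConfigReloadHandle {
    dynamic: Arc<RwLock<Arc<DynamicConfig>>>,
}

impl ConfigReloadHandle {
    pub fn new(initial: DynamicConfig) -> Self {
        Self { dynamic: Arc::new(RwLock::new(Arc::new(initial))) }
    }

    /// Takes a snapshot; a connection can keep this `Arc` for its lifetime.
    pub fn load(&self) -> Arc<DynamicConfig> {
        let guard = self.dynamic.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Replaces the current config; snapshots taken earlier are unaffected.
    pub fn reload(&self, new_config: DynamicConfig) {
        *self.dynamic.write().unwrap() = Arc::new(new_config);
    }
}
```

A connection that called `load()` before a reload continues to see the old key set, while the next `load()` returns the new one. The old config is freed once the last snapshot holding it is dropped.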
### ConfigService
The `ConfigService` wraps `ArcSwap<DynamicConfig>` reloads behind an irpc
protocol (behind the `irpc` feature flag) for production deployments that use
the service layer. For minimal deployments (CLI, single-node), direct
`ConfigReloadHandle::reload()` is sufficient.
### TOML Config File
An optional TOML config file covers static config plus initial auth/forwarding
paths. This **amends** ADR-011 (does not supersede it) — the programmatic-first
API remains primary. The config file is a convenience input format:
```toml
[server]
transport = "tls"
listen = "0.0.0.0:443"
stealth = false
max_connections_per_ip = 5
max_auth_attempts = 3
[server.tls]
cert = "/etc/alknet/tls/cert.pem"
key = "/etc/alknet/tls/key.pem"
[auth]
host_key = "/etc/alknet/ssh/host_key"
[forwarding]
default = "deny"
```
### NAPI Reload API
```typescript
interface AlknetServer {
    reloadAuth(auth: { authorizedKeys?: Buffer, certAuthority?: Buffer }): void;
    reloadForwarding(policy: ForwardingPolicyConfig): void;
    reloadAll(config: DynamicConfig): void;
}
```
The NAPI layer parses key data and constructs a new `DynamicConfig`, then calls
`ConfigReloadHandle::reload()`.
### Client Configuration
Client configuration stays as `ConnectOptions` — no `ArcSwap` needed. Client
config is almost entirely static (which server to connect to, which key to use).
## Consequences
- **Positive**: Auth credentials and forwarding policy can be reloaded without
restarting the server. Adding a key via `reloadAuth()` takes effect on the
next connection attempt.
- **Positive**: ADR-011's programmatic-first intent is preserved. The TOML
config file is an optional convenience layer, not a replacement for
`ServeOptions`.
- **Positive**: `ArcSwap` provides lock-free, near-zero-cost reads on the hot
path. Every auth check and every channel open loads a single `Arc` pointer.
- **Positive**: The `ConfigService` irpc protocol (behind feature flag) allows
production deployments to integrate config reload into their service mesh
without taking a direct dependency on `DynamicConfig` internals.
- **Positive**: Forwarding policy is now part of `DynamicConfig` — operators can
restrict access per identity, per destination, per transport (ADR-031).
- **Negative**: Two config structs where there was one. The split is clean
(transport vs. policy) but adds surface area.
- **Negative**: Config file introduces `toml` as a dependency in the CLI crate.
This is acceptable for a CLI binary.
## References
- [research/configuration.md](../../research/configuration.md) — Full analysis
- [ADR-011](011-no-ssh-config-programmatic-api.md) — Programmatic-first API (amended, not superseded)
- [ADR-031](031-forwarding-policy.md) — Forwarding policy (part of DynamicConfig)
- [ADR-029](029-identity-core-type.md) — Identity as core type (DynamicConfig.auth uses IdentityProvider)
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.1


@@ -0,0 +1,138 @@
# ADR-031: Forwarding Policy
## Status
Accepted
## Context
Currently, any authenticated client can open a `direct-tcpip` SSH channel to
any destination. The only gate is authentication — once authenticated, a client
has unrestricted network access through the tunnel. This is a security gap: a
compromised key grants unrestricted access.
Operators need the ability to:
- Restrict which hosts and ports authenticated clients can access
- Apply different rules to different principals (key fingerprints, accounts)
- Restrict WebTransport clients to alknet control channels only
- Set a default policy (allow-all for migration compatibility, deny-all for
production)
## Decision
**Add `ForwardingPolicy` as part of `DynamicConfig` (reloadable without
restart).**
### Type Definitions
```rust
pub struct ForwardingPolicy {
    pub default: ForwardingAction,
    pub rules: Vec<ForwardingRule>,
}

pub struct ForwardingRule {
    pub target: TargetPattern,
    pub action: ForwardingAction,
    pub principals: Vec<String>, // Empty = matches all
    pub transports: Vec<TransportKind>, // Empty = matches all
}

pub enum ForwardingAction {
    Allow,
    Deny,
}

pub enum TargetPattern {
    Any,
    Host(String), // "localhost", "*.example.com"
    Cidr(IpNetwork), // "10.0.0.0/8"
    PortRange(String, Range<u16>), // "localhost", ports 8080-8090
    AlknetPrefix, // Matches alknet-* control channels
}
```
### Rule Evaluation
Rules are evaluated in order. First match wins. If no rule matches, the default
applies. This supports both allowlist and blocklist semantics:
- **Allowlist**: `default: Deny`, then explicit Allow rules for permitted
destinations.
- **Blocklist**: `default: Allow`, then explicit Deny rules for blocked
destinations.
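
The first-match-wins evaluation described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the types are simplified stand-ins for the ADR types (targets are reduced to `Any` and an exact host match), and the `evaluate` and `matches` helpers are hypothetical names.

```rust
// Simplified stand-ins for the ADR types: no CIDR, globs, principals, or transports.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ForwardingAction {
    Allow,
    Deny,
}

#[derive(Debug, Clone)]
pub enum TargetPattern {
    Any,
    Host(String), // exact, case-insensitive match only in this sketch
}

#[derive(Debug, Clone)]
pub struct ForwardingRule {
    pub target: TargetPattern,
    pub action: ForwardingAction,
}

#[derive(Debug, Clone)]
pub struct ForwardingPolicy {
    pub default: ForwardingAction,
    pub rules: Vec<ForwardingRule>,
}

impl TargetPattern {
    fn matches(&self, host: &str) -> bool {
        match self {
            TargetPattern::Any => true,
            TargetPattern::Host(h) => h.eq_ignore_ascii_case(host),
        }
    }
}

impl ForwardingPolicy {
    /// Hypothetical helper: walk the rules in order, return the action of the
    /// first matching rule, and fall back to `default` when nothing matches.
    pub fn evaluate(&self, host: &str) -> ForwardingAction {
        self.rules
            .iter()
            .find(|rule| rule.target.matches(host))
            .map(|rule| rule.action)
            .unwrap_or(self.default)
    }
}
```

With `default: Deny` and a single `Allow` rule for one host, only that host is permitted. Because evaluation stops at the first match, a broad rule placed before a narrow one shadows it, so rule order is part of the policy.
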
### Principals
Each rule can specify which principals it applies to. A principal is an
`Identity.id` (fingerprint or UUID) or a scope from `Identity.scopes`. When the
rule's `principals` field is empty, it matches all identities.
This connects to the `IdentityProvider` trait (ADR-029): when a client
authenticates, the `Identity` is resolved, and the forwarding policy checks
rules against `Identity.id` and `Identity.scopes`.
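
A minimal sketch of the principal check, assuming `Identity` exposes `id` and `scopes` as plain strings. The struct below is a stand-in for the core type from ADR-029, and `principal_matches` is a hypothetical helper name.

```rust
/// Stand-in for the core `Identity` type (ADR-029); only the fields used here.
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
}

/// Hypothetical helper: an empty `principals` list matches every identity.
/// Otherwise at least one entry must equal the identity's id or one of its scopes.
pub fn principal_matches(principals: &[String], identity: &Identity) -> bool {
    principals.is_empty()
        || principals
            .iter()
            .any(|p| *p == identity.id || identity.scopes.contains(p))
}
```

Treating ids and scopes as one flat namespace of strings keeps the rule format simple, but it assumes a scope name can never collide with a fingerprint or UUID.
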
### TransportKind-Aware Rules
Each rule can specify which `TransportKind` it applies to. This enables
transport-specific restrictions — for example, WebTransport clients can be
restricted to `alknet-*` control channels only:
```rust
ForwardingRule {
    target: TargetPattern::AlknetPrefix,
    action: ForwardingAction::Allow,
    principals: vec![],
    transports: vec![TransportKind::WebTransport { host: "*".into() }],
}
```
### Where the Policy Check Happens
The forwarding policy check occurs in `channel_open_direct_tcpip` before the
proxy task is spawned. The current behavior (no check) is equivalent to
`ForwardingPolicy::allow_all()` — default Allow, no rules. This preserves
backward compatibility during migration.
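
As a rough sketch of that gate, the check could be a small function called before the proxy task is spawned. The `authorize_channel_open` and `deny_all` names are hypothetical, and rules are reduced to exact-host pairs. Only `allow_all()` is named in this ADR.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ForwardingAction {
    Allow,
    Deny,
}

pub struct ForwardingPolicy {
    pub default: ForwardingAction,
    /// Rules reduced to (exact host, action) pairs for this sketch.
    pub rules: Vec<(String, ForwardingAction)>,
}

impl ForwardingPolicy {
    /// Default Allow with no rules: equivalent to the current unchecked behavior.
    pub fn allow_all() -> Self {
        Self { default: ForwardingAction::Allow, rules: Vec::new() }
    }

    /// Hypothetical counterpart for production deployments.
    pub fn deny_all() -> Self {
        Self { default: ForwardingAction::Deny, rules: Vec::new() }
    }

    /// First matching rule wins; otherwise the default applies.
    pub fn evaluate(&self, host: &str) -> ForwardingAction {
        self.rules
            .iter()
            .find(|(h, _)| h.as_str() == host)
            .map(|(_, action)| *action)
            .unwrap_or(self.default)
    }
}

/// Hypothetical gate, intended to run at the top of `channel_open_direct_tcpip`
/// so that a denied request never reaches the proxy task.
pub fn authorize_channel_open(
    policy: &ForwardingPolicy,
    host: &str,
    port: u16,
) -> Result<(), String> {
    match policy.evaluate(host) {
        ForwardingAction::Allow => Ok(()),
        ForwardingAction::Deny => Err(format!("forwarding to {host}:{port} denied by policy")),
    }
}
```
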
### DynamicConfig Integration
`ForwardingPolicy` is part of `DynamicConfig` and reloadable via
`ConfigReloadHandle::reload()` or NAPI's `reloadForwarding()`. Changes take
effect on the next channel open — existing connections continue with their
current policy.
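
The "next channel open" semantics can be illustrated with a pointer swap. ADR-030 names `ArcSwap` for this. The sketch below substitutes a standard-library `RwLock<Arc<_>>` so it stays dependency-free, and `PolicyStore`, `snapshot`, and `reload` are hypothetical names.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ForwardingAction {
    Allow,
    Deny,
}

#[derive(Debug)]
pub struct ForwardingPolicy {
    pub default: ForwardingAction,
}

/// Hypothetical holder for the reloadable policy. `RwLock<Arc<_>>` stands in
/// for the `ArcSwap` named in ADR-030.
pub struct PolicyStore {
    current: RwLock<Arc<ForwardingPolicy>>,
}

impl PolicyStore {
    pub fn new(policy: ForwardingPolicy) -> Self {
        Self { current: RwLock::new(Arc::new(policy)) }
    }

    /// Taken on each channel open: the caller keeps a stable snapshot.
    pub fn snapshot(&self) -> Arc<ForwardingPolicy> {
        let guard = self.current.read().unwrap();
        Arc::clone(&guard)
    }

    /// Swaps the pointer. Snapshots handed out earlier are unaffected.
    pub fn reload(&self, policy: ForwardingPolicy) {
        *self.current.write().unwrap() = Arc::new(policy);
    }
}
```

A connection that took its snapshot before a reload keeps evaluating against the old policy, while the next `snapshot()` call observes the new one.
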
### OQ Resolutions
- **OQ-12** (Per-user forwarding scope vs global rules): Resolved. Start with
global rules plus principal matching against `Identity.scopes`. Per-user scopes
are sourced from `peer_credentials.metadata.scopes` via `IdentityProvider`.
- **OQ-16** (Transport-specific forwarding): Resolved. Add `TransportKind`
match in `ForwardingRule`. WebTransport clients can be restricted.
- **OQ-18** (Source of Identity.scopes): Resolved by ADR-029 and this ADR.
`IdentityProvider` owns scopes. `ForwardingPolicy` consumes them.
## Consequences
- **Positive**: Operators can restrict access per identity, per destination, per
transport. A compromised key no longer grants unrestricted network access.
- **Positive**: Default-allow preserves current behavior during migration. Switch
to default-deny for production deployments.
- **Positive**: Policy is reloadable without restart. Adding a rule via
`reloadForwarding()` takes effect on the next channel open.
- **Positive**: `TransportKind`-aware rules enable transport-specific
restrictions (e.g., WebTransport clients restricted to alknet-* channels).
- **Negative**: Another check in the hot path (every `channel_open_direct_tcpip`
call). The cost is a linear scan of rules — acceptable for small rule sets.
Large rule sets should use compiled matchers (future optimization).
- **Negative**: `TargetPattern` string matching is lenient. Host patterns like
`*.example.com` require careful implementation to prevent bypasses. The
`glob` or `globset` crate can handle this correctly.
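
To illustrate the bypass risk for host patterns, here is a minimal hand-rolled matcher, assuming only exact names and a single leading `*.` wildcard are supported. `host_matches` is a hypothetical helper, not a replacement for a proper glob crate.

```rust
/// Hypothetical matcher for `TargetPattern::Host` strings. Supports an exact
/// name or one leading `*.` wildcard; it is not a general glob engine.
pub fn host_matches(pattern: &str, host: &str) -> bool {
    // Normalize case and any trailing root dot before comparing.
    let pattern = pattern.trim_end_matches('.').to_ascii_lowercase();
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    match pattern.strip_prefix("*.") {
        // The wildcard must end at a label boundary and cover at least one
        // non-empty label, so `evilexample.com` and `example.com` do not match.
        Some(suffix) => host
            .strip_suffix(suffix)
            .and_then(|rest| rest.strip_suffix('.'))
            .map_or(false, |labels| !labels.is_empty()),
        None => pattern == host,
    }
}
```

In this sketch the wildcard also covers multiple labels (`a.b.example.com`). Whether that is desirable is a policy decision the real implementation would need to make explicitly.
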
## References
- [research/configuration.md](../../research/configuration.md) — ForwardingPolicy section
- [auth.md](../auth.md) — Identity.scopes and IdentityProvider
- [open-questions.md](../open-questions.md) — OQ-12, OQ-16, OQ-18
- [ADR-029](029-identity-core-type.md) — Identity as core type
- [ADR-030](030-static-dynamic-config-split.md) — DynamicConfig (ForwardingPolicy is part of it)
- [integration-plan.md](../../research/integration-plan.md) — Phase 1.3


@@ -0,0 +1,96 @@
# ADR-032: Event Boundary Discipline
## Status
Accepted
## Context
The research identified three distinct communication patterns in the system, and
conflating them is a known anti-pattern in event-driven architectures:
1. **Domain events** (Honker streams) — Internal to the service that owns that
data. Used for state reconstruction within the service's own boundaries.
Examples: `nodes:created`, `edges:deleted`, `accounts:updated`.
2. **irpc service calls** — Synchronous request-response within a node or
cluster. Internal to the system. Examples: `AuthProtocol::VerifyPubkey`,
`SecretProtocol::DeriveEd25519`, `ConfigProtocol::ReloadForwarding`.
3. **Call protocol events** (`EventEnvelope`) — Asynchronous integration events
that cross node boundaries. External to the system. Examples:
`call.requested`, `call.responded`, `call.completed`, `call.aborted`.
Without a hard constraint, it's tempting to have one service subscribe directly
to another service's Honker streams. This leads to:
- **Leaky event store**: Service A reads Service B's domain events directly,
coupling A to B's internal state representation. When B changes its schema, A
breaks.
- **Boomerang coupling**: An integration event is too thin, causing the
consumer to call back to the source service synchronously to get details. This
negates the benefit of async communication.
- **Fat notification trap**: A notification event carries full entity state,
when it should use state transfer instead.
## Decision
**Event boundary discipline is a hard architectural constraint, not a
suggestion.**
1. **Domain events stay within the owning service.** A Honker stream published
by the storage service (`nodes:created`) is for the storage service's own
state reconstruction. No other service reads these stream events directly.
2. **irpc service calls are synchronous and internal.** They stay within a node
or cluster and never act as the cross-node integration boundary. They are
request-response, not events. They should not be used as a substitute for
integration events.
3. **Call protocol events are the only events that cross node boundaries.**
`EventEnvelope` frames are the integration boundary. When a domain event
needs to be communicated to another node, it must be projected into a call
protocol event.
4. **Projection from domain events to integration events is required when
crossing boundaries.** A service that owns a Honker stream must project
relevant state changes into `EventEnvelope` frames before they leave the
node. The projection strips internal details and produces a versioned,
stable integration event.
This discipline applies at three levels:
```
Call Protocol (Layer 3, external, JSON)
  └── irpc Service (Layer 3, internal, postcard)
       └── Honker Streams (Domain events, within service boundary)
```
A call protocol handler MAY call an irpc service internally (e.g.,
`/head/auth/verify` calls `AuthProtocol::VerifyPubkey`). The irpc service MAY
use Honker streams for its own state management. But domain events never
propagate beyond the service boundary without projection.
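
The projection requirement in the Decision can be sketched as a plain function from a domain event to an integration event. Everything here is an assumption for illustration: the `NodeCreated` fields, the `storage.node_created` type name, and the simplified `EventEnvelope` shape, whose real definition belongs to the call protocol spec.

```rust
/// Hypothetical domain event as the storage service might keep it internally.
pub struct NodeCreated {
    pub node_id: String,
    pub label: String,
    pub rowid: i64,         // internal storage detail, must not leave the service
    pub table_name: String, // internal storage detail, must not leave the service
}

/// Simplified stand-in for `EventEnvelope`. Key/value pairs stand in for the
/// JSON payload so the sketch needs no serialization crate.
#[derive(Debug, Clone, PartialEq)]
pub struct EventEnvelope {
    pub event_type: String,
    pub schema_version: u32,
    pub fields: Vec<(String, String)>,
}

/// Projects the domain event into a versioned integration event, copying only
/// the fields that form the public contract.
pub fn project_node_created(event: &NodeCreated) -> EventEnvelope {
    EventEnvelope {
        event_type: "storage.node_created".to_string(), // hypothetical name
        schema_version: 1,
        fields: vec![
            ("node_id".to_string(), event.node_id.clone()),
            ("label".to_string(), event.label.clone()),
        ],
    }
}
```

Because the projection lists its output fields explicitly, adding a column to the internal event does not change what other nodes see until someone edits the projection on purpose.
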
## Consequences
- **Positive**: Prevents leaky event stores. Services are independently
deployable and their internal schemas can evolve without breaking consumers.
- **Positive**: Honker and irpc are implementation details, not cross-boundary
contracts. The call protocol's `EventEnvelope` is the only stable, versioned
contract that other nodes depend on.
- **Positive**: Clear ownership. Each service owns its Honker streams and can
change them freely. Integration events are a deliberate, reviewed contract.
- **Positive**: Makes testing easier. Services can be tested in isolation with
mock domain events. Integration events are tested against the `EventEnvelope`
schema.
- **Negative**: Projection code is required. Every domain event that needs to
cross a boundary must be explicitly projected. This is deliberate — the
overhead ensures the integration contract is intentional.
- **Negative**: Developers must resist the temptation to subscribe directly to
Honker streams across services. Code review should catch this pattern.
## References
- [research/services.md](../../research/services.md) — Event boundary discipline section
- [research/storage.md](../../research/storage.md) — Honker integration, event boundaries
- [research/integration-plan.md](../../research/integration-plan.md) — ADR 032 entry
- [event_source_types.md](/workspace/research/event_sourcing/event_source_types.md) — Event-driven architecture patterns


@@ -0,0 +1,130 @@
# ADR-033: OperationEnv as Universal Composition Mechanism
## Status
Accepted
## Context
The `@alkdev/operations` TypeScript package defines `OperationEnv` as a
universal composition mechanism. A handler receives `context.env[namespace][op](input)`
and can invoke any registered operation regardless of whether it runs locally, in
an irpc service on the same cluster, or on a remote node via call protocol.
The research documents define three dispatch paths:
1. **Local dispatch** — direct function call through the operation registry
2. **Service dispatch** — irpc protocol call to a service backend
3. **Remote dispatch** — call protocol `EventEnvelope` to a remote node
Without a formal decision, irpc services could be seen as a replacement for
OperationEnv or for the call protocol. They are not — irpc is one dispatch
backend for OperationEnv, not a replacement for anything. The call protocol is
another dispatch backend. OperationEnv unifies them from the handler's
perspective.
The three communication patterns in the system (ADR-032) are:
- Domain events (Honker streams) — internal to the owning service
- irpc service calls — synchronous, in-cluster
- Call protocol events — asynchronous, cross-node
irpc services and call protocol operations serve different scopes but must
compose cleanly through OperationEnv.
## Decision
**OperationEnv is the universal composition mechanism that all operation
handlers receive. It provides namespace + operation name → invoke with input,
return output, regardless of dispatch path.**
### OperationEnv Behavioral Contract
```rust
// The behavioral contract: given a namespace and operation name, invoke the
// operation with the given input and return the output. The handler neither
// knows nor cares whether the dispatch is local, via irpc, or via call protocol.
pub trait OperationEnv: Send + Sync {
    fn invoke(&self, namespace: &str, operation: &str, input: Value) -> ResponseEnvelope;
}
```
The Rust implementation may use typed method dispatch or a registry behind the
scenes, but the handler-facing API must preserve this contract.
### Three Dispatch Paths
OperationEnv resolves each call to one of three dispatch backends:
| Path | Mechanism | Serialization | Scope |
|------|-----------|---------------|-------|
| Local | Direct function call through registry | None (in-process) | Same process |
| Service | irpc protocol enum dispatch | postcard (binary) | Same cluster |
| Remote | Call protocol `EventEnvelope` | JSON | Cross-node |
All three produce the same `ResponseEnvelope`. The handler always calls
`context.env.invoke("secrets", "derive", input)` and gets a `ResponseEnvelope`
back.
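
One way to picture the routing is a namespace-keyed table of backends. The sketch below is an assumption-heavy illustration: `Value` and `ResponseEnvelope` are stand-ins, every backend is reduced to a plain function, and `RoutingEnv`, `Backend`, `with`, and `echo` are hypothetical names.

```rust
use std::collections::HashMap;

/// Stand-ins for the real types, which are defined elsewhere in the project.
pub type Value = String;

#[derive(Debug, Clone, PartialEq)]
pub struct ResponseEnvelope {
    pub ok: bool,
    pub output: Value,
}

/// Every backend is a plain function here. A real Service backend would encode
/// with postcard and call irpc; a real Remote backend would send an `EventEnvelope`.
pub type Handler = fn(&str, Value) -> ResponseEnvelope;

pub enum Backend {
    Local(Handler),
    Service(Handler),
    Remote(Handler),
}

#[derive(Default)]
pub struct RoutingEnv {
    backends: HashMap<String, Backend>,
}

impl RoutingEnv {
    /// Registers the backend that serves a namespace.
    pub fn with(mut self, namespace: &str, backend: Backend) -> Self {
        self.backends.insert(namespace.to_string(), backend);
        self
    }

    /// The caller sees the same signature and result type on every path.
    pub fn invoke(&self, namespace: &str, operation: &str, input: Value) -> ResponseEnvelope {
        match self.backends.get(namespace) {
            Some(Backend::Local(h)) | Some(Backend::Service(h)) | Some(Backend::Remote(h)) => {
                h(operation, input)
            }
            None => ResponseEnvelope {
                ok: false,
                output: format!("unknown namespace: {namespace}"),
            },
        }
    }
}

/// Example handler used to exercise the routing.
pub fn echo(operation: &str, input: Value) -> ResponseEnvelope {
    ResponseEnvelope { ok: true, output: format!("{operation}({input})") }
}
```

Moving the `secrets` namespace from `Backend::Service` to `Backend::Local` changes only the registration call; the code that calls `invoke` is untouched.
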
### Service Assembly
The deployment topology determines which dispatch path each operation uses:
```rust
// Minimal deployment (single node, all local)
let env = OperationEnv::local(local_registry);

// Production deployment (mix of local and remote)
let env = OperationEnv::new()
    .local("auth", auth_registry)            // Auth runs locally
    .local("config", config_registry)        // Config runs locally
    .service("secrets", secret_irpc_client)  // Secret service via irpc
    .remote("worker-1", call_protocol_conn); // Worker-1 operations via call protocol
```
### irpc Services Are One Dispatch Backend
irpc services (`AuthProtocol`, `SecretProtocol`, `ConfigProtocol`) define the
wire format for in-cluster communication. They are Rust-to-Rust, type-safe,
and efficient. But they are not a replacement for OperationEnv or for the call
protocol. They are one dispatch backend.
An irpc service can be exposed as a call protocol operation:
`/head/auth/verify` receives a call protocol event and internally calls
`AuthProtocol::VerifyPubkey` via irpc. The layers compose:
```
Call Protocol (Layer 3, external, JSON)
  └── irpc Service (Layer 3, internal, postcard)
       └── Honker Streams (Domain events, within service boundary)
```
### Adapters Map to OperationEnv
HTTP (`POST /v1/{namespace}/{op}`), MCP (`tools/call`), DNS
(`{op}.{namespace}.alk.dev TXT?`), and call protocol
(`call.requested`) all resolve through OperationEnv. This is what makes
operations universally composable across all interfaces.
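
As a sketch of how two of these adapters might reduce their addressing to a `(namespace, op)` pair before calling OperationEnv: the parsing rules below are assumptions derived only from the two patterns shown above, and both function names are hypothetical.

```rust
/// Hypothetical HTTP adapter helper: `/v1/{namespace}/{op}` -> (namespace, op).
pub fn parse_http_path(path: &str) -> Option<(String, String)> {
    let rest = path.strip_prefix("/v1/")?;
    let (namespace, op) = rest.split_once('/')?;
    // Reject empty segments and extra path components.
    if namespace.is_empty() || op.is_empty() || op.contains('/') {
        return None;
    }
    Some((namespace.to_string(), op.to_string()))
}

/// Hypothetical DNS adapter helper: `{op}.{namespace}.alk.dev` -> (namespace, op).
/// Case folding of DNS names is left out of this sketch.
pub fn parse_dns_name(name: &str) -> Option<(String, String)> {
    let name = name.trim_end_matches('.'); // tolerate a trailing root dot
    let rest = name.strip_suffix(".alk.dev")?;
    let (op, namespace) = rest.split_once('.')?;
    // Reject empty labels and extra labels between the op and the zone.
    if op.is_empty() || namespace.is_empty() || namespace.contains('.') {
        return None;
    }
    Some((namespace.to_string(), op.to_string()))
}
```

Both helpers return the same pair for the same operation, so what follows the parse (the `invoke` call) can be shared across adapters.
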
## Consequences
- **Positive**: Handlers compose through a single interface. Adding a new
dispatch path (e.g., a new irpc service) doesn't change handler code.
- **Positive**: irpc and call protocol coexist naturally. The handler doesn't
know which path was taken.
- **Positive**: Adapters (MCP, HTTP, DNS) map to operations through the same
OperationEnv interface. One handler, multiple dispatch paths.
- **Positive**: Deployment topology determines dispatch, not code. Same handler
works locally, in-cluster, or cross-node.
- **Negative**: OperationEnv is a new abstraction that must coexist with the
existing call protocol handler pattern. The registry currently maps paths to
handlers; OperationEnv adds namespace-aware composition on top.
- **Negative**: The `@alkdev/operations` TypeScript model (a nested map from
namespace to operation name to function, roughly `HashMap<String,
HashMap<String, fn>>` in Rust terms) needs an idiomatic Rust translation. The
behavioral contract must match, but the implementation can differ.
## References
- [research/services.md](../../research/services.md) — OperationContext, OperationEnv
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.5, OperationEnv wiring
- [ADR-032](032-event-boundary-discipline.md) — Event boundary discipline
- [ADR-024](024-bidirectional-call-protocol.md) — Bidirectional call protocol
- [ADR-025](025-handler-spec-separation.md) — Handler/spec separation


@@ -0,0 +1,55 @@
# ADR-034: Head/Worker Terminology
## Status
Accepted
## Context
The project previously used hub/spoke terminology for describing node
relationships: a hub node that coordinates connections and spokes that connect to
it. This terminology implies a strict star topology where the hub is
fundamentally different from spokes.
In practice, a coordinating node can also execute operations (run services,
forward traffic). Any node can become a coordinator. The architecture supports
mesh topologies where nodes coordinate in a peer-to-peer fashion.
The research documents (`core.md`, `services.md`) and updated architecture
specs (`call-protocol.md`, `auth.md`, `napi-and-pubsub.md`, `open-questions.md`)
already use head/worker consistently. Existing ADRs (024, 025) retain their
original hub/spoke language because ADRs are historical records.
## Decision
**Use head/worker terminology throughout the project.**
- **Head node**: A node that coordinates — accepts connections, routes
operations, manages cluster state. A head is also a worker (it can execute
operations).
- **Worker node**: A node that connects to a head, registers its services, and
executes operations. Any worker can become a head.
- **Node**: Any participant in the network. Every node has an Ed25519 identity.
The terms hub and spoke are deprecated in all new specs, code, and
documentation. Existing ADRs retain their original language as historical
records — ADRs document what was decided at the time, not what the current
terminology is.
## Consequences
- **Positive**: Natural mesh formation. A head that is also a worker enables
multi-hop routing, redundancy, and distributed topologies without a
centralized authority.
- **Positive**: Consistency with integration plan and research documents.
- **Positive**: The terminology better reflects the architecture — there is no
single "hub" that's fundamentally different from "spokes."
- **Neutral**: Existing ADRs (024, 025) retain hub/spoke in their text. This is
intentional — ADRs are historical records.
## References
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 0 ADR 034 entry, inconsistencies section
- [ADR-024](024-bidirectional-call-protocol.md) — Uses hub/spoke historically
- [ADR-025](025-handler-spec-separation.md) — Uses hub/spoke historically
- [research/core.md](../../research/core.md) — Head/worker terminology