docs: add configuration architecture research

Explore static/dynamic config split, hot-reloadable auth via ArcSwap, forwarding policy, multi-transport listeners, and config file format. Documents three problems: no auth hot-reload, no forwarding access control, no structured config beyond CLI flags. Key findings: - Static config (transport, TLS, host key) loaded once at startup - Dynamic config (auth, forwarding, rate limits) reloadable via ArcSwap - ForwardingPolicy with rule-based allow/deny, first-match evaluation - Multi-transport: Server spawns Vec<ListenerConfig> sharing auth config - WebTransport out of scope for now (requires separate auth model) - Proposes ADR-020 (static/dynamic split), ADR-021 (forwarding policy), ADR-022 (multi-transport listeners) Adds OQ-12 through OQ-17 to open-questions.md.
2026-06-04 09:40:58 +00:00
parent 5ffcf9232b
commit 41062d810e
3 changed files with 622 additions and 3 deletions
--- a/docs/research/configuration.md
+++ b/docs/research/configuration.md
@@ -0,0 +1,568 @@
+---
+status: draft
+last_updated: 2026-06-04
+phase: exploration
+---
+
+# Configuration Architecture
+
+## Problem
+
+Wraith's configuration is loaded once at startup and never changes. This has
+three specific failures:
+
+1. **No hot reload of authentication credentials.** Adding or removing an
+   authorized key requires restarting the server process. In a hub/spoke
+   deployment where keys are managed via a database (see
+   `@alkdev/storage`'s `peer_credentials` table), the wraith process must be
+   restarted every time a key is added, revoked, or rotated. This is
+   operationally unacceptable for a production service.
+
+2. **No port forwarding access control.** Any authenticated client can open a
+   `direct-tcpip` channel to any destination. There is no policy governing
+   which hosts, ports, or `wraith-*` control channels a client may access. This
+   is a security gap — a compromised key grants unrestricted network access
+   through the tunnel.
+
+3. **No structured configuration beyond CLI flags.** ADR-011 chose
+   programmatic-first configuration for the alpha. This was correct — it
+   avoided cross-platform path issues and kept the API surface small. But as
+   wraith moves toward publishable releases, operators need config files for
+   reproducible deployments, and the NAPI layer needs programmatic reload
+   capability that the current `ServeOptions` builder pattern doesn't support.
+
+### What's Not The Problem
+
+- This does not propose depending on Honker, SQLite, or any specific data
+  source at the `wraith-core` level. The core provides a reload mechanism;
+  data sources plug in from outside.
+- This does not propose file-watching (potential attack vector, unnecessary
+  complexity). CLI usage loads config once at startup. Programmatic usage
+  (NAPI, hub) calls reload explicitly.
+- This does not replace the existing `ServeOptions` builder pattern. It
+  generalizes it.
+
+## Analysis
+
+### Static vs Dynamic Configuration
+
+Not all configuration should be reloadable. Transport-level settings (listen
+address, TLS certificates, host key) require socket/TLS renegotiation to change
+at runtime — effectively a restart. Auth and forwarding policy can change
+atomically without disrupting existing connections.
+
+| Category | Examples | Reloadable? |
+|---|---|---|
+| Transport | listen addr, TLS cert/key, iroh relay, stealth mode | No — requires bind change |
+| Identity | host key, host key algorithm | No — requires SSH re-negotiation |
+| Auth | authorized keys, cert authorities | **Yes** — next auth check picks up changes |
+| Forwarding | allowed destinations, per-principal rules | **Yes** — next channel open picks up changes |
+| Rate limits | max connections per IP, max auth attempts | **Yes** — next check picks up changes |
+
+The split is clean: anything that affects the SSH handshake or socket binding
+is static. Anything that's checked per-connection or per-channel is dynamic.
+
+### Current Architecture
+
+```
+ServeOptions (builder) → Server::new()
+  ├─ Arc<server::Config>          (russh config, immutable)
+  ├─ Arc<ServerAuthConfig>        (keys + CAs, immutable after load)
+  ├─ Arc<ConnectionRateLimiter>   (mutable but not reloadable)
+  └─ ServerHandler::new(auth_config, ...)
+
+ServerHandler
+  ├─ auth_config: Arc<ServerAuthConfig>  ← shared, immutable
+  ├─ connection_limiter: Arc<ConnectionRateLimiter>
+  ├─ outbound_proxy: Option<ProxyConfig>
+  └─ (no forwarding policy field)
+```
+
+`auth_publickey()` reads from `self.auth_config` via `Arc` dereference. No
+path to update it.
+
+### Proposed Architecture
+
+Replace `Arc<ServerAuthConfig>` with a reloadable provider:
+
+```
+StaticConfig (Arc, loaded once)
+  ├─ transport mode, listen addr, TLS config, iroh config
+  ├─ stealth, proxy
+  ├─ host key
+  └─ max_auth_attempts, max_connections_per_ip
+
+DynamicConfig (Arc<ArcSwap<DynamicConfig>>, reloadable)
+  ├─ auth: ServerAuthConfig
+  ├─ forwarding: ForwardingPolicy
+  └─ rate_limits: RateLimitConfig
+
+ConfigReloadHandle (exposed to NAPI)
+  └─ reload(DynamicConfig)
+```
+
+`ArcSwap` provides lock-free reads on the hot path. Every `auth_publickey()`
+and `channel_open_direct_tcpip()` call does an `Arc` dereference — zero cost
+compared to the current approach. Writes are atomic: `store()` swaps the
+pointer. Existing connections finish with their current config, new connections
+get the new config.
+
+### Forwarding Policy
+
+Currently, `channel_open_direct_tcpip` in `handler.rs` spawns a proxy task for
+any destination. The only gate is authentication. A forwarding policy adds a
+check before the proxy spawn:
+
+```rust
+pub struct ForwardingPolicy {
+    default: ForwardingAction,
+    rules: Vec<ForwardingRule>,
+}
+
+pub struct ForwardingRule {
+    target: TargetPattern,
+    action: ForwardingAction,
+    principals: Vec<String>,
+}
+
+pub enum ForwardingAction { Allow, Deny }
+pub enum TargetPattern {
+    Any,
+    Host(String),
+    Cidr(IpNetwork),
+    PortRange(String, Range<u16>),
+    WraithPrefix,
+}
+```
+
+Rule evaluation: first match wins, default applies if no rule matches. This
+model maps to OpenSSH's `AllowTcpForwarding` + `PermitOpen` but is more
+expressive. It also maps to `peer_credentials.metadata.scopes` in `@alkdev/storage`
+— the hub can generate forwarding rules from stored scopes.
+
+Rule ordering matters. A deny-then-allow pattern gives blocklist semantics. An
+allow-then-deny pattern gives allowlist semantics. Both are useful. The
+default determines the fallback.
+
+### Configuration File Format
+
+ADR-011 chose "programmatic-first, no config file." This was correct for alpha.
+For publishable releases, a config file enables:
+
+- Reproducible deployments (version-controlled config)
+- Less verbose CLI invocations
+- Separate files for static and dynamic config (only static needs to be in the
+  config file; dynamic comes from the reload mechanism)
+
+TOML is the idiomatic Rust choice. The config file covers static config only —
+the same fields as `ServeOptions`. Dynamic config (auth, forwarding) comes from
+the reload mechanism, not from the file. This preserves ADR-011's intent: the
+core doesn't know about the data source for auth keys, it just provides a place
+to put them.
+
+```toml
+[server]
+transport = "tls"
+listen = "0.0.0.0:443"
+stealth = false
+max_connections_per_ip = 5
+max_auth_attempts = 3
+
+[server.tls]
+cert = "/etc/wraith/tls/cert.pem"
+key = "/etc/wraith/tls/key.pem"
+
+[server.iroh]
+relay = "https://relay.alk.dev"
+
+[auth]
+host_key = "/etc/wraith/ssh/host_key"
+
+[forwarding]
+default = "deny"
+
+[[forwarding.rules]]
+target = "localhost:*"
+action = "allow"
+
+[[forwarding.rules]]
+target = "wraith-*"
+action = "allow"
+
+[[forwarding.rules]]
+target = "*:22"
+action = "deny"
+```
+
+The `[[forwarding.rules]]` array syntax is TOML's array-of-tables pattern.
+Rules are evaluated in order; first match wins.
+
+### NAPI Reload API
+
+The NAPI layer exposes the reload handle:
+
+```typescript
+interface WraithServer {
+  reloadAuth(auth: { authorizedKeys?: Buffer, certAuthority?: Buffer }): void;
+  reloadForwarding(policy: ForwardingPolicyConfig): void;
+  reloadAll(config: DynamicConfig): void;
+}
+
+interface ForwardingPolicyConfig {
+  default: 'allow' | 'deny';
+  rules: ForwardingRuleConfig[];
+}
+
+interface ForwardingRuleConfig {
+  target: string;      // "localhost:*", "10.0.0.0/8:80", "wraith-*"
+  action: 'allow' | 'deny';
+  principals?: string[];  // default ["*"]
+}
+```
+
+The hub calls `server.reloadAuth(...)` after writing to `peer_credentials`.
+The NAPI layer parses the key data and constructs a new `DynamicConfig`, then
+calls the `ConfigReloadHandle`.
+
+### Client Configuration
+
+Client configuration is almost entirely static (which server to connect to,
+which key to use). The only potential dynamic config is key rotation, which is
+less urgent because clients don't serve. For now, client configuration stays
+as `ConnectOptions` — no `ArcSwap` needed.
+
+A config file for client connections could define named profiles:
+
+```toml
+[profiles.production]
+server = "hub.alk.dev:443"
+transport = "tls"
+identity = "/home/user/.ssh/id_ed25519"
+
+[profiles.staging]
+server = "staging.alk.dev:22"
+transport = "tcp"
+identity = "/home/user/.ssh/staging_key"
+```
+
+This is a convenience layer on top of `ConnectOptions`, not a replacement.
+
+### CLI vs Programmatic Behavior
+
+| Interface | Static config | Dynamic config | Reload mechanism |
+|---|---|---|---|
+| CLI | Flags + optional `--config` file | Loaded at startup from `--authorized-keys` | None (restart to change) |
+| Core Rust | `StaticConfig` struct | `ArcSwap<DynamicConfig>` | `ConfigReloadHandle::reload()` |
+| NAPI | `serve()` options | Same `ArcSwap` | `server.reloadAuth()`, `server.reloadForwarding()` |
+
+The CLI doesn't need a reload mechanism. When you're running wraith from the
+command line, restarting is fine. The reload mechanism exists for programmatic
+consumers that manage credentials in a database.
+
+### Multi-Transport Listeners
+
+A host may want to accept connections on multiple transports simultaneously:
+
+- TCP on port 22 (simple, direct SSH)
+- TLS on port 443 (stealth mode, corporate firewalls)
+- iroh QUIC (P2P, no port forwarding needed)
+- WebTransport on port 443 (browser clients, shares the HTTP/3 listener)
+
+Currently `ServeTransportMode` is a single enum and `Server::run()` takes one
+acceptor. To serve multiple transports, the architecture needs to change.
+
+**Option A: `Server` manages multiple listeners internally.**
+
+```rust
+pub struct Server {
+    // Shared state (one copy, shared across all listeners)
+    config: Arc<server::Config>,
+    dynamic_config: Arc<ArcSwap<DynamicConfig>>,
+    connection_limiter: Arc<ConnectionRateLimiter>,
+    outbound_proxy: Option<ProxyConfig>,
+    sessions: Arc<tokio::sync::Mutex<Vec<ActiveSession>>>,
+    shutdown_tx: tokio::sync::watch::Sender<bool>,
+    shutdown_rx: tokio::sync::watch::Receiver<bool>,
+
+    // Per-listener state
+    listeners: Vec<ListenerConfig>,
+}
+
+pub struct ListenerConfig {
+    transport: ServeTransportMode,
+    listen_addr: SocketAddr,
+    stealth: bool,
+    // Transport-specific config (TLS cert, iroh relay, etc.)
+    tls: Option<TlsConfig>,
+    iroh: Option<IrohConfig>,
+}
+```
+
+`Server::run()` spawns one accept loop per `ListenerConfig`. Each loop
+constructs its own acceptor and `ServerHandler` (with the appropriate
+`TransportKind` tag), but shares the auth config, connection limiter, and
+session list. Shutdown signal goes to all loops.
+
+**Option B: Caller manages multiple `Server` instances.**
+
+The caller creates N `Server` objects, each with its own transport. They share
+`Arc<ArcSwap<DynamicConfig>>` and `Arc<ConnectionRateLimiter>` explicitly.
+
+Option A is better because: shared shutdown, shared session tracking, single
+point for config reload. Option B puts coordination burden on the caller and
+makes graceful shutdown harder (N independent shutdown channels).
+
+**The TLS + WebTransport coexistence question.** Both TLS and WebTransport
+use port 443. WebTransport is HTTP/3 (QUIC), TLS on port 443 is typically
+TCP+TLS. They can share the port because they're different protocols — QUIC
+is UDP, TLS-over-TCP is TCP. The kernel routes by protocol. But if both are
+on 443, the stealth mode protocol detector needs to handle HTTP/3 as well:
+
+```
+Port 443:
+  TCP connection → TLS handshake → SSH (existing)
+  UDP "connection" → QUIC handshake → WebTransport → stream proxy
+```
+
+This is similar to how iroh-live-relay works: HTTP/3 listener accepts
+WebTransport sessions, each session opens bidirectional streams that map to
+internal services.
+
+**Config file for multi-transport:**
+
+```toml
+[[listeners]]
+transport = "tls"
+listen = "0.0.0.0:443"
+stealth = true
+
+[listeners.tls]
+cert = "/etc/wraith/tls/cert.pem"
+key = "/etc/wraith/tls/key.pem"
+
+[[listeners]]
+transport = "tcp"
+listen = "0.0.0.0:22"
+
+[[listeners]]
+transport = "iroh"
+iroh_relay = "https://relay.alk.dev"
+
+[[listeners]]
+transport = "webtransport"
+listen = "0.0.0.0:443"
+# WebTransport shares port 443 with TLS because QUIC is UDP, TLS is TCP
+
+[listeners.webtransport]
+cert = "/etc/wraith/tls/cert.pem"
+key = "/etc/wraith/tls/key.pem"
+```
+
+The `[[listeners]]` array-of-tables pattern means each listener is an
+independent config block. The `[auth]`, `[forwarding]`, and `[server]`
+sections at the top level are shared — they apply to all listeners.
+
+**NAPI multi-transport:**
+
+```typescript
+const server = await serve({
+  listeners: [
+    { transport: 'tls', listen: '0.0.0.0:443', stealth: true, tlsCert: '...', tlsKey: '...' },
+    { transport: 'tcp', listen: '0.0.0.0:22' },
+    { transport: 'iroh', irohRelay: 'https://relay.alk.dev' },
+  ],
+  hostKey: hostKeyBuffer,
+  authorizedKeys: keysBuffer,
+});
+```
+
+Single `WraithServer` object, single `reloadAuth()` call affects all
+listeners.
+
+### Transport Kind and WebTransport
+
+The `TransportKind` enum (currently `Tcp | Tls | Iroh`) tags each connection
+so the handler can behave differently per transport. Adding `WebTransport` to
+this enum is straightforward — WebTransport connections are identifiable at
+accept time. The handler behavior is the same (port forwarding only), but
+the tag enables transport-specific logging and future policy differences
+(e.g., WebTransport clients can only access `wraith-*` control channels).
+
+## Proposed Solution
+
+### Phase 1: Static/Dynamic Split
+
+1. Introduce `StaticConfig` and `DynamicConfig` structs
+2. Replace `Arc<ServerAuthConfig>` in `ServerHandler` with
+   `Arc<ArcSwap<DynamicConfig>>`
+3. Add `ConfigReloadHandle` with `reload(DynamicConfig)` method
+4. Expose `reloadAuth()` on the NAPI `WraithServer` object
+
+**Scope**: `wraith-core` auth module + `wraith-napi` serve module
+
+**Risk**: Low — internal refactor, no protocol changes
+
+### Phase 2: Forwarding Policy
+
+1. Add `ForwardingPolicy` to `DynamicConfig`
+2. Add policy check to `channel_open_direct_tcpip` before proxy spawn
+3. Expose `reloadForwarding()` on NAPI `WraithServer`
+
+**Scope**: `wraith-core` handler + `wraith-napi`
+
+**Risk**: Low — new check, default-allow preserves current behavior
+
+### Phase 3: Config File
+
+1. Add `--config <path>` CLI flag parsing TOML
+2. CLI flags override config file values (same precedence as cargo)
+3. Config file only covers static config + initial auth config path
+4. Add `serde` derive to `StaticConfig`
+
+**Scope**: `wraith-cli` (new binary crate) + `wraith-core` config module
+
+**Risk**: Medium — new dependency (`toml` crate), new CLI surface to validate
+
+### Phase 4: Client Profiles
+
+1. Add `[profiles]` section to client config file
+2. `--profile production` loads named profile
+3. CLI flags override profile values
+
+**Scope**: `wraith-cli`
+
+**Risk**: Low — convenience layer only
+
+### Phase 5: Multi-Transport Listeners
+
+1. Change `ServeTransportMode` from single enum to `Vec<ListenerConfig>`
+2. `Server::run()` spawns one accept loop per listener, sharing `DynamicConfig`
+3. Single shutdown signal drains all listeners
+4. Add `[[listeners]]` to config file format
+5. NAPI `serve()` accepts `listeners` array instead of single `transport`
+6. Add `WebTransport` to `TransportKind` enum (initially as a tag only;
+   actual WebTransport acceptor is a separate R&D phase)
+
+**Scope**: `wraith-core` serve.rs + `wraith-napi` + `wraith-cli`
+
+**Risk**: Medium — changes the primary API surface of `serve()`. Backwards
+compat via accepting both `transport: string` (single) and
+`listeners: array` (multi) in NAPI.
+
+## Open Questions
+
+- **OQ-CFG-01**: Should forwarding rules support per-user scope derived from
+  the authenticated key's metadata (e.g., `peer_credentials.metadata.scopes`)?
+  Or is a global rules table with principal matching sufficient?
+
+  Global rules with principal matching is simpler and covers most cases. Per-user
+  scope derived from certificates is more granular but requires the server to
+  maintain a mapping from key fingerprint to scope. This mapping comes from the
+  hub's database, not from the SSH protocol. Phase 2 starts with global rules;
+  per-user scope can be added as an extension.
+
+- **OQ-CFG-02**: Should the config file watch for changes and auto-reload?
+
+  No. File watching is a potential attack vector (symlink races, inotify
+  limitations on network filesystems). The CLI loads once at startup. The NAPI
+  layer reloads explicitly. This is the right model for a security-sensitive
+  tool.
+
+- **OQ-CFG-03**: Should `ArcSwap` be the reload primitive, or is `RwLock`
+  sufficient?
+
+  `ArcSwap` is the standard pattern for this in Rust network services
+  (`arc-swap` crate). It provides lock-free reads (the hot path) and atomic
+  writes. `RwLock` would also work but adds lock contention on reads. The
+  `arc-swap` dependency is small (~500 lines) and well-maintained. Prefer it.
+
+- **OQ-CFG-04**: Should TLS and WebTransport on the same port share a single
+  QUIC listener (like iroh Router's ALPN dispatch), or run as separate
+  listeners on the same port?
+
+  They can't conflict because QUIC is UDP and TLS-over-TCP is TCP — the
+  kernel routes by protocol, not by port number. They're naturally separate
+  listeners even on the same port. However, if iroh is also running on the
+  same host, the iroh endpoint already owns a QUIC listener. The WebTransport
+  listener needs its own. Options: (a) share the iroh endpoint's QUIC listener
+  with ALPN dispatch (reuses `from_endpoint` pattern), (b) separate QUIC
+  listeners on different ports, (c) bind both to 443/UDP — possible if
+  `SO_REUSEPORT` is used. Needs R&D; defer to WebTransport transport design
+  session.
+
+  **Update**: WebTransport is out of scope for the current configuration
+  work. It requires a fundamentally different authentication model (HTTP-level
+  API keys/session tokens vs SSH key-based auth). The `ServerHandler` only
+  knows SSH `auth_publickey`. WebTransport auth would need its own handler
+  path. This connects to the broader question of whether `DynamicConfig.auth`
+  should be transport-aware (see OQ-CFG-06). WebTransport transport design
+  is a separate R&D session.
+
+- **OQ-CFG-05**: Does `TransportKind::WebTransport` need any handler behavior
+  different from other transports?
+
+  Initially no — all transports get the same port-forwarding-only handler.
+  But WebTransport connections come from browsers, which have different trust
+  assumptions. A future forwarding policy might restrict WebTransport clients
+  to `wraith-*` control channels only (no arbitrary host:port forwarding).
+  This is a policy question, not a transport question. The `TransportKind` tag
+  on the handler enables transport-aware policy rules in `ForwardingPolicy`
+  without changing the handler. Defer to Phase 2 (forwarding policy design).
+
+- **OQ-CFG-06**: Should the auth layer be transport-aware?
+
+  Currently `DynamicConfig.auth` is `ServerAuthConfig` — SSH keys and CAs
+  only. This works for SSH over any transport (TCP, TLS, iroh) because SSH
+  carries its own auth protocol. But non-SSH transports (WebTransport,
+  WebSocket) use HTTP-level authentication (API keys, session tokens in
+  headers/query params). The auth question is: does the same `DynamicConfig`
+  serve both models, or does each transport carry its own auth config?
+
+  Option A: `AuthPolicy` contains both SSH auth and API key auth:
+  ```rust
+  pub struct AuthPolicy {
+      ssh: SshAuthConfig,           // for SSH-over-any-transport
+      api_keys: Option<ApiKeysConfig>,  // for non-SSH transports
+  }
+  ```
+
+  Option B: Auth is per-listener. Each `ListenerConfig` carries its own auth
+  config appropriate to its transport.
+
+  Option A is simpler for the initial implementation — the SSH auth path is
+  unchanged, and API key auth is additive. Option B is more flexible but
+  duplicates the shared auth state (keys should be reloadable once, not per
+  listener).
+
+  For now, the config architecture should accommodate Option A as a future
+  extension. Phase 1 implements `DynamicConfig` with SSH auth only. API key
+  auth is added when a non-SSH transport is implemented.
+
+## Decisions Required
+
+These decisions will be extracted into ADRs when the architecture is finalized:
+
+1. **ADR-020**: Static/dynamic config split, `ArcSwap<DynamicConfig>` for
+  hot-reloadable auth and forwarding policy. Supersedes ADR-011's "no config
+  file" — adds optional config file while preserving programmatic-first API.
+
+2. **ADR-021**: Forwarding policy with rule-based allow/deny. Default-allow
+  preserves current behavior during migration; default-deny for production
+  deployments.
+
+3. **ADR-022**: Multi-transport listeners. `Server` spawns multiple accept
+  loops sharing auth config, session state, and shutdown. Replaces single
+  `ServeTransportMode` with `Vec<ListenerConfig>`.
+
+## References
+
+- [ADR-011](../architecture/decisions/011-no-ssh-config-programmatic-api.md) — Programmatic-first API (superseded by ADR-020)
+- [ADR-012](../architecture/decisions/012-auth-ed25519-and-cert-authority.md) — Auth key format
+- [ADR-018](../architecture/decisions/018-control-channel-for-pubsub.md) — Control channel routing
+- `server/handler.rs` — Current `Arc<ServerAuthConfig>` usage
+- `server/serve.rs` — Current single-transport `Server::run()` accept loop
+- `auth/server_auth.rs` — `ServerAuthConfig` struct
+- `auth/keys.rs` — `KeySource` and key loading
+- `@alkdev/storage/docs/architecture/sqlite-host.md` — `peer_credentials` table schema
+- [wtransport](https://github.com/BiagioFesta/wtransport) — Rust WebTransport library (in `/workspace/wtransport`)
+- [arc-swap crate](https://docs.rs/arc-swap) — Lock-free read, atomic write for shared state