greenfield: clean slate for ALPN-as-service pivot

Delete old source crates (alknet-core, alknet, alknet-napi), old
architecture docs (ADRs, specs, open questions), old research docs
(phase2, event-sourcing, feasibility, etc.), old tasks, and obsolete
reference material (gitserver/MPL, honker, nats, rustfs, polyglot,
keystone, distributed-identity).

Keep: alknet-secret (standalone, compiles), pivot docs, iroh and ssh
references, rudolfs reference (MIT/Apache, fork candidate), ops docs,
sdd_process.md, and licenses.

Previous implementation preserved at /workspace/@alkdev/alknet-main/
for reference during porting.

Workspace compiles: cargo check + 14 tests pass for alknet-secret.
2026-06-15 12:08:08 +00:00
parent d003a4f4ec
commit b5a4600d74
261 changed files with 138 additions and 53794 deletions

@@ -1,26 +0,0 @@
# ADR-001: Pluggable Transport via AsyncRead+AsyncWrite Trait
## Status
Accepted
## Context
Alknet needs to support multiple transport modes (TCP, TLS, iroh) for SSH sessions. Each mode has different connection establishment logic but produces the same result: a bidirectional byte stream. Without an abstraction, each transport would need its own SSH connection code path.
russh's `client::connect_stream()` and `server::run_stream()` both accept `AsyncRead + AsyncWrite + Unpin + Send`, meaning SSH is already transport-agnostic at the API level. The design question is whether to enshrine this in alknet's own type system or handle each transport case-by-case.
## Decision
Define a `Transport` trait that produces `AsyncRead + AsyncWrite + Unpin + Send` streams. Each transport (TCP, TLS, iroh) implements this trait. The SSH layer calls `transport.connect()` and passes the result to `russh::client::connect_stream()`.
On the server side, define a `TransportAcceptor` trait that produces incoming streams. Each acceptor (TCP listener, TLS listener, iroh endpoint) implements this trait. The server calls `acceptor.accept()` and passes the result to `russh::server::run_stream()`.
This makes adding a new transport (e.g., WebSocket, QUIC directly) a matter of implementing the trait, not modifying SSH code.
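The decision above can be sketched with the standard library alone. This is a minimal illustration, not the project's code: it uses synchronous `std::io::Read + Write` as a stand-in for `AsyncRead + AsyncWrite`, and the names `Duplex`, `MockTransport`, and `read_banner` are hypothetical. It shows the two properties the ADR relies on: the protocol layer sees only a duplex stream, and a mock transport can hand out in-memory streams.

```rust
use std::io::{self, Cursor, Read, Write};

/// Hypothetical helper trait: any duplex byte stream.
/// `Read + Write` stands in for `AsyncRead + AsyncWrite` in this sketch.
pub trait Duplex: Read + Write + Send {}
impl<T: Read + Write + Send> Duplex for T {}

/// Assumed shape of the client-side abstraction (object-safe on purpose,
/// so a transport can be picked at runtime behind `&dyn Transport`).
pub trait Transport {
    fn name(&self) -> &'static str;
    fn connect(&self) -> io::Result<Box<dyn Duplex>>;
}

/// Mock transport for tests: returns an in-memory stream with canned bytes.
pub struct MockTransport {
    pub canned: Vec<u8>,
}

impl Transport for MockTransport {
    fn name(&self) -> &'static str {
        "mock"
    }

    fn connect(&self) -> io::Result<Box<dyn Duplex>> {
        Ok(Box::new(Cursor::new(self.canned.clone())))
    }
}

/// Stand-in for the SSH layer: it never learns which transport it runs on.
pub fn read_banner(transport: &dyn Transport) -> io::Result<String> {
    let mut stream = transport.connect()?;
    let mut banner = String::new();
    stream.read_to_string(&mut banner)?;
    Ok(banner)
}
```

In the real design the returned stream would be passed to russh's `connect_stream()` rather than read directly; that call is omitted here to keep the sketch dependency-free.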
## Consequences
- **Positive**: Clean separation between transport and protocol. Adding transports is additive. SSH code is transport-agnostic.
- **Positive**: Testing is simplified — mock transports can produce in-memory streams.
- **Negative**: Slight indirection for the single-transport case (just TCP). The trait boilerplate is minimal though.
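- **Negative**: The trait must be object-safe if we want dynamic dispatch. Using `impl Trait` in function signatures avoids this but limits runtime transport selection. CLI-selected transport needs dynamic dispatch, roughly `Box<dyn Transport<Stream = Box<dyn DuplexStream>>>`, where `DuplexStream` is a helper trait (name illustrative) bundling `AsyncRead + AsyncWrite + Unpin + Send`; Rust does not accept two non-auto traits such as `dyn AsyncRead + AsyncWrite` directly in one trait object.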
## References
- [transport.md](../transport.md)
- [Feasibility assessment §3](../../research/feasibility/ssh-tunnel-vpn-alternative-feasibility.md)

@@ -1,30 +0,0 @@
# ADR-002: TUN Shim as Separate Process
## Status
Superseded by ADR-014
## Context
TUN interface creation requires root privileges or `CAP_NET_ADMIN` on Linux, Administrator on Windows, or platform-specific VPN APIs on macOS/iOS/Android. If the core alknet binary required these privileges, the attack surface of root-required code would include the entire SSH implementation, key handling, and transport negotiation.
The primary use cases (SOCKS5 proxy, port forwarding) need no privileges at all. Only the "route all traffic through TUN" use case needs root.
## Decision
The TUN functionality is a separate `alknet-tun` binary that:
1. Creates a TUN device (requires root / CAP_NET_ADMIN)
2. Reads IP packets from it
3. Forwards each connection to the core alknet's SOCKS5 port (127.0.0.1:1080)
4. Proxies bytes between TUN packets and SOCKS5 connections
The core `alknet connect` binary never needs root. The `alknet-tun` binary is ~200-500 lines and does nothing except TUN ↔ SOCKS5 forwarding.
## Consequences
- **Positive**: Root-required code surface is tiny and auditable.
- **Positive**: Core binary runs unprivileged. SOCKS5 and port forwarding work without any special permissions.
- **Positive**: TUN process can crash without affecting the SSH session (it just reconnects to SOCKS5).
- **Positive**: Matches the proven tun2proxy architecture.
- **Negative**: Two processes to manage instead of one. Requires process supervision (systemd, etc.).
- **Negative**: SOCKS5 adds a small latency overhead vs. direct TUN → SSH packet routing. This is acceptable for the security benefit.
## References
- [tun-shim.md](../tun-shim.md)
- [tun2proxy](https://github.com/tun2proxy/tun2proxy) — proven architecture for TUN → SOCKS5 proxy

@@ -1,31 +0,0 @@
# ADR-003: iroh Stream via tokio::io::join
## Status
Accepted
## Context
iroh's QUIC implementation provides separate `RecvStream` (implements `AsyncRead`) and `SendStream` (implements `AsyncWrite`) for each bidirectional channel opened via `open_bi()` / `accept_bi()`. russh's `connect_stream()` and `run_stream()` require a single type implementing both `AsyncRead` and `AsyncWrite`.
Options considered:
1. `tokio::io::join(recv, send)` — Combines the two halves into `Join<RecvStream, SendStream>` which implements both traits.
2. Custom `IrohStream` wrapper — A struct with `recv` and `send` fields that delegates `AsyncRead` to `recv` and `AsyncWrite` to `send`.
3. Using iroh's `Connection` directly — Opening a new `open_bi()` for each SSH channel instead of running SSH over a single stream.
## Decision
Use `tokio::io::join(recv_stream, send_stream)` (Option 1).
One line of code, correct trait implementations, no custom types needed. The `Join<A, B>` type implements `AsyncRead` using `A` and `AsyncWrite` using `B`, which maps directly to iroh's split stream model.
If profiling later shows overhead (unlikely — it's just method dispatch), we can switch to a custom wrapper. But YAGNI until demonstrated.
Option 3 was rejected because it would require modifying russh to understand iroh connections. The whole point of the transport trait is that SSH doesn't know about iroh.
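To illustrate what the chosen combinator does, here is a synchronous analogue written against `std::io`. It is a sketch under stated assumptions: the real code would call `tokio::io::join(recv_stream, send_stream)`, whereas the `Join` struct and `join` function below are re-implemented locally for illustration, and `into_inner` is a hypothetical helper added only so the written bytes can be inspected.

```rust
use std::io::{self, Read, Write};

/// Local, synchronous stand-in for tokio's `Join<R, W>`:
/// reads are served by `reader`, writes by `writer`.
pub struct Join<R, W> {
    reader: R,
    writer: W,
}

pub fn join<R: Read, W: Write>(reader: R, writer: W) -> Join<R, W> {
    Join { reader, writer }
}

impl<R: Read, W> Read for Join<R, W> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // Only the read half is touched here.
        self.reader.read(buf)
    }
}

impl<R, W: Write> Write for Join<R, W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        // Only the write half is touched here.
        self.writer.write(buf)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.writer.flush()
    }
}

impl<R, W> Join<R, W> {
    /// Hypothetical helper for tests: split back into the two halves.
    pub fn into_inner(self) -> (R, W) {
        (self.reader, self.writer)
    }
}
```

Because each trait impl delegates to exactly one field, the two directions stay independent, which is the property the ADR cites for mapping onto iroh's split `RecvStream` / `SendStream` model.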
## Consequences
- **Positive**: Minimal code. One line to bridge iroh and russh.
- **Positive**: No custom types to maintain.
- **Positive**: Correct `AsyncRead` + `AsyncWrite` behavior — `Poll::Pending` on one half doesn't affect the other.
- **Negative**: None identified. The `Join` type is a standard tokio combinator with well-tested semantics.
## References
- [transport.md](../transport.md)
- [Feasibility assessment §11](../../research/feasibility/ssh-tunnel-vpn-alternative-feasibility.md)

@@ -1,28 +0,0 @@
# ADR-004: SSH Runs Over Transport, Not Alongside
## Status
Accepted
## Context
There are two ways to structure the relationship between SSH and the transport layer:
1. **SSH over transport**: The transport produces one duplex stream. The entire SSH session (handshake, key exchange, channel multiplexing) runs over that single stream via `connect_stream()` / `run_stream()`. SSH has no direct network access.
2. **Transport alongside SSH**: SSH manages its own TCP connections via `connect()` / `run()`. The transport layer is an additional feature that wraps outgoing connections. SSH knows about the network.
## Decision
SSH runs over the transport (Option 1). The SSH layer never opens its own sockets or knows what transport it's on.
This is directly enabled by russh's `connect_stream()` and `run_stream()` APIs, which accept any `AsyncRead+AsyncWrite+Unpin+Send`. SSH's entire interaction with the network goes through the single stream produced by the transport.
## Consequences
- **Positive**: Adding a new transport requires implementing the `Transport` trait, not modifying SSH code.
- **Positive**: Testing is straightforward — mock transports produce in-memory streams.
- **Positive**: Security audit is clean — the SSH implementation has no network-facing code.
- **Positive**: The transport can be layered. Iroh connecting through a SOCKS5 proxy (which itself tunnels through alknet) is just a transport that calls out to a SOCKS5 library before establishing the QUIC connection.
- **Negative**: SSH keepalive and reconnection must be handled at the transport level. If the transport stream dies, the SSH session dies. Reconnection means establishing a new transport + new SSH session. There's no "SSH reconnects over the same transport" — you get a new session.
- **Negative**: Multiple SSH sessions over the same iroh connection require the iroh `Endpoint` (not stream) to be shared between sessions. The transport trait produces one stream per `connect()` call. The iroh `Endpoint` must be created externally and shared. (The `IrohTransport` struct holds an `Arc<Endpoint>`.)
## References
- [transport.md](../transport.md)
- [Feasibility assessment §3.4](../../research/feasibility/ssh-tunnel-vpn-alternative-feasibility.md)

@@ -1,39 +0,0 @@
# ADR-005: SOCKS5 as Primary Interface, TUN as Add-on
## Status
Accepted
## Context
A "VPN-like" tool needs to route traffic. There are three approaches:
1. **TUN only**: Create a TUN interface, route all OS traffic through it. Full VPN experience but requires root.
2. **SOCKS5 only**: Local SOCKS5 proxy. Applications configure proxy settings. No root needed but application support varies.
3. **SOCKS5 primary, TUN add-on**: SOCKS5 is the core interface. TUN forwards to SOCKS5.
## Decision
SOCKS5 is the primary interface. TUN is a separate process that forwards to SOCKS5 (Option 3).
SOCKS5 is the core because:
- It requires no privileges
- `curl --socks5-hostname` works everywhere
- Browsers, most CLI tools, and many applications support SOCKS5
- SOCKS5h prevents DNS leaks by resolving names server-side
- It's the interface that the NAPI wrapper and pubsub adapter build on
- TUN is only needed for "route all traffic" use cases, which are a subset of users
TUN forwards to SOCKS5 rather than directly to SSH because:
- The SOCKS5 code already handles TCP connection establishment and bidirectional proxying
- TUN's job is just IP packet → SOCKS5 connection, not IP packet → SSH channel
- The `alknet-tun` binary stays minimal (~200-500 lines)
- No root code in the core binary
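The SOCKS5h point in the list above comes down to one byte in the request. As a rough sketch (the function name is hypothetical, and this is not taken from the alknet sources), a CONNECT request per RFC 1928 that carries the host as a domain name (ATYP `0x03`) looks like this; because the name travels inside the request, resolution happens on the far side of the proxy.

```rust
/// Build a SOCKS5 CONNECT request (RFC 1928) with a domain-name address.
/// Layout: VER=5, CMD=1 (CONNECT), RSV=0, ATYP=3 (domain), LEN, NAME, PORT (big-endian).
pub fn socks5_connect_request(host: &str, port: u16) -> Result<Vec<u8>, String> {
    // The length prefix is a single byte, so names are limited to 255 bytes.
    if host.is_empty() {
        return Err("empty host name".to_string());
    }
    if host.len() > 255 {
        return Err(format!("host name too long: {} bytes", host.len()));
    }

    let mut req = Vec::with_capacity(7 + host.len());
    req.extend_from_slice(&[0x05, 0x01, 0x00, 0x03, host.len() as u8]);
    req.extend_from_slice(host.as_bytes());
    req.extend_from_slice(&port.to_be_bytes());
    Ok(req)
}
```

The method-negotiation handshake that precedes this request is left out for brevity.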
## Consequences
- **Positive**: Core binary is root-free. TUN functionality is provided by the external `tun2proxy` tool (ADR-014).
- **Positive**: SOCKS5 is testable without TUN — just `curl` against it.
- **Positive**: The TUN approach is validated by tun2proxy, a well-tested existing tool. No custom TUN code to maintain.
- **Negative**: VPN-like behavior requires running `tun2proxy` alongside `alknet connect` — two processes instead of one integrated binary.
- **Negative**: SOCKS5 doesn't capture UDP (except DNS via SOCKS5h). TUN mode via tun2proxy handles this separately.
## References
- [client.md](../client.md)
- [tun-shim.md](../tun-shim.md)

@@ -1,38 +0,0 @@
# ADR-006: No Logging of Tunnel Destinations
## Status
Accepted
## Context
An SSH tunnel server sees every destination that clients connect to — hostnames, IP addresses, port numbers. This is extremely sensitive information. Logging it creates:
- **Privacy risks**: Tunnel destinations reveal what services users access (internal databases, APIs, etc.)
- **Legal concerns**: Server operators may be pressured to produce logs showing what clients accessed
- **Data retention liability**: Stored destination logs are an attack surface (data breaches, subpoenas)
However, the server does need to log some information for operational purposes — particularly for fail2ban integration to detect and block abusive connections.
## Decision
The server does NOT log:
- `channel_open_direct_tcpip` destinations (host, port)
- DNS resolutions performed by the server on behalf of clients
- Bytes transferred through tunnel channels
- Duration or throughput of individual tunnel channels (session-level duration is logged, see below)
The server DOES log (ADR-013):
- Auth attempts (remote_addr, user, key_fingerprint, accept/reject)
- Connection opened (remote_addr, transport kind)
- Connection closed (remote_addr, duration)
This separation ensures fail2ban has enough data to detect abusive IPs while destination privacy is maintained.
## Consequences
- **Positive**: Tunnel destinations are never written to disk or any observable log. This is the same guarantee OpenSSH makes with `LogLevel VERBOSE` or below.
- **Positive**: Reduces legal and privacy exposure for server operators.
- **Positive**: fail2ban can still work — it needs source IPs and auth failures, not destinations.
- **Negative**: Server operators cannot audit what destinations clients are accessing. If an operator needs this for compliance, they must implement it outside alknet (e.g., network-level logging at the target host).
- **Negative**: Debugging connectivity issues is harder without destination logs. Mitigated by client-side logging (the client knows what it's connecting to).
## References
- [server.md](../server.md)
- [ADR-013](013-fail2ban-friendly-logging.md) — what the server does log

@@ -1,26 +0,0 @@
# ADR-007: NAPI Exposes Single Duplex Stream
## Status
Accepted
## Context
The NAPI wrapper for alknet could expose different granularity levels:
1. **Full SSH API**: Expose channel multiplexing, `open_direct_tcpip`, `tcpip_forward`, session management. The TypeScript layer would manage channels.
2. **Single duplex stream**: The NAPI wrapper establishes one SSH channel and returns it as a Node.js `Duplex` stream. TypeScript multiplexing (if needed) happens at the pubsub layer.
## Decision
Option 2: NAPI exposes a single duplex stream.
The NAPI wrapper's job is to get a reliable, authenticated byte stream from A to B. It handles transport (TCP/TLS/iroh), SSH authentication, and channel setup, then hands the caller a single `Duplex` stream that just works.
If the TypeScript consumer needs multiplexing (e.g., multiple concurrent tool calls over operations), pubsub handles that at the `EventEnvelope` level. Multiple `call.requested` / `call.responded` events flow over the same stream, distinguished by their `id` fields. This is how the existing WebSocket adapter works.
## Consequences
- **Positive**: Minimal NAPI surface — one function, one return type. Small binary, small FFI boundary.
- **Positive**: The TypeScript side doesn't need to understand SSH at all. It gets a stream and sends/receives `EventEnvelope` JSON.
- **Positive**: No need to expose russh types in NAPI. The SSH complexity stays in Rust.
- **Negative**: If a consumer wants multiple isolated channels (e.g., one for events, one for file transfer), they'd need multiple `connect()` calls (multiple SSH sessions). This is acceptable for the expected use case (pubsub events over a single stream).
## References
- [napi-and-pubsub.md](../napi-and-pubsub.md)

@@ -1,38 +0,0 @@
# ADR-008: ACME/Let's Encrypt Certificate Provisioning
## Status
Accepted
## Context
TLS transport mode requires certificates. Manual certificate management is error-prone — users need to obtain, install, and renew certificates. Our production setup uses certbot with Let's Encrypt (documented in [certbot.md](../../research/ops/certbot.md)), which automates this via the ACME protocol.
There are two ACME flows:
1. **Domain-based**: Standard flow with DNS-01 or HTTP-01 challenge. Certificate is tied to a domain name, auto-renews via certbot/systemd timer. Requires port 80 or DNS access for challenges.
2. **IP-based**: Short-lived certificates via TLS-ALPN-01 challenge on port 443. No domain needed, but cert is short-lived (days, not months). Simpler for quick setups but requires the ACME client to run continuously.
Both flows are important for alknet's usability. Without ACME, TLS mode requires manual cert setup — a significant barrier for users who want "SSH over port 443" for censorship resistance.
## Decision
Support both ACME certificate provisioning paths:
1. **Domain-based ACME** (`--acme-domain <domain>`): Standard certbot-style flow. Certificate is domain-bound, auto-renews. The server runs a challenge responder (HTTP-01 on port 80 or TLS-ALPN-01 on port 443) during certificate issuance/renewal.
2. **IP-based ACME**: Short-lived certs for servers without a domain. Uses TLS-ALPN-01 challenge on port 443. Lower burden but certs expire frequently.
3. **Manual certs** (`--tls-cert` / `--tls-key`): Always supported for users with existing certificates or specific PKI setups.
The implementation should use the `rustls-acme` crate (or similar pure-Rust ACME client) to avoid an external certbot dependency. This keeps alknet self-contained as a single binary.
## Consequences
- **Positive**: Users can run `alknet serve --transport tls --acme-domain example.com` and get working TLS with zero manual cert management.
- **Positive**: IP-based ACME covers the quick-setup case without requiring a domain.
- **Positive**: Consistent with our production infrastructure (certbot + Let's Encrypt is already our standard).
- **Negative**: ACME adds complexity to the server binary (challenge responder, cert store, renewal timer).
- **Negative**: IP-based short-lived certs require more frequent renewal handling.
- **Negative**: Binary size increases with ACME support (rustls-acme dependency). Consider making ACME a feature flag (`acme`).
## References
- [server.md](../server.md)
- [OQ-01](../open-questions.md) — resolved by this ADR
- [OQ-07](../open-questions.md) — resolved by this ADR
- Production certbot setup: [certbot.md](../../research/ops/certbot.md)

@@ -1,28 +0,0 @@
# ADR-009: Default iroh Relay with Override
## Status
Accepted
## Context
iroh requires a relay server for NAT traversal and initial connection establishment. The n0 project provides free relay servers (`https://relay.iroh.network/`) that work out of the box. However, relying on a third-party service creates a dependency:
- n0's relay could change terms, rate-limit, or go down
- Production deployments may want self-hosted relays for reliability and privacy
- The relay URL is a configuration point that should be explicit
Conversely, requiring users to set up a relay server before they can use iroh transport is a significant friction point for testing and quick starts.
## Decision
Default to n0's relay servers. Allow override via `--iroh-relay <url>` CLI flag. Document self-hosted relay setup in project documentation.
This matches iroh's own defaults — n0's relay is the standard starting point. Users who need production reliability self-host.
## Consequences
- **Positive**: Zero-config iroh transport for testing and development. `alknet serve --transport iroh` just works.
- **Positive**: Self-hosting is a single flag override, not a complex setup requirement.
- **Negative**: Default depends on n0's infrastructure. If n0's relay is down, default iroh connections fail (but this is the same experience as every iroh user).
- **Negative**: Privacy-conscious users must remember to `--iroh-relay` to avoid n0. Mitigated by documentation.
## References
- [transport.md](../transport.md)
- [OQ-02](../open-questions.md) — resolved by this ADR

@@ -1,33 +0,0 @@
# ADR-010: Transport Chaining in CLI
## Status
Accepted
## Context
Transport chaining allows combining iroh with an upstream proxy, e.g.:
```bash
alknet connect --transport iroh --proxy socks5://127.0.0.1:1080
```
This routes iroh's outbound TCP connections through a SOCKS5 proxy, which could itself be another alknet instance. This is important for:
- Nested tunnel topologies
- Environments where iroh needs to go through an existing proxy
- Composing transports in flexible ways
iroh's `Endpoint::builder` supports proxy configuration natively. The implementation is straightforward — pass the proxy URL to iroh's builder.
## Decision
Support `--transport iroh --proxy socks5://...` natively in the CLI. This works because iroh's endpoint builder accepts a proxy configuration, so the implementation is minimal: parse the proxy URL and pass it to the endpoint builder.
For other transport combinations (TCP+TLS is already implicit — TLS wraps TCP), the `--proxy` flag applies to outbound connections from the SSH client or iroh endpoint.
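The "parse the proxy URL" half of this decision could look roughly like the following. It is a minimal sketch with a hypothetical function name; it accepts only IP-literal addresses and does not show the hand-off to iroh's endpoint builder, since that call is not specified here.

```rust
use std::net::SocketAddr;

/// Parse a `socks5://ip:port` URL into a socket address.
/// Assumption of this sketch: host names are rejected, only IP literals parse.
pub fn parse_socks5_proxy(url: &str) -> Result<SocketAddr, String> {
    let rest = url
        .strip_prefix("socks5://")
        .ok_or_else(|| format!("unsupported proxy scheme in {:?}", url))?;

    // Tolerate a trailing slash, as in `socks5://127.0.0.1:1080/`.
    let authority = rest.trim_end_matches('/');

    authority
        .parse::<SocketAddr>()
        .map_err(|e| format!("invalid proxy address {:?}: {}", authority, e))
}
```

A production parser would more likely use a URL library and support credentials and host names; the sketch only shows where validation errors would surface before any connection is attempted.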
## Consequences
- **Positive**: Flexible transport composition without requiring separate manual configuration.
- **Positive**: Matches user expectation from the overview doc's transport chaining example.
- **Positive**: Implementation is minimal — iroh already supports proxy config.
- **Negative**: Slightly more CLI surface area (`--proxy` interaction with `--transport`).
## References
- [transport.md](../transport.md)
- [OQ-05](../open-questions.md) — resolved by this ADR

@@ -1,38 +0,0 @@
# ADR-011: Programmatic-First API, No File-Based Config
## Status
Accepted
## Context
The client and server both need configuration (host addresses, keys, transport options, etc.). There are several approaches:
1. **Read `~/.ssh/config`**: Parse OpenSSH config for default host/key/port. Reduces CLI verbosity for frequent connections.
2. **Custom config file**: Alknet-specific config file (TOML/YAML) with host definitions.
3. **Programmatic API only**: Configuration comes from CLI flags or the library API. No file parsing. `~/.ssh/` path conventions are cross-platform trouble (`~` expansion, Windows paths, etc.).
4. **Hybrid**: `--config` flag pointing to an alknet-specific config file, but no OpenSSH config parsing.
## Decision
Option 3: Programmatic-first API. Configuration is provided via:
- **CLI**: explicit flags (`--server`, `--identity`, `--transport`, etc.)
- **Library API**: `alknet_core::client::ConnectOptions` and `alknet_core::server::ServeOptions` structs, constructable programmatically
- **Environment variables**: for a few convenience defaults (e.g., `ALKNET_SERVER`, `ALKNET_IDENTITY`)
No `~/.ssh/config` parsing, no alknet-specific config files. This approach:
- Avoids cross-platform path issues (`~` expansion, Windows `USERPROFILE`, etc.)
- Makes the library API clean and straightforward for programmatic consumers (NAPI wrapper, pubsub)
- Keeps the CLI simple and explicit — no hidden behavior from config files
- Matches the design principle that the library crate (`alknet-core`) is the primary interface
If users want config-file behavior in the future, it can be added as a separate layer that populates the options structs. But the core doesn't need to know about files.
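As a sketch of what "plain structs plus a thin layer that populates them" might mean in practice: the struct and variable names `ConnectOptions`, `ALKNET_SERVER`, and `ALKNET_IDENTITY` come from the text above, while the fields, the `TransportKind` enum, the `resolve` function, and the rule that explicit flags win over environment variables are assumptions made for illustration.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum TransportKind {
    Tcp,
    Tls,
    Iroh,
}

/// Hypothetical field layout; the real struct is not shown in this ADR.
#[derive(Debug, Clone, PartialEq)]
pub struct ConnectOptions {
    pub server: String,
    pub identity: Option<String>,
    pub transport: TransportKind,
}

impl ConnectOptions {
    /// Assumed precedence: explicit values (CLI flags) first, then the
    /// environment. `env` is injected so no global state is read here.
    pub fn resolve(
        server_flag: Option<String>,
        identity_flag: Option<String>,
        transport: TransportKind,
        env: impl Fn(&str) -> Option<String>,
    ) -> Result<Self, String> {
        let server = server_flag
            .or_else(|| env("ALKNET_SERVER"))
            .ok_or_else(|| "no server: pass --server or set ALKNET_SERVER".to_string())?;
        let identity = identity_flag.or_else(|| env("ALKNET_IDENTITY"));
        Ok(ConnectOptions { server, identity, transport })
    }
}
```

A CLI would pass `|k| std::env::var(k).ok()` as `env`, while a library consumer such as the NAPI wrapper could construct `ConnectOptions` directly and skip `resolve` entirely. A future config-file layer would be just another caller that fills in the same struct.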
## Consequences
- **Positive**: Clean library API — `ConnectOptions` and `ServeOptions` are plain Rust structs.
- **Positive**: No cross-platform path issues in the core library.
- **Positive**: Explicit CLI — no hidden settings from a config file the user forgot about.
- **Positive**: NAPI wrapper can construct options programmatically without file I/O.
- **Negative**: Users must type full connection flags each time. Mitigated by shell aliases or environment variables.
- **Negative**: No config file convenience. Users accustomed to `~/.ssh/config` may find this inconvenient.
## References
- [client.md](../client.md)
- [OQ-06](../open-questions.md) — resolved by this ADR

@@ -1,42 +0,0 @@
# ADR-012: Ed25519 Keys + OpenSSH Certificate Authority, No Password Auth
## Status
Accepted
## Context
SSH authentication has several options:
- **Ed25519 public key**: The default, already specified. Each user has a keypair; the server has an `authorized_keys` file.
- **Password authentication**: Convenient for quick setups but less secure (susceptible to brute force, credential reuse).
- **OpenSSH certificate authority (cert-authority)**: A CA signs user certificates. The server trusts the CA instead of individual keys. Much easier for multi-user setups — add one CA line to `authorized_keys` instead of every user's public key. Also supports certificate expiry and restrictions.
The question is which auth methods to support and prioritize.
## Decision
**Primary: Ed25519 public key** (already specified, no change).
**Important: OpenSSH certificate authority**. Support `cert-authority` entries in `authorized_keys` files. When a user presents a certificate signed by a trusted CA, the server validates the certificate (signature, expiry, permissions) and accepts it. This is critical for multi-user deployments where managing individual keys is impractical.
**Not supported: password authentication.** Password auth on the local SOCKS5 proxy (i.e., the proxy itself requiring a password) is out of scope for this ADR. Password auth for SSH itself is rejected because:
- It's less secure than key-based auth
- It's susceptible to brute force (fail2ban can mitigate, but keys eliminate the problem)
- It's not needed when cert-authority provides easy multi-user management
- If a local SOCKS5 proxy is desired with its own auth, that's a separate concern
The server's `authorized_keys` file format follows OpenSSH conventions:
- Regular keys: `ssh-ed25519 AAAA... user@host`
- CA trusts: `cert-authority ssh-ed25519 AAAA... CA name`
- CA trusts with extra options: `cert-authority,permit-port-forwarding ssh-ed25519 AAAA... CA name`
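The three line shapes listed above can be told apart with a small classifier. This is a simplified sketch with hypothetical type and function names: it does not validate the key material, does not verify certificates, and does not handle quoted option values that contain spaces or commas.

```rust
#[derive(Debug, PartialEq)]
pub enum KeyEntry {
    /// Plain public key: `ssh-ed25519 AAAA... comment`
    Key { key_type: String, key_b64: String },
    /// CA trust: `cert-authority[,option...] ssh-ed25519 AAAA... comment`
    CertAuthority {
        options: Vec<String>,
        key_type: String,
        key_b64: String,
    },
}

/// Classify one `authorized_keys` line. Returns `None` for blank lines,
/// comments, and anything this sketch does not recognize.
pub fn parse_entry(line: &str) -> Option<KeyEntry> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }

    let mut fields = line.split_whitespace();
    let first = fields.next()?;

    // No options prefix: the first field is already the key type.
    if first.starts_with("ssh-") {
        let key_b64 = fields.next()?.to_string();
        return Some(KeyEntry::Key { key_type: first.to_string(), key_b64 });
    }

    // Otherwise the first field is a comma-separated option list that
    // must contain `cert-authority` for this sketch to accept it.
    let mut options: Vec<String> = first.split(',').map(|s| s.to_string()).collect();
    let pos = options.iter().position(|o| o == "cert-authority")?;
    options.remove(pos);

    let key_type = fields.next()?.to_string();
    let key_b64 = fields.next()?.to_string();
    Some(KeyEntry::CertAuthority { options, key_type, key_b64 })
}
```

Everything after the base64 blob is treated as a free-form comment, which is why `CA name` with a space in it needs no special handling.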
## Consequences
- **Positive**: Multi-user deployments are manageable — one CA entry instead of N key entries.
- **Positive**: Certificates can carry expiry dates and restrictions (permit-port-forwarding, no-pty, source-address).
- **Positive**: No password brute force risk. fail2ban still needed for connection-level abuse, but not for auth-level password guessing.
- **Positive**: `russh` supports OpenSSH certificate verification natively.
- **Negative**: Setting up a CA requires initial key management tooling (`ssh-keygen -s`).
- **Negative**: Users who want a quick "just let me in" experience need to generate keys first. Not a significant barrier for the target audience (self-hosting, ops).
## References
- [client.md](../client.md)
- [server.md](../server.md)
- [OQ-04](../open-questions.md) — resolved by this ADR

@@ -1,39 +0,0 @@
# ADR-013: Fail2ban-Friendly Server Logging
## Status
Accepted
## Context
The server needs to handle abuse on public-facing deployments. Our production infrastructure uses fail2ban on Linux (documented in [fail2ban.md](../../research/ops/fail2ban.md)) with nftables and systemd journal backend. fail2ban needs structured, parseable logs to identify abusive IP addresses.
However, fail2ban is Linux-specific. On other platforms (macOS, Windows, BSD), users need a different approach to reject abusive connections. The server should provide enough logging for fail2ban on Linux and enough built-in protection for other platforms.
## Decision
The server logs connection and authentication events at `INFO` level with structured fields, and provides a configurable connection rate limiter as a built-in defense.
**Logging** (for fail2ban integration on Linux):
- Log auth attempts: `level=INFO, msg="auth attempt", remote_addr=<ip>, user=<user>, key_fingerprint=<sha256>, result=<accept|reject>`
- Log new connections: `level=INFO, msg="connection opened", remote_addr=<ip>, transport=<tcp|tls|iroh>`
- Log disconnections: `level=INFO, msg="connection closed", remote_addr=<ip>, duration=<secs>`
- Do NOT log: channel open targets, DNS resolutions, bytes transferred
This matches what fail2ban needs: source IP + failure indicator. Our existing fail2ban setup filters on similar fields for SSH and nginx.
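One way to keep the "do NOT log" rule from eroding over time is to make the set of loggable events a closed type with no field for destinations. The sketch below assumes hypothetical names (`ServerEvent`, `format_event`) and reproduces the field layout from the list above; it is an illustration, not the project's logging code.

```rust
use std::net::IpAddr;

/// The only events the server may log. There is deliberately no variant
/// (and no field) for channel targets, DNS lookups, or byte counts.
pub enum ServerEvent<'a> {
    AuthAttempt {
        remote_addr: IpAddr,
        user: &'a str,
        key_fingerprint: &'a str,
        accepted: bool,
    },
    ConnectionOpened { remote_addr: IpAddr, transport: &'a str },
    ConnectionClosed { remote_addr: IpAddr, duration_secs: u64 },
}

/// Render an event in the `key=value` layout described above.
pub fn format_event(ev: &ServerEvent) -> String {
    match ev {
        ServerEvent::AuthAttempt { remote_addr, user, key_fingerprint, accepted } => {
            let result = if *accepted { "accept" } else { "reject" };
            format!(
                "level=INFO, msg=\"auth attempt\", remote_addr={}, user={}, key_fingerprint={}, result={}",
                remote_addr, user, key_fingerprint, result
            )
        }
        ServerEvent::ConnectionOpened { remote_addr, transport } => format!(
            "level=INFO, msg=\"connection opened\", remote_addr={}, transport={}",
            remote_addr, transport
        ),
        ServerEvent::ConnectionClosed { remote_addr, duration_secs } => format!(
            "level=INFO, msg=\"connection closed\", remote_addr={}, duration={}",
            remote_addr, duration_secs
        ),
    }
}
```

A fail2ban filter would then match on `msg="auth attempt"` together with `result=reject` and take the host from `remote_addr`. Note that `user` is attacker-controlled input, so a real implementation would need to escape or quote it before writing the line.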
**Built-in rate limiting** (for all platforms):
- `--max-connections-per-ip <n>` (default: 0 = unlimited) — reject new connections from an IP that already has N active connections
- `--max-auth-attempts <n>` (default: 10) — disconnect after N failed auth attempts from one connection
- Rate limiting happens at the SSH layer, before channels are opened
This ensures that even without fail2ban, the server rejects obviously abusive connections.
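The per-IP connection cap could be tracked with a simple counter map. The following is a sketch under assumptions: the type and method names are hypothetical, it is single-threaded (a real server would wrap it in a mutex or similar), and it covers only `--max-connections-per-ip`, not the auth-attempt limit.

```rust
use std::collections::HashMap;
use std::net::IpAddr;

/// Tracks active connections per source IP.
pub struct ConnectionLimiter {
    /// Mirrors `--max-connections-per-ip`; 0 means unlimited.
    max_per_ip: usize,
    active: HashMap<IpAddr, usize>,
}

impl ConnectionLimiter {
    pub fn new(max_per_ip: usize) -> Self {
        ConnectionLimiter { max_per_ip, active: HashMap::new() }
    }

    /// Call when a connection arrives. Returns `false` if it must be rejected.
    pub fn try_acquire(&mut self, ip: IpAddr) -> bool {
        let count = self.active.entry(ip).or_insert(0);
        if self.max_per_ip != 0 && *count >= self.max_per_ip {
            return false;
        }
        *count += 1;
        true
    }

    /// Call when a previously accepted connection closes.
    pub fn release(&mut self, ip: IpAddr) {
        let now_zero = match self.active.get_mut(&ip) {
            Some(count) => {
                *count = count.saturating_sub(1);
                *count == 0
            }
            None => false,
        };
        // Drop idle entries so the map does not grow without bound.
        if now_zero {
            self.active.remove(&ip);
        }
    }
}
```

Pairing every successful `try_acquire` with exactly one `release` (for example via a guard type dropped when the session ends) keeps the counts accurate even when sessions terminate abnormally.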
## Consequences
- **Positive**: fail2ban can parse alknet logs the same way it parses SSH and nginx logs on our production systems.
- **Positive**: Built-in rate limiting provides protection on platforms without fail2ban.
- **Positive**: No privacy-sensitive data in logs (no tunnel destinations).
- **Negative**: Slightly more code in the server for connection tracking per IP.
- **Negative**: Users with custom fail2ban filters need to write regex for alknet's log format (documented examples provided).
## References
- [server.md](../server.md)
- [OQ-08](../open-questions.md) — resolved by this ADR
- Production fail2ban setup: [fail2ban.md](../../research/ops/fail2ban.md)

@@ -1,41 +0,0 @@
# ADR-014: Defer TUN Implementation, Recommend Local SOCKS5 + tun2proxy
## Status
Accepted
## Context
The original plan included a TUN shim (`alknet-tun`) as Phase 3 — a separate root-requiring process that creates a TUN device and forwards IP packets through alknet's SOCKS5 port. This would provide VPN-like "route all traffic" behavior.
However, TUN implementation has significant complexities:
- Platform differences (Linux TUN, macOS utun, Windows wintun.dll)
- TCP reconstruction in userspace (smoltcp or tun2proxy's ip-stack)
- Virtual DNS handling
- Root/CAP_NET_ADMIN requirements
- TUN is easy to get wrong and hard to debug
The core SOCKS5 interface already works for the vast majority of use cases. For users who truly need VPN-like "route all traffic" behavior, `tun2proxy` is an existing, well-tested tool that does exactly this: creates a TUN device and routes traffic through a SOCKS5 proxy.
## Decision
Defer TUN implementation entirely. Remove `alknet-tun` from the architecture. Instead:
1. **Core interface**: alknet's local SOCKS5 proxy (always available, no root required)
2. **VPN-like behavior**: Users who need it run `tun2proxy --proxy socks5://127.0.0.1:1080` alongside `alknet connect`
3. **Documentation**: Recommend tun2proxy in the README/wiki for "route all traffic" use cases
This removes TUN from the project scope while still providing a path to VPN-like behavior. If demand justifies it later, `alknet-tun` can be added as a thin wrapper around tun2proxy's pattern.
The `tun` feature flag and `alknet-tun` binary are removed from the architecture. The `tun-rs` dependency is removed.
## Consequences
- **Positive**: Significantly reduces project scope and complexity. No TUN code to write, test, or maintain across platforms.
- **Positive**: tun2proxy is already well-tested for this exact use case.
- **Positive**: Core binary remains unprivileged. No root code anywhere in the project.
- **Positive**: Cleaner architecture — alknet only does SSH tunneling + SOCKS5. tun2proxy does TUN.
- **Negative**: Users need two tools instead of one for VPN-like behavior. Mitigated by documentation.
- **Negative**: tun2proxy is an external dependency in practice, though it's widely available in package managers.
- **Negative**: No first-class Windows/macOS TUN story. tun2proxy handles these platforms but users need to install it separately.
## References
- [tun-shim.md](../tun-shim.md) — this spec is now deprecated
- [ADR-002](002-tun-separate-process.md) — superseded; TUN is no longer in scope
- [ADR-005](005-socks5-before-tun.md) — SOCKS5 is still the primary interface; TUN forwarding is now external

@@ -1,27 +0,0 @@
# ADR-015: napi-rs for FFI Bridge
## Status
Accepted
## Context
The NAPI wrapper needs a Rust-to-Node.js bridge. Two main options:
1. **napi-rs**: The standard for Rust → Node.js native addons. Mature, well-documented, large ecosystem. Produces `.node` binaries for specific platforms. Good build tooling (`napi` CLI). Used by major projects (swc, rspack, biome).
2. **uniffi**: Mozilla's FFI bridge supporting multiple targets (Python, Swift, Kotlin, Node.js). Broader target reach but less mature for Node.js specifically. The Node.js binding is relatively new.
The primary consumer is TypeScript/Node.js — specifically the `@alkdev/pubsub` event target system. The broader alkdev ecosystem (pubsub, operations) is TypeScript-first. While future Python or mobile consumers are imaginable, they are not in scope.
## Decision
Use napi-rs. It's the standard for Node.js native addons, has the best documentation and tooling, and matches our primary consumer (TypeScript/Node.js). If future Python or mobile consumers are needed, uniffi can be added as a separate FFI layer — the Rust core library doesn't change, only the binding layer does.
## Consequences
- **Positive**: Best-in-class Node.js native addon support. Mature, well-documented, widely used.
- **Positive**: `napi` CLI handles building, cross-compilation, and npm package publishing.
- **Positive**: Async support via `napi-rs`'s `AsyncTask` and thread-safe functions.
- **Negative**: Only targets Node.js. Python/Swift/Kotlin require a separate FFI bridge (uniffi or similar).
- **Negative**: `.node` binaries are platform-specific. Need CI matrix for linux-x64, linux-arm64, macos-x64, macos-arm64, win32-x64.
## References
- [napi-and-pubsub.md](../napi-and-pubsub.md)
- [OQ-11](../open-questions.md) — resolved by this ADR

@@ -1,40 +0,0 @@
# ADR-016: NAPI Exposes Both connect() and serve()
## Status
Accepted
## Context
The NAPI wrapper needs to provide TypeScript/Node.js consumers with access to alknet's functionality. The primary use case is `@alkdev/pubsub`'s event target system, which needs both directions:
1. **connect()**: Establish a client connection to an alknet server. Used by workers/spokes that need to tunnel events through an alknet server.
2. **serve()**: Start an alknet server from Node.js. Used by hubs that want to accept alknet connections and route events.
The previous decision (ADR-007) was to expose only `connect()` for MVP, deferring `serve()`. However, the pubsub integration requires both: a spoke needs `connect()` to reach a hub, and a hub could use `serve()` to accept connections without running a separate `alknet serve` process.
More importantly, both `connect()` and `serve()` are fundamental operations of the alknet library. Since the NAPI wrapper is a thin layer over `alknet-core`, exposing both is straightforward — they're just Rust functions behind `#[napi]` attributes.
## Decision
The NAPI wrapper exposes both `connect()` and `serve()` from the start:
```typescript
// @alkdev/alknet
function connect(options: AlknetConnectOptions): Promise<Duplex>;
function serve(options: AlknetServeOptions): Promise<AlknetServer>;
```
- `connect()` returns a `Duplex` stream (as per ADR-007)
- `serve()` returns an `AlknetServer` object with a `close()` method and events for new connections
The NAPI layer is transport-agnostic — it doesn't know about pubsub's `EventEnvelope`. The pubsub event target adapter wraps the `Duplex` stream to implement `TypedEventTarget`. This separation ensures the NAPI wrapper is reusable for any stream-based protocol, not just pubsub.
## Consequences
- **Positive**: Pubsub can use both directions without running a separate binary for the server side.
- **Positive**: The NAPI wrapper becomes a complete bridge — any Node.js process can be either a client or server.
- **Positive**: Implementation is still minimal — `serve()` is just `alknet_core::server::run()` behind `#[napi]`.
- **Negative**: Slightly larger API surface (two functions + `AlknetServer` type instead of just `connect()`).
- **Negative**: Server-side NAPI needs to handle multiple concurrent connections, which adds complexity to `AlknetServer`.
## References
- [napi-and-pubsub.md](../napi-and-pubsub.md)
- [ADR-007](007-napi-single-stream.md) — still valid; NAPI exposes single streams, but now from both sides
- [OQ-10](../open-questions.md) — resolved by this ADR

@@ -1,30 +0,0 @@
# ADR-017: Stealth Mode — Protocol Multiplexing on Port 443
## Status
Accepted
## Context
When running an alknet server with TLS transport on port 443, the server should be indistinguishable from a regular HTTPS web server to port scanners and deep packet inspection (DPI) systems. This is important for censorship circumvention — if SSH traffic on port 443 is detectable, it can be blocked.
After the TLS handshake completes, the server sees a raw byte stream. An SSH identification string starts with `SSH-2.0-`, while an HTTP request starts with a method verb (GET, POST, etc.). The server can inspect the first bytes to determine the protocol.
## Decision
When `--stealth` is enabled with TLS transport:
1. After completing the TLS handshake, peek at the first few bytes of the connection
2. If the connection starts with `SSH-2.0-`, proceed with SSH session via `server::run_stream()`
3. If the connection starts with anything else (HTTP, random data), respond with `HTTP/1.1 404 Not Found\r\nServer: nginx\r\n\r\n` and close the connection
This makes the server appear as an nginx web server returning 404 errors to all non-SSH connections. Scanners and DPI systems see a typical HTTPS site with no SSH exposure.
The fake response carries a `Server: nginx` header to match the most common web server profile.
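The branch described above can be sketched as a pure function over the bytes peeked so far. This is a minimal illustration under assumptions: the `Sniffed` enum, the `sniff` helper, and the `NeedMore` state (for a peek that returned fewer than eight bytes) are hypothetical names, not alknet or russh APIs. Only the `SSH-2.0-` prefix and the decoy response come from this ADR.

```rust
/// Outcome of inspecting the first bytes of a decrypted TLS stream.
/// Hypothetical type for illustration, not part of alknet.
#[derive(Debug, PartialEq, Eq)]
pub enum Sniffed {
    /// The peer sent the SSH identification prefix: hand off to the SSH server.
    Ssh,
    /// Fewer bytes than the prefix have arrived, and they match so far.
    NeedMore,
    /// Anything else: send the decoy response and close.
    Decoy,
}

/// Prefix of the SSH identification string, as quoted in the ADR.
pub const SSH_PREFIX: &[u8] = b"SSH-2.0-";

/// Decoy response quoted in the ADR.
pub const DECOY_RESPONSE: &[u8] = b"HTTP/1.1 404 Not Found\r\nServer: nginx\r\n\r\n";

/// Classify the bytes peeked so far without consuming them.
pub fn sniff(peeked: &[u8]) -> Sniffed {
    if peeked.len() >= SSH_PREFIX.len() {
        if peeked.starts_with(SSH_PREFIX) {
            Sniffed::Ssh
        } else {
            Sniffed::Decoy
        }
    } else if SSH_PREFIX.starts_with(peeked) {
        // A short read that is still a valid prefix of "SSH-2.0-".
        Sniffed::NeedMore
    } else {
        Sniffed::Decoy
    }
}
```

A caller would loop on `NeedMore` (with a timeout), pass the untouched stream to the SSH server on `Ssh`, and write `DECOY_RESPONSE` before closing on `Decoy`.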
## Consequences
- **Positive**: TLS+alknet servers on port 443 are indistinguishable from ordinary HTTPS sites to automated scanners.
- **Positive**: Simple implementation — just peek at the first bytes and branch.
- **Positive**: Consistent with censorship circumvention best practices.
- **Negative**: Legitimate HTTPS traffic to the same port gets a 404. If the same IP needs to serve real web content, use a reverse proxy (nginx/haproxy) in front that routes by SNI or path.
- **Negative**: The `--stealth` flag only applies to TLS transport. It has no effect on TCP or iroh transports.
## References
- [server.md](../server.md)

@@ -1,38 +0,0 @@
# ADR-018: Control Channel for PubSub over SSH
## Status
Accepted
## Context
The NAPI wrapper and pubsub integration need a way to use alknet's SSH channel as a data plane for event routing. When an `alknet connect` client opens an SSH session to a server, the `direct_tcpip` channel type is used to reach specific TCP targets (host:port).
For the pubsub use case, the client needs a dedicated bidirectional stream to the server's event bus — not a TCP connection to a random host. There are several approaches:
1. **Special destination**: Use `direct_tcpip` with a reserved destination (e.g., `alknet-control:0`) that the server recognizes and routes internally instead of connecting to a TCP target.
2. **Port forwarding**: The server runs a pubsub hub on a specific port (e.g., 9736) and the client uses normal port forwarding (`-L 9736:hub:9736`).
3. **Custom channel type**: Define a new SSH channel type beyond `direct_tcpip` and `forwarded_tcpip`.
## Decision
Use approach 1: a reserved `direct_tcpip` destination string. When the server receives a `channel_open_direct_tcpip` request for `alknet-control:0`:
1. The `channel_open_direct_tcpip` handler detects the special target via string matching
2. Instead of connecting to a TCP target, it bridges the channel to the local pubsub event bus
3. `EventEnvelope` JSON flows bidirectionally over the SSH channel
The destination string `alknet-control` is reserved. Regular TCP targets are real hostnames or IP addresses, so a collision is unlikely in practice.
Approach 2 (port forwarding to a specific port) is still supported as an alternative — the client can use `--forward 9736:localhost:9736` if the server runs a pubsub hub on that port. But the control channel approach is simpler and doesn't require a separate listening port.
Approach 3 (custom channel type) was rejected because russh's `direct_tcpip` handler is well-understood and adding custom channel types requires modifying russh.
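The string-matching step can be illustrated with a small routing function. This is a sketch under assumptions: `Route`, `route_direct_tcpip`, and the choice to reject other `alknet-` destinations (following the prefix reservation noted under Consequences) are hypothetical. The real handler signature in russh differs, so only the decision logic is shown here.

```rust
/// Reserved destination host from this ADR.
pub const CONTROL_HOST: &str = "alknet-control";
/// Reserved namespace mentioned under Consequences.
pub const RESERVED_PREFIX: &str = "alknet-";

/// Where a direct_tcpip open request should go (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    /// Bridge the channel to the local event bus.
    ControlChannel,
    /// Reserved namespace but not a known destination: refuse the channel.
    RejectReserved,
    /// Ordinary TCP target: connect as usual.
    Tcp { host: String, port: u16 },
}

/// Decide how to handle a destination requested by the client.
pub fn route_direct_tcpip(host: &str, port: u16) -> Route {
    if host == CONTROL_HOST && port == 0 {
        Route::ControlChannel
    } else if host.starts_with(RESERVED_PREFIX) {
        Route::RejectReserved
    } else {
        Route::Tcp { host: host.to_string(), port }
    }
}
```

Rejecting the whole prefix, rather than only the exact string, keeps future reserved destinations from being dialed as TCP targets by older servers.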
## Consequences
- **Positive**: Simple implementation — just string matching in the server's `channel_open_direct_tcpip` handler.
- **Positive**: No separate port or service needs to run on the server. The control channel is built into alknet.
- **Positive**: Compatible with the NAPI wrapper's single-duplex-stream model.
- **Positive**: Port forwarding to a specific port is still available as an alternative.
- **Negative**: The string `alknet-control` is a magic constant. It should be defined as a constant in the crate.
- **Negative**: Regular TCP destinations accidentally matching `alknet-control` would be misrouted. Mitigated by reserving the entire `alknet-` prefix namespace.
## References
- [napi-and-pubsub.md](../napi-and-pubsub.md)
- [server.md](../server.md)

@@ -1,42 +0,0 @@
# ADR-019: `--proxy` Has Different Semantics on Client vs Server
## Status
Accepted
## Context
The `--proxy` CLI flag appears on both `alknet connect` (client) and `alknet serve` (server), but the two sides proxy fundamentally different things:
- **Client**: `--proxy` routes the *transport connection* through the proxy. For example, `alknet connect --transport iroh --proxy socks5://127.0.0.1:1080` means the iroh endpoint's outbound TCP connections go through the specified SOCKS5 proxy before reaching the iroh relay. The proxy wraps the transport layer.
- **Server**: `--proxy` routes *outbound target connections* through the proxy. For example, `alknet serve --proxy socks5://127.0.0.1:9050` means when an SSH client opens a `direct_tcpip` channel to `db.internal:5432`, the server connects to that target through the specified proxy. The proxy wraps the data-plane connections.
Using the same flag name for both is intentional — from the user's perspective, both mean "route traffic through a proxy." But the layer at which the proxy operates differs, and this needs to be explicit so implementers don't confuse the two.
ADR-010 addressed transport chaining for the client side only. The server-side outbound proxy behavior has no ADR. This ADR documents both semantics and the rationale for sharing the flag name.
## Decision
The `--proxy` flag uses the same name on client and server, with documented different semantics:
| Side | Flag | What gets proxied | Example |
|------|------|-------------------|---------|
| Client | `--proxy` | Transport connection (outbound to server/relay) | `--transport iroh --proxy socks5://...` → iroh endpoint connects through proxy |
| Server | `--proxy` | Outbound target connections (data plane) | `--proxy socks5://...` → direct_tcpip targets reached through proxy |
On the **client**, `--proxy` affects the transport layer. It only applies to transports that make outbound TCP connections (iroh through a proxy, TLS through a proxy). For plain TCP transport, `--proxy` has no meaningful effect since the transport is already a direct TCP connection — use the SOCKS5 server instead.
On the **server**, `--proxy` affects the data plane. All `channel_open_direct_tcpip` outbound connections are routed through the proxy, regardless of transport mode.
This is not a naming collision — it's the same conceptual operation ("route through a proxy") at different layers. The shared name avoids forcing users to learn two proxy flags.
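The per-side semantics can be summarized as a lookup. The sketch below is illustrative only: `Side`, `TransportMode`, `Proxied`, and `proxied_by_flag` are hypothetical names, and the mapping simply restates the rules in this Decision section.

```rust
/// Which binary the flag was passed to.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Side {
    Client,
    Server,
}

/// Transport mode selected on the command line.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransportMode {
    Tcp,
    Tls,
    Iroh,
}

/// What `--proxy` ends up wrapping.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Proxied {
    /// Client side: the outbound transport connection.
    TransportConnection,
    /// Server side: outbound direct_tcpip target connections.
    TargetConnections,
    /// Client with plain TCP: no meaningful effect.
    Nothing,
}

/// Restates the ADR's rules as a single match.
pub fn proxied_by_flag(side: Side, transport: TransportMode) -> Proxied {
    match (side, transport) {
        // The server proxies the data plane regardless of transport mode.
        (Side::Server, _) => Proxied::TargetConnections,
        // Plain TCP on the client is already a direct connection.
        (Side::Client, TransportMode::Tcp) => Proxied::Nothing,
        // TLS and iroh on the client route the transport through the proxy.
        (Side::Client, _) => Proxied::TransportConnection,
    }
}
```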
## Consequences
- **Positive**: One flag name (`--proxy`) instead of two. Users already understand "proxy" as "route through this."
- **Positive**: Client-side proxy is minimal implementation — iroh's endpoint builder accepts proxy config natively.
- **Positive**: Server-side proxy is straightforward — all outbound TCP from channel handlers goes through the proxy.
- **Negative**: Implementers must read the correct spec (client vs server) to understand what `--proxy` does for their side. This is mitigated by CLI help text that clearly describes the behavior per side.
- **Negative**: On the client, `--proxy` with `--transport tcp` is effectively a no-op (the transport is already a direct TCP connection to the server). The CLI should handle this case gracefully.
## References
- [ADR-010](010-transport-chaining-cli.md) — client-side transport chaining
- [transport.md](../transport.md) — transport layer spec
- [client.md](../client.md) — client CLI
- [server.md](../server.md) — server outbound proxy

@@ -1,85 +0,0 @@
# ADR-023: Unified Authentication with Shared Key Material
## Status
Accepted
## Context
Alknet currently authenticates connections exclusively through SSH public key
auth in the SSH handshake. This works for SSH-over-any-transport (TCP, TLS,
iroh) because SSH carries its own auth protocol. But WebTransport and other
HTTP-level transports cannot perform SSH key exchange — browsers speak HTTP/3,
not SSH.
Without unification, non-SSH transports would need a completely separate
identity system (API keys, JWTs, session tokens). This creates two problems:
(1) operators manage two key sets with two rotation mechanisms, and (2) the
same person connecting via SSH and WebTransport appears as two different
identities.
The `IdentityProvider` trait is needed to decouple alknet-core from any
specific identity storage (config file vs. database). Without it, alknet-core
would either hardcode config-file-based auth or take a database dependency —
neither is acceptable for a library crate.
## Decision
**Unified authentication**: The same Ed25519 key material (`authorized_keys`
and `cert_authorities`) is shared across both SSH auth and token auth. The
presentation differs per transport, but the verification result (an
`Identity` with scopes) is the same.
**Token auth for non-SSH transports**: WebTransport clients present a signed
timestamp token in the CONNECT request URL:
```
AuthToken = base64url(key_id || timestamp || signature)
key_id    = SHA-256 fingerprint of the Ed25519 public key (32 bytes)
timestamp = Unix seconds, big-endian u64 (8 bytes)
signature = Ed25519 sign(key_id || timestamp, private_key) (64 bytes)
```
Server extracts the fingerprint, looks it up in the same `authorized_keys`
set, verifies the signature, and checks the timestamp window (default ±300s).
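To make the layout concrete, here is a minimal sketch of splitting an already base64url-decoded token into its fields and checking the timestamp window. It assumes the 32 + 8 + 64 byte layout above. `RawToken`, `parse_token`, and `within_window` are hypothetical names, and both the base64url decoding and the Ed25519 signature verification are deliberately left out because they need an external crate.

```rust
pub const KEY_ID_LEN: usize = 32; // SHA-256 fingerprint
pub const TIMESTAMP_LEN: usize = 8; // big-endian u64
pub const SIGNATURE_LEN: usize = 64; // Ed25519 signature
pub const TOKEN_LEN: usize = KEY_ID_LEN + TIMESTAMP_LEN + SIGNATURE_LEN;

/// Borrowed view over a decoded token (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct RawToken<'a> {
    pub key_id: &'a [u8],
    pub timestamp: u64,
    pub signature: &'a [u8],
}

impl<'a> RawToken<'a> {
    /// The bytes the signature covers: key_id || timestamp.
    pub fn signed_message(&self) -> Vec<u8> {
        let mut msg = Vec::with_capacity(KEY_ID_LEN + TIMESTAMP_LEN);
        msg.extend_from_slice(self.key_id);
        msg.extend_from_slice(&self.timestamp.to_be_bytes());
        msg
    }
}

/// Split decoded token bytes into fields. Returns None on a wrong length.
pub fn parse_token(decoded: &[u8]) -> Option<RawToken<'_>> {
    if decoded.len() != TOKEN_LEN {
        return None;
    }
    let (key_id, rest) = decoded.split_at(KEY_ID_LEN);
    let (ts, signature) = rest.split_at(TIMESTAMP_LEN);
    let mut ts_bytes = [0u8; TIMESTAMP_LEN];
    ts_bytes.copy_from_slice(ts);
    Some(RawToken {
        key_id,
        timestamp: u64::from_be_bytes(ts_bytes),
        signature,
    })
}

/// Accept timestamps within +/- window_secs of the server clock.
pub fn within_window(token_ts: u64, now: u64, window_secs: u64) -> bool {
    now.abs_diff(token_ts) <= window_secs
}
```

A real verifier would look up `key_id` in the authorized keys, verify `signature` over `signed_message()`, and only then apply `within_window` with the default of 300 seconds.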
**`IdentityProvider` trait**: Decouples alknet-core from identity storage. The
trait resolves a fingerprint or token to an `Identity`. Default implementation
loads from `DynamicConfig.auth` (no database). Hub implementation can back it
with `@alkdev/storage`.
**`TokenKeySource::Shared`**: The token auth uses the same authorized keys set
as SSH auth by default. Deployments that want separate access control can use
`TokenKeySource::Separate` with a distinct key set.
**Replay protection via timestamps**: V1 uses timestamp-only (no server state).
Zero-replay can be added later via a nonce challenge-response without changing
the key material.
## Consequences
- **Positive**: One key set, one rotation, one `reloadAuth()` call. Adding a
key to `authorized_keys` immediately grants access via both SSH and
WebTransport.
- **Positive**: `IdentityProvider` trait makes alknet-core independent of any
specific database. Default: config file. Hub: `@alkdev/storage`.
- **Positive**: Browser clients can authenticate using Ed25519 keys via
SubtleCrypto (Chrome 105+, Firefox 130+, Safari 17+). Deno supports it
natively.
- **Positive**: No JWT library dependency. The token is a simple Ed25519
signature over a fixed structure — same primitives SSH already uses.
- **Negative**: V1 has a replay window (±300s). An attacker who intercepts a
QUIC packet can replay the token within the window. Acceptable because QUIC
interception is the same threat level as connection hijacking.
- **Negative**: Certificate authority tokens are not supported in v1. CA
verification requires the full OpenSSH certificate structure, which doesn't
fit in a signed timestamp.
- **Negative**: Browser-side key management is less ergonomic than SSH key
files. The private key must be imported into SubtleCrypto. This is a UI/UX
concern, not a protocol concern.
## References
- [auth.md](../auth.md) — Full auth architecture spec
- [ADR-012](012-auth-ed25519-and-cert-authority.md) — Ed25519 + cert-authority auth
- [OQ-17](../open-questions.md) — Transport-aware auth (resolved by this ADR)
- [configuration.md](../../research/configuration.md) — OQ-CFG-04, OQ-CFG-06 (resolved)

@@ -1,63 +0,0 @@
# ADR-024: Bidirectional Call Protocol
## Status
Accepted
## Context
The alknet control channel (ADR-018) routes from client → server's event bus.
This is unidirectional: clients can send events to the server, but the server
cannot call operations on the client. In the hub/spoke model, spokes (dev env
containers) connect to a hub and expose operations (fs, bash, search) that the
hub invokes. The hub needs to call *spoke* operations.
Additionally, the current control channel provides no request/response semantics.
Every consumer that needs call/response reinvents the pending-request correlation.
## Decision
The call protocol is bidirectional. Both sides can send `call.requested` and
receive `call.responded`. The protocol uses `EventEnvelope` wire format (4-byte
BE length prefix + JSON) — the same as `@alkdev/pubsub`.
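The wire format named above (4-byte big-endian length prefix followed by the JSON body) can be sketched with blocking `std::io` traits. This is an illustration, not the alknet implementation, which would presumably use async I/O. `write_frame`, `read_frame`, and the `MAX_FRAME_LEN` cap are assumptions introduced here.

```rust
use std::io::{self, Read, Write};

/// Assumed upper bound on a frame body. The ADR does not specify one.
pub const MAX_FRAME_LEN: u32 = 1 << 20;

/// Write one frame: 4-byte big-endian length, then the JSON bytes.
pub fn write_frame<W: Write>(w: &mut W, json: &[u8]) -> io::Result<()> {
    if json.len() > MAX_FRAME_LEN as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    w.write_all(&(json.len() as u32).to_be_bytes())?;
    w.write_all(json)
}

/// Read one frame and return its JSON bytes (not parsed here).
pub fn read_frame<R: Read>(r: &mut R) -> io::Result<Vec<u8>> {
    let mut header = [0u8; 4];
    r.read_exact(&mut header)?;
    let len = u32::from_be_bytes(header);
    if len > MAX_FRAME_LEN {
        // Refuse oversized frames before allocating the body buffer.
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
    }
    let mut body = vec![0u8; len as usize];
    r.read_exact(&mut body)?;
    Ok(body)
}
```

Checking the length before allocating keeps a malformed or hostile prefix from forcing a multi-gigabyte buffer.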
Five event types: `call.requested`, `call.responded`, `call.completed`,
`call.aborted`, `call.error`.
A call is a subscription that resolves after one event. Both use `call.requested`
with a correlated `requestId`; `PendingRequestMap` in core provides the correlation.
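Request/response correlation of this kind can be sketched as a map from `requestId` to a one-shot sender. The type below is a hypothetical stand-in named `PendingRequests`. It is not the actual `PendingRequestMap`, whose API is not shown in this ADR, and it uses `std::sync::mpsc` where an async runtime would use a oneshot channel.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};

/// Tracks calls that are waiting for a `call.responded` event.
pub struct PendingRequests {
    waiting: HashMap<String, Sender<String>>,
}

impl PendingRequests {
    pub fn new() -> Self {
        PendingRequests { waiting: HashMap::new() }
    }

    /// Record an outgoing call and return the receiver for its response.
    pub fn register(&mut self, request_id: &str) -> Receiver<String> {
        let (tx, rx) = channel();
        self.waiting.insert(request_id.to_string(), tx);
        rx
    }

    /// Deliver a response payload. Returns false for unknown or stale ids.
    pub fn resolve(&mut self, request_id: &str, payload: String) -> bool {
        match self.waiting.remove(request_id) {
            Some(tx) => tx.send(payload).is_ok(),
            None => false,
        }
    }

    /// Drop an entry on timeout or connection close.
    pub fn cancel(&mut self, request_id: &str) -> bool {
        self.waiting.remove(request_id).is_some()
    }

    /// Number of calls still waiting.
    pub fn len(&self) -> usize {
        self.waiting.len()
    }
}
```

Removing the entry inside `resolve` means a duplicate or late response for the same `requestId` is ignored rather than delivered twice.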
Operation names use slash-based paths: `/{spoke}/{service}/{op}`. The first
path segment routes the call to the correct connected node. The hub's registry
maps spoke prefixes to connections. This mirrors iroh's ALPN dispatch: the
first segment is the routing key, remaining path dispatches within the node.
Core-provided operations use short paths without a spoke prefix
(`/services/list`, `/services/schema`). Spoke operations are prefixed
(`/dev1/fs/readFile`).
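First-segment routing can be illustrated as follows. The sketch assumes the hub knows the names of connected spokes and treats any other path as hub-local. `Target` and `route_path` are hypothetical names, and the rule for telling core paths from spoke paths is an assumption, since the ADR does not spell it out.

```rust
/// Where an operation path should be dispatched (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Target<'a> {
    /// No spoke prefix: handled by the local registry, e.g. `/services/list`.
    Local { path: &'a str },
    /// First segment names a connected spoke. `rest` is dispatched on that node.
    Spoke { spoke: &'a str, rest: &'a str },
}

/// Split a slash-based path on its first segment and pick a target.
pub fn route_path<'a>(path: &'a str, spokes: &[&str]) -> Option<Target<'a>> {
    // Paths must be absolute and non-empty.
    let trimmed = path.strip_prefix('/')?;
    if trimmed.is_empty() {
        return None;
    }
    let (first, rest) = match trimmed.find('/') {
        Some(i) => (&trimmed[..i], &trimmed[i..]),
        None => (trimmed, ""),
    };
    if spokes.contains(&first) {
        if rest.is_empty() {
            // A bare spoke name has no operation to call.
            return None;
        }
        Some(Target::Spoke { spoke: first, rest })
    } else {
        Some(Target::Local { path })
    }
}
```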
This generalizes ADR-018's control channel: the `alknet-*` destination becomes
a transport for `EventEnvelope` frames with call protocol semantics, instead of
raw pubsub dispatch.
## Consequences
- **Positive**: Hub can invoke operations on spokes. Dev env containers
expose fs, bash, search — the hub calls them as needed.
- **Positive**: Browser clients can expose custom UDFs. Any connected participant
can both call and serve operations.
- **Positive**: Built-in request/response correlation. One `PendingRequestMap`
in core serves all consumers.
- **Positive**: Slash-based paths align with URL routing, OpenAPI, MCP, and
iroh's ALPN dispatch. First segment = routing key.
- **Positive**: Multiple spokes exposing the same service (two dev envs both
exposing `/fs/*`) are naturally differentiated by the spoke prefix.
- **Negative**: The `PendingRequestMap` adds in-memory state. Entries must be
cleaned up on timeout or connection close.
- **Negative**: The hub must maintain a routing table mapping spoke identities
to connections, with registration on connect and cleanup on disconnect.
## References
- [call-protocol.md](../call-protocol.md) — Full call protocol spec
- [ADR-018](018-control-channel-for-pubsub.md) — Control channel (generalized)
- [napi-and-pubsub.md](../napi-and-pubsub.md) — NAPI wrapper and pubsub adapter

@@ -1,73 +0,0 @@
# ADR-025: Handler/Spec Separation for Downstream Service Registration
## Status
Accepted
## Context
The current control channel (ADR-018) is hardcoded: `alknet-control:0` bridges
to the local pubsub event bus. If NAPI wants to expose `fs.readFile` or
`bash.exec` as callable operations, it has no way to register these with core's
channel routing. The NAPI handler would need to intercept channel data outside
of core.
For the hub/spoke model, spokes register their operations with the hub when
they connect. The hub's registry must include both hub-local operations and
remote operations exposed by spokes.
## Decision
Operation specs and handlers are separated from core. Core provides:
1. `OperationSpec` — describes what an operation does (name, type, input/output
schemas, access control)
2. `OperationHandler` — implements the operation logic
3. `OperationRegistry` — maps paths to specs + handlers
4. Built-in operations: `/services/list`, `/services/schema`
Downstream consumers register their own operations:
```rust
// NAPI layer registers dev env tools
registry.register(OperationSpec { name: "/fs/readFile", ... }, fs_read_handler);
registry.register(OperationSpec { name: "/bash/exec", ... }, bash_exec_handler);
// Browser client registers a custom UDF
registry.register(OperationSpec { name: "/notify/alert", ... }, notify_handler);
```
Operation names use slash-based paths: `/{spoke}/{service}/{op}`. The first
segment routes to the node. The `namespace` field on `OperationSpec` is
derived from the second path segment (`service`).
When spoke operations are registered with the hub, the hub adds the spoke
prefix: a spoke that registers `/fs/readFile` as "dev1" becomes addressable as
`/dev1/fs/readFile` in the hub's routing table.
The `/services/list` operation returns all registered specs. The
`/services/schema` operation returns the spec for a specific operation. These
are read-only — no admin operations.
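A toy registry shows how the spoke prefix and the listing behaviour fit together. Everything here is a simplified assumption: the real `OperationSpec` carries schemas and access control, and handlers would be async. `Registry`, `Handler`, and `echo_handler` are hypothetical names used only for this sketch.

```rust
use std::collections::BTreeMap;

/// Simplified handler: takes an input string, returns an output string.
pub type Handler = fn(&str) -> String;

/// Maps operation paths to handlers (specs omitted for brevity).
pub struct Registry {
    ops: BTreeMap<String, Handler>,
}

impl Registry {
    pub fn new() -> Self {
        Registry { ops: BTreeMap::new() }
    }

    /// Register a hub-local operation under its own path.
    pub fn register(&mut self, path: &str, handler: Handler) {
        self.ops.insert(path.to_string(), handler);
    }

    /// Register a spoke operation: the hub adds the spoke prefix.
    pub fn register_spoke(&mut self, spoke: &str, path: &str, handler: Handler) {
        self.ops.insert(format!("/{}{}", spoke, path), handler);
    }

    /// Rough analogue of `/services/list`: all registered paths, sorted.
    pub fn list(&self) -> Vec<String> {
        self.ops.keys().cloned().collect()
    }

    /// Look up and invoke a handler by full path.
    pub fn call(&self, path: &str, input: &str) -> Option<String> {
        self.ops.get(path).map(|handler| handler(input))
    }
}

/// Placeholder handler used only for demonstration.
pub fn echo_handler(input: &str) -> String {
    format!("echo: {}", input)
}
```

Two spokes that both register `/fs/readFile` end up under distinct keys, which is the differentiation the Consequences section relies on.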
## Consequences
- **Positive**: NAPI, Python, and any downstream consumer can register
operations without modifying core.
- **Positive**: Service discovery is built in. Clients query `/services/list`
to learn what operations a hub offers.
- **Positive**: Spoke prefix naturally differentiates multiple spokes exposing
the same service (dev1 vs dev2).
- **Positive**: `AccessControl` on each `OperationSpec` enables per-operation
authorization. Higher-risk operations (shell, filesystem write) can require
tighter scopes.
- **Positive**: Schema exposure enables MCP adapter generation. OperationSpec
maps directly to MCP tool definitions.
- **Negative**: The registry adds complexity. Core now owns `OperationSpec`,
`OperationRegistry`, and `PendingRequestMap`.
- **Negative**: Namespace collisions between downstream consumers are possible.
The spoke prefix mitigates this: `/dev1/fs/readFile` vs `/dev2/fs/readFile`.
## References
- [call-protocol.md](../call-protocol.md) — Full call protocol spec
- [ADR-018](018-control-channel-for-pubsub.md) — Control channel (generalized)
- `@alkdev/operations` — TypeScript `OperationSpec`, `CallHandler`, registry

@@ -1,162 +0,0 @@
# ADR-026: Transport/Interface Separation (Three-Layer Model)
## Status
Accepted
## Context
In the current architecture, SSH is deeply embedded in the server handler. The
`ServerHandler` owns auth, channel management, and proxy logic — all mixed
together. This makes it impossible to run the call protocol over any transport
that doesn't speak SSH, such as:
- **DNS** — encoding call protocol frames as DNS TXT queries/responses for
censorship resistance
- **Raw framing** — 4-byte length prefix + JSON `EventEnvelope` without SSH
wrapping, for local service mesh or browser-to-head direct communication
- **WebTransport** — running call protocol over QUIC streams (browsers can't do
SSH key exchange)
The DNS control channel concept from research (`core.md`) currently conflates
"DNS as a transport that moves bytes" with "SSH sessions over those bytes." But
SSH is not a transport — it's a protocol layer that sits *on top of* a
transport. Separating them enables the DNS control channel to carry call
protocol events directly, without wrapping SSH inside DNS queries.
The same separation enables raw framing (no SSH overhead) for trusted local
networks, and WebTransport direct call protocol for browser clients.
## Decision
**Establish a three-layer model:**
### Layer 1: Transport
Produces byte streams. A `Transport` still produces
`AsyncRead + AsyncWrite + Unpin + Send`. This layer is unchanged from ADR-001.
```rust
#[async_trait]
pub trait Transport: Send + Sync + 'static {
    type Stream: AsyncRead + AsyncWrite + Unpin + Send + 'static;
    async fn connect(&self) -> Result<Self::Stream>;
    fn describe(&self) -> String;
}
```
Transports: TCP, TLS, iroh, DNS (as byte carrier), WebTransport (future).
### Layer 2: Interface
Consumes a `Transport::Stream` and produces call protocol sessions. An
interface is what SSH currently does: wrap a byte stream in session semantics.
```rust
#[async_trait]
pub trait Interface: Send + Sync + 'static {
    type Session;
    async fn accept(stream: TransportStream, config: &InterfaceConfig) -> Result<Self::Session>;
}
```
Interfaces:
- **SSH interface** — wraps existing `ServerHandler` logic. SSH handshake, auth,
channel multiplexing. The call protocol runs over a reserved SSH channel
(`alknet-control:0`).
- **Raw framing interface** — 4-byte big-endian length prefix + JSON
`EventEnvelope`. No SSH overhead. Direct call protocol over the transport
stream.
- **DNS control channel** — a (DNS transport, raw framing interface) pair that
encodes/decodes `EventEnvelope` frames as DNS query/response pairs.
### Layer 3: Protocol
Carries semantics. Call protocol events, operation registry, service calls.
The protocol is agnostic to both the transport and the interface below it. It
receives `EventEnvelope` frames from whatever interface produced them.
### Connection Model
A **connection** is always a (Transport, Interface) pair. The valid combinations are enumerated:
| Transport | Interface | Use case |
|-----------|-----------|----------|
| TLS | SSH | Standard alknet tunnel |
| TCP | SSH | Plain SSH tunnel |
| iroh | SSH | P2P SSH tunnel |
| DNS | raw framing | DNS control channel |
| WebTransport | SSH | Browser SSH tunnel (future) |
| WebTransport | raw framing | Browser call protocol (future) |
| TCP | raw framing | Direct call protocol, local mesh |
**The DNS control channel carries call protocol frames directly — it does NOT
wrap SSH inside DNS.** This is explicit because the research originally
conflated "SSH tunneling over DNS" with "DNS as a transport for call protocol."
The (DNS, raw framing) pair sends `EventEnvelope` frames as DNS TXT
queries/responses — no SSH involved.
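The enumerated pairs can be encoded as a single match so that unsupported combinations are rejected at configuration time. This is a sketch: `TransportTag`, `InterfaceTag`, and `use_case` are hypothetical names. The accepted pairs are exactly the rows of the table above, and everything else (including DNS with SSH) returns `None`.

```rust
/// Transport layer tag (Layer 1), without per-transport settings.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransportTag {
    Tcp,
    Tls,
    Iroh,
    Dns,
    WebTransport,
}

/// Interface layer tag (Layer 2).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum InterfaceTag {
    Ssh,
    RawFraming,
}

/// Return the documented use case for a valid pair, or None if not listed.
pub fn use_case(t: TransportTag, i: InterfaceTag) -> Option<&'static str> {
    match (t, i) {
        (TransportTag::Tls, InterfaceTag::Ssh) => Some("Standard alknet tunnel"),
        (TransportTag::Tcp, InterfaceTag::Ssh) => Some("Plain SSH tunnel"),
        (TransportTag::Iroh, InterfaceTag::Ssh) => Some("P2P SSH tunnel"),
        (TransportTag::Dns, InterfaceTag::RawFraming) => Some("DNS control channel"),
        (TransportTag::WebTransport, InterfaceTag::Ssh) => Some("Browser SSH tunnel (future)"),
        (TransportTag::WebTransport, InterfaceTag::RawFraming) => {
            Some("Browser call protocol (future)")
        }
        (TransportTag::Tcp, InterfaceTag::RawFraming) => {
            Some("Direct call protocol, local mesh")
        }
        // Not in the table, e.g. SSH wrapped inside DNS.
        _ => None,
    }
}
```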
### `TransportKind` Enum
The `TransportKind` enum (currently `Tcp | Tls | Iroh`) gains `Dns` and
`WebTransport` variants. Initially these are tags only — no acceptor
implementation. The full DNS and WebTransport implementations are Phase 4 work
per the integration plan.
```rust
pub enum TransportKind {
    Tcp,
    Tls { server_name: Option<String> },
    Iroh { endpoint_id: String },
    Dns { domain: String },
    WebTransport { host: String },
}
```
### ServerHandler Refactor
The existing `ServerHandler` is refactored into `SshInterface`. The interface
abstraction means the server's accept loop becomes:
```rust
// Pseudocode
let (transport, interface) = listener_config;
let stream = transport.accept().await?;
let session = interface.accept(stream, &config).await?;
// session produces call protocol events
```
The call protocol handler is interface-agnostic — it receives `EventEnvelope`
frames from any interface. Auth, forwarding policy, and operation routing happen
at Layer 3, not inside the SSH handler.
## Consequences
- **Positive**: Enables DNS control channel without SSH wrapping. The (DNS,
raw framing) pair is a clean (Transport, Interface) combination.
- **Positive**: Enables raw framing for local service mesh. No SSH overhead for
trusted networks.
- **Positive**: SSH becomes pluggable. The same call protocol handler works with
any interface.
- **Positive**: `ServerHandler` is refactored into `SshInterface` — a smaller,
more focused component that only handles SSH session management.
- **Positive**: Future WebTransport and WebSocket interfaces are additive — they
implement the `Interface` trait without touching SSH code.
- **Negative**: This is the most invasive code change in Phase 1
(integration-plan, Phase 1.8). SSH auth, channel management, and proxy logic
are currently tangled in `ServerHandler`. Extracting them requires careful
refactoring to maintain existing behavior.
- **Negative**: The `Interface` trait is new and untested. The design must
accommodate both SSH's channel multiplexing and raw framing's single-stream
model through the same abstraction.
## References
- [research/core.md](../../research/core.md) — Transport layer, DNS transport section
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.8, three-layer model
- [transport.md](../transport.md) — Current Transport trait (unchanged at Layer 1)
- [server.md](../server.md) — Current ServerHandler (will become SshInterface)
- [ADR-001](001-pluggable-transport.md) — Transport trait produces stream (unchanged)
- [ADR-004](004-ssh-over-transport.md) — SSH runs over transport (reinforced by Layer 2)
- [ADR-024](024-bidirectional-call-protocol.md) — Bidirectional call protocol (Layer 3)

@@ -1,164 +0,0 @@
# ADR-027: Crate Decomposition
## Status
Accepted
## Context
alknet-core currently contains everything: transport, SSH, auth, config, the
call protocol handler, and the server accept loop. As the project grows to
include SQLite-backed identity, HD key derivation, and metagraph storage, core
would need to depend on rusqlite, bip39, petgraph, and other heavy dependencies
— unacceptable for a library crate that CLI users embed.
Different deployment topologies need different subsets:
- A minimal CLI tunnel only needs core, transport, and auth types
- A head node needs SQLite-backed identity and the secret service
- A flowgraph visualization tool only needs petgraph operations
Circular dependencies must be avoided. alknet-storage implements
alknet-core's `IdentityProvider` trait, so alknet-core cannot depend on
alknet-storage. alknet-storage references alknet-secret's `EncryptedData` wire
format, but not as a crate dependency.
## Decision
**Decompose the project into six crates with a strict acyclic dependency graph.**
### Crate Structure
1. **alknet-core** — Transport, SSH, call protocol, config, auth types, identity,
`OperationSpec`, `Interface` trait. The foundational crate that everything
else builds on (in some cases through shared types only, not a crate dependency).
- *Depends on*: russh, tokio, irpc (feature-gated), serde, arc-swap
- *Does NOT depend on*: alknet-secret, alknet-storage, alknet-flowgraph
2. **alknet-secret** — BIP39 mnemonic generation, SLIP-0010 Ed25519 HD key
derivation, AES-256-GCM encryption, `SecretProtocol` irpc service.
- *Depends on*: bip39, ed25519-bip32 (or rust-bip32-ed25519), aes-gcm, sha2,
irpc
- *Does NOT depend on*: alknet-core, alknet-storage
3. **alknet-storage** — SQLite-backed metagraph, identity tables, ACL graph,
honker integration, `StorageProtocol` irpc service.
- *Depends on*: rusqlite (via honker), honker, petgraph, jsonschema, irpc
- *Does NOT depend on alknet-core* (but implements alknet-core's
`IdentityProvider` trait via the trait, not a crate dep)
- *Does NOT depend on alknet-secret* (but references `EncryptedData` type
format for wire compatibility)
4. **alknet-flowgraph** — `FlowGraph<N,E>` over petgraph, operation graph, call
graph, type compatibility checking.
- *Depends on*: petgraph, serde, jsonschema, thiserror
- *Does NOT depend on*: alknet-core, alknet-storage, alknet-secret
5. **alknet-napi** — Node.js native addon. Exposes alknet-core to Node.js.
- *Depends on*: alknet-core
- *Does NOT depend on*: alknet-secret, alknet-storage, alknet-flowgraph
6. **alknet** (CLI binary) — Assembles everything.
- *Depends on*: alknet-core, alknet-secret (feature), alknet-storage (feature),
alknet-flowgraph (feature), toml
### Dependency Graph
```
alknet-secret      alknet-storage     alknet-flowgraph
(standalone)       (standalone)       (standalone)
     │                  │                   │
     │ (feature flags   │ (trait impl       │ (type compat
     │  in CLI binary)  │  via CLI wire)    │  via JSON)
     ▼                  ▼                   ▼
              ┌─────────────────────┐
              │     alknet-core     │
              │  (transport, SSH,   │
              │   call protocol,    │
              │  Identity, Config)  │
              └─────────┬───────────┘
           ┌────────────┼────────────┐
           ▼            ▼            ▼
      alknet-napi   alknet (CLI binary — assembles everything)
```
All four library crates (core, secret, storage, flowgraph) are independent of
each other. Dependencies flow **upward** only. The CLI binary sits at the top
and wires concrete implementations together. alknet-storage implements
alknet-core's `IdentityProvider` trait without a crate dependency — the CLI
binary provides the bridge.
### Narrow Interface Points
Three types serve as the narrow interface points between crates:
1. **`Identity`** — Defined in `alknet_core::auth`. Used by auth handler,
forwarding policy, and call protocol. alknet-storage implements
`IdentityProvider` to produce instances.
2. **`IdentityProvider`** — Trait defined in `alknet_core::auth`. Implemented by
`ConfigIdentityProvider` (in core) and `StorageIdentityProvider` (in
alknet-storage). The CLI/NAPI layer wires the concrete implementation.
3. **`OperationSpec`** — Defined in `alknet_core::call`. Used by the operation
registry and by alknet-flowgraph for type compatibility checking. The bridge
is serialization — flowgraph serializes to JSON, storage persists it.
### irpc Feature Flag
irpc is a feature flag in alknet-core. When disabled, auth and config go through
`IdentityProvider` and `ConfigReloadHandle` directly — no irpc overhead. Nodes
that only do SSH tunneling don't need the service layer.
In alknet-secret and alknet-storage, irpc is an independent dependency, not
feature-gated. These crates always define irpc service protocols because they
are used in production deployments where the service layer is active.
### alknet-storage's Relationship to alknet-core
alknet-storage does NOT depend on alknet-core as a crate. Instead:
- alknet-storage defines its own `IdentityProvider` impl that matches
alknet-core's trait signature. The trait is either re-exported from alknet-core
or defined locally, with `#[cfg(feature = "alknet-core")]` gating the interop.
- In practice, the CLI binary crate depends on both and wires them together.
alknet-storage provides `StorageIdentityProvider`; alknet-core takes
`impl IdentityProvider`.
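
The wiring described above can be sketched in a single file, with modules standing in for crates. This is an illustration under stated assumptions, not the actual crate layout: `core_crate`, `storage_crate`, `AccountRow`, `CoreBridge`, `authenticate`, and `demo_bridge` are hypothetical names, the SQLite tables are replaced by an in-memory map, and only `resolve_from_fingerprint` is shown.

```rust
use std::collections::HashMap;

/// Stand-in for alknet-core: defines the contract and knows nothing about storage.
mod core_crate {
    use std::collections::HashMap;

    #[derive(Debug, Clone, PartialEq)]
    pub struct Identity {
        pub id: String,
        pub scopes: Vec<String>,
        pub resources: HashMap<String, Vec<String>>,
    }

    pub trait IdentityProvider: Send + Sync + 'static {
        fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
    }

    /// Hypothetical consumer that accepts any provider.
    pub fn authenticate(provider: &impl IdentityProvider, fingerprint: &str) -> bool {
        provider.resolve_from_fingerprint(fingerprint).is_some()
    }
}

/// Stand-in for alknet-storage: it never names `core_crate`.
mod storage_crate {
    use std::collections::HashMap;

    pub struct AccountRow {
        pub account_id: String,
        pub scopes: Vec<String>,
    }

    /// In-memory map standing in for the SQLite tables (assumption).
    pub struct StorageIdentityProvider {
        pub by_fingerprint: HashMap<String, AccountRow>,
    }

    impl StorageIdentityProvider {
        pub fn lookup(&self, fingerprint: &str) -> Option<&AccountRow> {
            self.by_fingerprint.get(fingerprint)
        }
    }
}

/// Stand-in for the CLI binary: the only place that sees both sides.
struct CoreBridge(storage_crate::StorageIdentityProvider);

impl core_crate::IdentityProvider for CoreBridge {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<core_crate::Identity> {
        // Convert the storage-side row into the core-side type.
        self.0.lookup(fingerprint).map(|row| core_crate::Identity {
            id: row.account_id.clone(),
            scopes: row.scopes.clone(),
            resources: HashMap::new(),
        })
    }
}

/// Builds a bridge with one known fingerprint, for demonstration only.
fn demo_bridge() -> CoreBridge {
    let mut by_fingerprint = HashMap::new();
    by_fingerprint.insert(
        "SHA256:abc".to_string(),
        storage_crate::AccountRow {
            account_id: "account-1".to_string(),
            scopes: vec!["relay:connect".to_string()],
        },
    );
    CoreBridge(storage_crate::StorageIdentityProvider { by_fingerprint })
}
```

The adapter is the only code that names both sides, which is one way to keep the two library crates free of a direct dependency on each other.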
### alknet-storage's Relationship to alknet-secret
alknet-storage does NOT depend on alknet-secret as a crate. Instead:
- alknet-storage and alknet-secret share the `EncryptedData` wire format (key
version, salt, IV, ciphertext). This is a type-level compatibility, not a
crate dependency.
- alknet-secret encrypts; alknet-storage stores the encrypted blob in a
`SecretNode` in the metagraph. The bridge is serialization.
## Consequences
- **Positive**: Core is lean. No database, no crypto, no petgraph. CLI users
get a small binary.
- **Positive**: Services are pluggable. alknet-secret and alknet-storage can be
swapped for alternative implementations.
- **Positive**: No circular dependencies. The dependency graph is a DAG.
- **Positive**: Deployment topology determines which crates to include. A CLI
tunnel uses only alknet-core. A head node uses everything.
- **Positive**: irpc is feature-gated in core. Minimal deployments don't pay for
service layer overhead.
- **Negative**: `IdentityProvider` trait interop between alknet-core and
alknet-storage requires careful versioning. If the trait signature changes,
both crates must update.
- **Negative**: `EncryptedData` wire format compatibility between alknet-secret
and alknet-storage is implicit (not enforced by the type system). A shared
types crate could be extracted if needed, but adds another crate dependency.
## References
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 2, dependency graph
- [research/core.md](../../research/core.md) — alknet-core contents
- [research/services.md](../../research/services.md) — Service protocols
- [research/storage.md](../../research/storage.md) — alknet-storage contents
- [research/flow.md](../../research/flow.md) — alknet-flowgraph contents
- [ADR-028](028-auth-irpc-service.md) — Auth as irpc service (service protocol enabled by decomposition)
- [ADR-029](029-identity-core-type.md) — Identity as core type (narrow interface point)


@@ -1,147 +0,0 @@
# ADR-028: Auth as irpc Service
## Status
Accepted
## Context
For head nodes serving many users, in-memory key lookup via `ArcSwap<DynamicConfig>`
doesn't scale. Loading all authorized keys into RAM and atomic-swapping the
entire set on each reload works for small deployments but requires holding every
key in memory. For production deployments with hundreds or thousands of users,
auth verification should query a database on demand rather than holding all keys
in memory.
The current `ArcSwap<DynamicConfig>` approach works for CLI and single-node
setups. What's needed is an async boundary that allows auth verification to go
through a service — locally via channels for minimal deployments, or via irpc
for production deployments where auth runs on a separate process or node.
The critical design point: callers go through the `IdentityProvider` trait
(ADR-029). The irpc service is one way to satisfy the trait. Both paths produce
the same result — an `Identity` or rejection. The trait is the contract; the
service is an implementation path.
## Decision
**Auth verification is provided via an irpc service protocol, with
`IdentityProvider` as the interface contract and `ConfigIdentityProvider`
(ArcSwap-backed) as the default implementation.**
### IdentityProvider Trait (ADR-029) — The Contract
Callers depend on `IdentityProvider`, not on any concrete implementation:
```rust
pub trait IdentityProvider: Send + Sync + 'static {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
    fn resolve_from_token(&self, token: &AuthToken) -> Option<Identity>;
}
```
### ConfigIdentityProvider — Default Implementation
Reads the `auth` section of `ArcSwap<DynamicConfig>`. No database needed. Every
authorized key gets a default scope set. This is the default for CLI and
single-node deployments.
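
A minimal sketch of such a provider follows. It is not the shipped implementation: the `AuthPolicy` field `authorized_fingerprints`, the constructor, and the `default_scopes` field are hypothetical, and `RwLock<Arc<_>>` from the standard library stands in for `ArcSwap` so the example has no external dependencies.

```rust
use std::collections::{HashMap, HashSet};
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, PartialEq)]
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
    pub resources: HashMap<String, Vec<String>>,
}

/// Hypothetical shape of the auth section; the real `AuthPolicy` also covers
/// certificate authorities and token config.
pub struct AuthPolicy {
    pub authorized_fingerprints: HashSet<String>,
}

pub struct DynamicConfig {
    pub auth: AuthPolicy,
}

pub struct ConfigIdentityProvider {
    // Assumption: RwLock<Arc<_>> stands in for ArcSwap<DynamicConfig> so the
    // sketch needs only std. Unlike ArcSwap, it is not lock-free.
    config: Arc<RwLock<Arc<DynamicConfig>>>,
    default_scopes: Vec<String>,
}

impl ConfigIdentityProvider {
    pub fn new(config: Arc<RwLock<Arc<DynamicConfig>>>, default_scopes: Vec<String>) -> Self {
        ConfigIdentityProvider { config, default_scopes }
    }

    pub fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity> {
        // Take a snapshot so the lock is released before any further work.
        let snapshot: Arc<DynamicConfig> = self.config.read().unwrap().clone();
        if snapshot.auth.authorized_fingerprints.contains(fingerprint) {
            Some(Identity {
                id: fingerprint.to_string(),
                scopes: self.default_scopes.clone(), // every key gets the default set
                resources: HashMap::new(),
            })
        } else {
            None
        }
    }
}
```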
### AuthProtocol irpc Service — Behind Feature Flag
```rust
#[rpc_requests(message = AuthMessage)]
#[derive(Debug, Serialize, Deserialize)]
enum AuthProtocol {
    #[rpc(tx=oneshot::Sender<AuthResult>)]
    #[wrap(VerifyPubkey)]
    VerifyPubkey { fingerprint: String, key_data: Vec<u8> },
    #[rpc(tx=oneshot::Sender<AuthResult>)]
    #[wrap(VerifyToken)]
    VerifyToken { token_bytes: Vec<u8>, timestamp: u64 },
    #[rpc(tx=oneshot::Sender<()>)]
    #[wrap(ReloadKeys)]
    ReloadKeys,
    #[rpc(tx=oneshot::Sender<bool>)]
    #[wrap(CheckAccess)]
    CheckAccess { identity: Identity, operation: String },
}

// Response type: it must be serializable to cross the irpc boundary.
#[derive(Debug, Serialize, Deserialize)]
enum AuthResult {
    Ok(Identity),
    Denied(String),
}
```
The `AuthProtocol` is behind the `irpc` feature flag in alknet-core. Nodes
that only do SSH tunneling don't need the service layer overhead. When the
feature is disabled, auth goes through `IdentityProvider` directly.
### AuthServiceImpl
Two implementations exist (the second is a future phase):
- **ConfigAuthService** — backed by `ConfigIdentityProvider` (ArcSwap path).
Wraps the trait in an irpc service for deployments that use the service layer
but don't have SQLite. This is the Phase 1 path: it ships with alknet-core.
- **StorageAuthService** — backed by SQLite `peer_credentials` and `api_keys`
tables (in alknet-storage, not yet built). Queries on demand. Can maintain an
LRU cache for hot fingerprints. This is a Phase 2+ implementation — the
contract is defined here so alknet-storage can implement it later.
Both produce the same `AuthResult` — an `Identity` or a denial. Callers don't
know or care which backend is running.
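
How a service backend maps the trait's `Option<Identity>` onto `AuthResult` can be sketched as follows. The irpc plumbing is deliberately omitted, and the generic `ConfigAuthService<P>` layout, the `verify_pubkey` method, the denial message, and the in-memory `StaticProvider` are assumptions made for illustration.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
    pub resources: HashMap<String, Vec<String>>,
}

pub trait IdentityProvider: Send + Sync + 'static {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
}

#[derive(Debug, PartialEq)]
pub enum AuthResult {
    Ok(Identity),
    Denied(String),
}

/// Hypothetical service wrapper: any provider in, `AuthResult` out.
pub struct ConfigAuthService<P: IdentityProvider> {
    provider: P,
}

impl<P: IdentityProvider> ConfigAuthService<P> {
    pub fn new(provider: P) -> Self {
        ConfigAuthService { provider }
    }

    /// Mirrors the `VerifyPubkey` request, minus the irpc transport.
    pub fn verify_pubkey(&self, fingerprint: &str) -> AuthResult {
        match self.provider.resolve_from_fingerprint(fingerprint) {
            Some(identity) => AuthResult::Ok(identity),
            None => AuthResult::Denied(format!("unknown key: {}", fingerprint)),
        }
    }
}

/// Hypothetical in-memory provider used only to exercise the service.
pub struct StaticProvider(pub HashMap<String, Identity>);

impl IdentityProvider for StaticProvider {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity> {
        self.0.get(fingerprint).cloned()
    }
}
```

Because the wrapper is generic over the provider, a storage-backed service could reuse the same mapping, which is one way to keep the two backends behaviorally aligned.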
### Integration with IdentityProvider
The irpc service and the trait compose. A caller goes through `IdentityProvider`,
which may internally delegate to the irpc service, or may satisfy the request
locally via `ConfigIdentityProvider`. The deployment topology determines the
path:
- **Minimal (CLI, single-node)**: `ConfigIdentityProvider` reads from
`ArcSwap<DynamicConfig>`. No irpc overhead.
- **Production with local auth**: `AuthServiceImpl` wraps
`StorageIdentityProvider` locally. The handler calls `IdentityProvider` which
routes to the local irpc service.
- **Distributed auth**: Handler on a worker node calls `IdentityProvider` which
routes to a remote auth irpc service over QUIC.
### ConfigService Integration
`AuthProtocol::ReloadKeys` triggers reload of the dynamic config's auth section.
For the `ConfigIdentityProvider` path, this is equivalent to
`ConfigReloadHandle::reload()`. For the `StorageIdentityProvider` path, this
refreshes the LRU cache. Both update atomically — ongoing connections are
unaffected, new connections pick up changes.
## Consequences
- **Positive**: Minimal deployments use `ArcSwap` without irpc overhead. No
database dependency for CLI users.
- **Positive**: Production deployments wire `StorageIdentityProvider` behind the
irpc service. Auth scales to thousands of users without loading all keys into
memory.
- **Positive**: The `IdentityProvider` trait is the only contract callers depend
on. This keeps alknet-core lean and testable.
- **Positive**: Feature flag (`irpc`) keeps core lean for deployments that don't
need the service layer.
- **Positive**: Both paths produce identical `Identity` results. Behavioral
parity is enforced by the shared `Identity` type.
- **Negative**: Two implementations must be kept in sync. `ConfigIdentityProvider`
and `StorageIdentityProvider` must produce the same `Identity` for the same
input. Integration tests should verify this.
- **Negative**: The `irpc` feature flag adds conditional compilation complexity.
The core must compile and work without it, and the service layer must work
with it enabled.
## References
- [research/services.md](../../research/services.md) — AuthService, AuthProtocol definition
- [auth.md](../auth.md) — IdentityProvider trait, Identity struct
- [research/configuration.md](../../research/configuration.md) — Auth service approach
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.4
- [ADR-029](029-identity-core-type.md) — Identity as core type
- [ADR-027](027-crate-decomposition.md) — Crate decomposition


@@ -1,107 +0,0 @@
# ADR-029: Identity as Core Type
## Status
Accepted
## Context
The `Identity` struct and `IdentityProvider` trait are needed by auth,
forwarding policy, and call protocol — three different subsystems in
alknet-core. Without placing them in core, these subsystems would each define
their own identity type, leading to duplication and conversion boilerplate.
The constraint: alknet-core must not depend on alknet-storage or any database.
The `IdentityProvider` trait must be in core so that the handler can resolve
identities without knowing whether the backing store is a config file or a
SQLite database. External crates provide implementations.
Earlier research defined `Identity` inconsistently: `{node_id, fingerprint,
scopes}` in services.md and `{id, scopes, resources}` in auth.md. The unified
model uses `{id, scopes, resources}` where `id` serves as both fingerprint (for
key-based auth from config) and account UUID (for database-backed auth).
## Decision
**`Identity` struct and `IdentityProvider` trait live in `alknet_core::auth`.**
### Identity Struct
```rust
pub struct Identity {
    pub id: String, // Fingerprint (config auth) or account UUID (database auth)
    pub scopes: Vec<String>, // e.g., ["relay:connect", "service:gitea:read"]
    pub resources: HashMap<String, Vec<String>>, // e.g., {"service": ["gitea", "registry"]}
}
```
The `id` field serves dual purpose: when using config-based authentication
(`ConfigIdentityProvider`), it holds the Ed25519 key fingerprint. When using
database-backed authentication (`StorageIdentityProvider`), it holds the account
UUID from the `accounts` table. This keeps the type simple while accommodating
both auth paths.
The `scopes` field provides authorization scope strings used by
`ForwardingPolicy` and `AccessControl` in `OperationSpec`. The `resources`
field provides resource-level authorization beyond what scopes offer (e.g., which
services this identity can access).
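
A short sketch of how consumers might query these two fields. The helper methods `has_scope` and `can_access` are hypothetical, and exact string matching is an assumption; the ADR does not specify wildcard or hierarchical scope semantics.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
    pub resources: HashMap<String, Vec<String>>,
}

impl Identity {
    /// True if the identity carries exactly this scope string.
    pub fn has_scope(&self, scope: &str) -> bool {
        self.scopes.iter().any(|s| s == scope)
    }

    /// True if `name` is listed under the given resource kind.
    pub fn can_access(&self, kind: &str, name: &str) -> bool {
        self.resources
            .get(kind)
            .map_or(false, |names| names.iter().any(|n| n == name))
    }
}

/// Builds the example identity used in the struct comments above.
pub fn example_identity() -> Identity {
    let mut resources = HashMap::new();
    resources.insert(
        "service".to_string(),
        vec!["gitea".to_string(), "registry".to_string()],
    );
    Identity {
        id: "SHA256:abc".to_string(),
        scopes: vec!["relay:connect".to_string(), "service:gitea:read".to_string()],
        resources,
    }
}
```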
### IdentityProvider Trait
```rust
pub trait IdentityProvider: Send + Sync + 'static {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
    fn resolve_from_token(&self, token: &AuthToken) -> Option<Identity>;
}
```
The trait is the contract. Callers (auth handler, forwarding policy, call
protocol) depend on `IdentityProvider` — not on any concrete implementation.
### Default and Production Implementations
- **`ConfigIdentityProvider`** (in alknet-core) — reads the `auth` section of
`ArcSwap<DynamicConfig>`. Every authorized key gets a default scope set.
No database needed. This is the default for minimal deployments.
- **`StorageIdentityProvider`** (in alknet-storage) — backed by SQLite
`peer_credentials` and `api_keys` tables plus the ACL graph. Resolves
fingerprint → account → organization membership → effective scopes. This is
the production implementation for head nodes.
alknet-core never depends on alknet-storage. The trait relationship is:
alknet-core *defines* the trait, alknet-storage *implements* it. The CLI or
NAPI assembly layer wires the concrete implementation.
### Why Not in alknet-storage?
If `Identity` lived in alknet-storage, alknet-core would need to depend on
alknet-storage to use the type — creating a circular dependency (since
alknet-storage implements alknet-core's `IdentityProvider` trait). Placing the
type and trait in core breaks the cycle.
## Consequences
- **Positive**: alknet-core has no database dependency. Auth, forwarding, and
call protocol all use the same `Identity` type.
- **Positive**: alknet-storage implements the core trait. The CLI/NAPI layer
wires the concrete implementation. Deployment topology determines which impl
to use.
- **Positive**: The `id` field serves dual purpose (fingerprint or UUID),
avoiding separate types for config-based and database-based auth.
- **Positive**: `ForwardingPolicy` and `AccessControl` can reference scopes from
`Identity` without knowing where they came from.
- **Negative**: Two implementations of `IdentityProvider` exist — `Config` and
`Storage`. Both must produce identical `Identity` results for the same input.
Tests should verify behavioral parity.
- **Negative**: The trait abstraction adds a level of indirection for the
minimal (config-only) deployment path. The cost is negligible — the
`ConfigIdentityProvider` is a simple `ArcSwap` dereference.
## References
- [auth.md](../auth.md) — IdentityProvider trait, Identity struct, unified auth
- [research/services.md](../../research/services.md) — AuthService, Identity section
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.2
- [ADR-023](023-unified-auth-shared-key-material.md) — Unified auth with shared key material
- [ADR-028](028-auth-irpc-service.md) — Auth as irpc service
- [OQ-18](../open-questions.md) — IdentityProvider owns scopes


@@ -1,159 +0,0 @@
# ADR-030: Static/Dynamic Configuration Split
## Status
Accepted
## Context
Alknet's configuration is loaded once at startup and never changes. This causes
three specific failures:
1. **No hot reload of authentication credentials.** Adding or removing an
authorized key requires restarting the server process. In head/worker
deployments where keys are managed via a database, the process must be
restarted every time a key is added, revoked, or rotated. This is
operationally unacceptable.
2. **No port forwarding access control.** Any authenticated client can open a
`direct-tcpip` channel to any destination. There is no policy governing
which hosts, ports, or alknet control channels a client may access. A
compromised key grants unrestricted network access through the tunnel.
3. **No structured configuration beyond CLI flags.** ADR-011 chose
programmatic-first configuration for the alpha — correct at the time. But as
alknet moves toward publishable releases, operators need config files for
reproducible deployments, and the NAPI layer needs programmatic reload
capability that `ServeOptions` doesn't currently support.
Not all configuration should be reloadable. Transport-level settings (listen
address, TLS certificates, host key) require socket/TLS renegotiation to change
at runtime — effectively a restart. Auth and forwarding policy can change
atomically without disrupting existing connections.
## Decision
**Split configuration into `StaticConfig` and `DynamicConfig`.**
### StaticConfig
Immutable after startup. Constructed from `ServeOptions` (the builder pattern is
preserved). Contains everything that affects socket binding, TLS handshakes, or
SSH session negotiation:
- Transport mode, listen address
- TLS config (cert, key)
- iroh config (relay URL)
- Stealth mode flag
- Host key, host key algorithm
- Max auth attempts, max connections per IP
- Proxy config
Changing any of these requires a restart.
### DynamicConfig
Hot-reloadable at runtime via `ArcSwap<DynamicConfig>`. Contains everything
checked per-connection or per-channel:
- `AuthPolicy` — authorized keys, certificate authorities, token config
- `ForwardingPolicy` — allow/deny rules for channel targets (ADR-031)
- `RateLimitConfig` — rate limiting parameters
`ArcSwap` provides lock-free reads on the hot path: every `auth_publickey()` and
every `channel_open_direct_tcpip()` call loads the current pointer and
dereferences an `Arc`, which adds negligible overhead compared to the current
approach. Writes are atomic: `store()` swaps the pointer. Existing connections
finish with their current config; new connections get the new config.
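
The snapshot semantics can be illustrated with a std-only stand-in. This is a sketch, not the real mechanism: `ConfigCell` is a hypothetical type that uses `RwLock<Arc<_>>` in place of `ArcSwap` (so it is not lock-free), and the single `authorized_keys` field is a simplification of `DynamicConfig`.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug, Clone, PartialEq)]
pub struct DynamicConfig {
    pub authorized_keys: Vec<String>,
}

/// Assumption: a std-only stand-in for `ArcSwap<DynamicConfig>`.
pub struct ConfigCell {
    inner: RwLock<Arc<DynamicConfig>>,
}

impl ConfigCell {
    pub fn new(initial: DynamicConfig) -> Self {
        ConfigCell { inner: RwLock::new(Arc::new(initial)) }
    }

    /// Returns a snapshot; the caller keeps it alive for as long as it needs.
    pub fn load(&self) -> Arc<DynamicConfig> {
        self.inner.read().unwrap().clone()
    }

    /// Replaces the current config; earlier snapshots are unaffected.
    pub fn store(&self, next: DynamicConfig) {
        *self.inner.write().unwrap() = Arc::new(next);
    }
}
```

A connection that called `load()` before a `store()` keeps reading its old snapshot, while the next `load()` observes the new config.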
### ConfigReloadHandle
```rust
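use std::sync::Arc;
use arc_swap::ArcSwap;

pub struct ConfigReloadHandle {
    dynamic: Arc<ArcSwap<DynamicConfig>>,
}

impl ConfigReloadHandle {
    pub fn reload(&self, new_config: DynamicConfig) {
        self.dynamic.store(Arc::new(new_config)); // atomic pointer swap
    }
}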
```
The handle is obtained from `Server::run()` and passed to NAPI or the CLI.
### ConfigService
The `ConfigService` wraps `ArcSwap<DynamicConfig>` reloads behind an irpc
protocol (behind the `irpc` feature flag) for production deployments that use
the service layer. For minimal deployments (CLI, single-node), direct
`ConfigReloadHandle::reload()` is sufficient.
### TOML Config File
An optional TOML config file covers static config plus initial auth/forwarding
paths. This **amends** ADR-011 (does not supersede it) — the programmatic-first
API remains primary. The config file is a convenience input format:
```toml
[server]
transport = "tls"
listen = "0.0.0.0:443"
stealth = false
max_connections_per_ip = 5
max_auth_attempts = 3
[server.tls]
cert = "/etc/alknet/tls/cert.pem"
key = "/etc/alknet/tls/key.pem"
[auth]
host_key = "/etc/alknet/ssh/host_key"
[forwarding]
default = "deny"
```
### NAPI Reload API
```typescript
interface AlknetServer {
  reloadAuth(auth: { authorizedKeys?: Buffer, certAuthority?: Buffer }): void;
  reloadForwarding(policy: ForwardingPolicyConfig): void;
  reloadAll(config: DynamicConfig): void;
}
```
The NAPI layer parses key data and constructs a new `DynamicConfig`, then calls
`ConfigReloadHandle::reload()`.
### Client Configuration
Client configuration stays as `ConnectOptions` — no `ArcSwap` needed. Client
config is almost entirely static (which server to connect to, which key to use).
## Consequences
- **Positive**: Auth credentials and forwarding policy can be reloaded without
restarting the server. Adding a key via `reloadAuth()` takes effect on the
next connection attempt.
- **Positive**: ADR-011's programmatic-first intent is preserved. The TOML
config file is an optional convenience layer, not a replacement for
`ServeOptions`.
- **Positive**: `ArcSwap` provides lock-free reads on the hot path. Every auth
check and every channel open costs one pointer load and an `Arc` dereference.
- **Positive**: The `ConfigService` irpc protocol (behind feature flag) allows
production deployments to integrate config reload into their service mesh
without taking a direct dependency on `DynamicConfig` internals.
- **Positive**: Forwarding policy is now part of `DynamicConfig` — operators can
restrict access per identity, per destination, per transport (ADR-031).
- **Negative**: Two config structs where there was one. The split is clean
(transport vs. policy) but adds surface area.
- **Negative**: Config file introduces `toml` as a dependency in the CLI crate.
This is acceptable for a CLI binary.
## References
- [research/configuration.md](../../research/configuration.md) — Full analysis
- [ADR-011](011-no-ssh-config-programmatic-api.md) — Programmatic-first API (amended, not superseded)
- [ADR-031](031-forwarding-policy.md) — Forwarding policy (part of DynamicConfig)
- [ADR-029](029-identity-core-type.md) — Identity as core type (DynamicConfig.auth uses IdentityProvider)
- [integration-plan.md](../../research/integration-plan.md) — Phase 1.1


@@ -1,138 +0,0 @@
# ADR-031: Forwarding Policy
## Status
Accepted
## Context
Currently, any authenticated client can open a `direct-tcpip` SSH channel to
any destination. The only gate is authentication — once authenticated, a client
has unrestricted network access through the tunnel. This is a security gap: a
compromised key grants unrestricted access.
Operators need the ability to:
- Restrict which hosts and ports authenticated clients can access
- Apply different rules to different principals (key fingerprints, accounts)
- Restrict WebTransport clients to alknet control channels only
- Set a default policy (allow-all for migration compatibility, deny-all for
production)
## Decision
**Add `ForwardingPolicy` as part of `DynamicConfig` (reloadable without
restart).**
### Type Definitions
```rust
use std::ops::Range;

use ipnetwork::IpNetwork; // assumed source of `IpNetwork` (the `ipnetwork` crate)

pub struct ForwardingPolicy {
    pub default: ForwardingAction,
    pub rules: Vec<ForwardingRule>,
}

pub struct ForwardingRule {
    pub target: TargetPattern,
    pub action: ForwardingAction,
    pub principals: Vec<String>, // Empty = matches all
    pub transports: Vec<TransportKind>, // Empty = matches all
}

pub enum ForwardingAction {
    Allow,
    Deny,
}

pub enum TargetPattern {
    Any,
    Host(String), // "localhost", "*.example.com"
    Cidr(IpNetwork), // "10.0.0.0/8"
    PortRange(String, Range<u16>), // "localhost", ports 8080-8090
    AlknetPrefix, // Matches alknet-* control channels
}
```
### Rule Evaluation
Rules are evaluated in order. First match wins. If no rule matches, the default
applies. This supports both allowlist and blocklist semantics:
- **Allowlist**: `default: Deny`, then explicit Allow rules for permitted
destinations.
- **Blocklist**: `default: Allow`, then explicit Deny rules for blocked
destinations.
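
First-match-wins evaluation can be sketched as below. This is a reduced illustration, not the real policy engine: principals, transports, `Cidr`, and `AlknetPrefix` are left out, `Host` is an exact match, and the `matches` and `evaluate` method names are assumptions.

```rust
use std::ops::Range;

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ForwardingAction {
    Allow,
    Deny,
}

/// Reduced set of patterns; `Host` is an exact match in this sketch.
pub enum TargetPattern {
    Any,
    Host(String),
    PortRange(String, Range<u16>),
}

impl TargetPattern {
    fn matches(&self, host: &str, port: u16) -> bool {
        match self {
            TargetPattern::Any => true,
            TargetPattern::Host(h) => h.as_str() == host,
            // `Range` is half-open: 8080..8091 covers ports 8080 through 8090.
            TargetPattern::PortRange(h, ports) => h.as_str() == host && ports.contains(&port),
        }
    }
}

pub struct ForwardingRule {
    pub target: TargetPattern,
    pub action: ForwardingAction,
}

pub struct ForwardingPolicy {
    pub default: ForwardingAction,
    pub rules: Vec<ForwardingRule>,
}

impl ForwardingPolicy {
    /// Rules are scanned in order; the first match decides, else the default.
    pub fn evaluate(&self, host: &str, port: u16) -> ForwardingAction {
        self.rules
            .iter()
            .find(|rule| rule.target.matches(host, port))
            .map(|rule| rule.action)
            .unwrap_or(self.default)
    }
}
```

Because order decides the outcome, a broad `Any` rule placed before a narrower rule shadows it, so specific rules belong first.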
### Principals
Each rule can specify which principals it applies to. A principal is an
`Identity.id` (fingerprint or UUID) or a scope from `Identity.scopes`. When the
rule's `principals` field is empty, it matches all identities.
This connects to the `IdentityProvider` trait (ADR-029): when a client
authenticates, the `Identity` is resolved, and the forwarding policy checks
rules against `Identity.id` and `Identity.scopes`.
### TransportKind-Aware Rules
Each rule can specify which `TransportKind` it applies to. This enables
transport-specific restrictions — for example, WebTransport clients can be
restricted to `alknet-*` control channels only:
```rust
ForwardingRule {
    target: TargetPattern::AlknetPrefix,
    action: ForwardingAction::Allow,
    principals: vec![],
    transports: vec![TransportKind::WebTransport { host: "*".into() }],
}
```
### Where the Policy Check Happens
The forwarding policy check occurs in `channel_open_direct_tcpip` before the
proxy task is spawned. The current behavior (no check) is equivalent to
`ForwardingPolicy::allow_all()` — default Allow, no rules. This preserves
backward compatibility during migration.
### DynamicConfig Integration
`ForwardingPolicy` is part of `DynamicConfig` and reloadable via
`ConfigReloadHandle::reload()` or NAPI's `reloadForwarding()`. Changes take
effect on the next channel open — existing connections continue with their
current policy.
### OQ Resolutions
- **OQ-12** (Per-user forwarding scope vs global rules): Resolved. Start with
global rules plus principal matching against `Identity.scopes`. Per-user scopes
come from `peer_credentials.metadata.scopes` via `IdentityProvider`.
- **OQ-16** (Transport-specific forwarding): Resolved. Add `TransportKind`
match in `ForwardingRule`. WebTransport clients can be restricted.
- **OQ-18** (Source of Identity.scopes): Resolved by ADR-029 and this ADR.
`IdentityProvider` owns scopes. `ForwardingPolicy` consumes them.
## Consequences
- **Positive**: Operators can restrict access per identity, per destination, per
transport. A compromised key no longer grants unrestricted network access.
- **Positive**: Default-allow preserves current behavior during migration. Switch
to default-deny for production deployments.
- **Positive**: Policy is reloadable without restart. Adding a rule via
`reloadForwarding()` takes effect on the next channel open.
- **Positive**: `TransportKind`-aware rules enable transport-specific
restrictions (e.g., WebTransport clients restricted to alknet-* channels).
- **Negative**: Another check in the hot path (every `channel_open_direct_tcpip`
call). The cost is a linear scan of rules — acceptable for small rule sets.
Large rule sets should use compiled matchers (future optimization).
- **Negative**: `TargetPattern` string matching is lenient. Host patterns like
`*.example.com` require careful implementation to prevent bypasses. The
`glob` or `globset` crate can handle this correctly.
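
To make the bypass risk concrete, here is a hand-written suffix check for `*.example.com`-style patterns. It is a deliberately small substitute for a glob crate, not a recommendation over one. The function name `host_matches` is hypothetical, and the choices that the bare domain does not match and that any subdomain depth does are assumptions.

```rust
/// Matches `host` against an exact name or a leading-wildcard pattern.
/// Comparison is ASCII case-insensitive and ignores a trailing dot on `host`.
pub fn host_matches(pattern: &str, host: &str) -> bool {
    let pattern = pattern.to_ascii_lowercase();
    let host = host.trim_end_matches('.').to_ascii_lowercase();
    match pattern.strip_prefix("*.") {
        Some(suffix) => {
            // Require at least one character, then a literal '.', then the
            // suffix. A plain `ends_with` would accept "evilexample.com".
            host.len() > suffix.len() + 1
                && host.ends_with(suffix)
                && host.as_bytes()[host.len() - suffix.len() - 1] == b'.'
        }
        None => host == pattern,
    }
}
```

The boundary check on the byte before the suffix is what separates `api.example.com` from `evilexample.com`.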
## References
- [research/configuration.md](../../research/configuration.md) — ForwardingPolicy section
- [auth.md](../auth.md) — Identity.scopes and IdentityProvider
- [open-questions.md](../open-questions.md) — OQ-12, OQ-16, OQ-18
- [ADR-029](029-identity-core-type.md) — Identity as core type
- [ADR-030](030-static-dynamic-config-split.md) — DynamicConfig (ForwardingPolicy is part of it)
- [integration-plan.md](../../research/integration-plan.md) — Phase 1.3


@@ -1,96 +0,0 @@
# ADR-032: Event Boundary Discipline
## Status
Accepted
## Context
The research identified three distinct communication patterns in the system, and
conflating them is a known anti-pattern in event-driven architectures:
1. **Domain events** (Honker streams) — Internal to the service that owns that
data. Used for state reconstruction within the service's own boundaries.
Examples: `nodes:created`, `edges:deleted`, `accounts:updated`.
2. **irpc service calls** — Synchronous request-response within a node or
cluster. Internal to the system. Examples: `AuthProtocol::VerifyPubkey`,
`SecretProtocol::DeriveEd25519`, `ConfigProtocol::ReloadForwarding`.
3. **Call protocol events** (`EventEnvelope`) — Asynchronous integration events
that cross node boundaries. External to the system. Examples:
`call.requested`, `call.responded`, `call.completed`, `call.aborted`.
Without a hard constraint, it's tempting to have one service subscribe directly
to another service's Honker streams. This leads to:
- **Leaky event store**: Service A reads Service B's domain events directly,
coupling A to B's internal state representation. When B changes its schema, A
breaks.
- **Boomerang coupling**: An integration event is too thin, causing the
consumer to call back to the source service synchronously to get details. This
negates the benefit of async communication.
- **Fat notification trap**: A notification event carries full entity state,
when it should use state transfer instead.
## Decision
**Event boundary discipline is a hard architectural constraint, not a
suggestion.**
1. **Domain events stay within the owning service.** A Honker stream published
by the storage service (`nodes:created`) is for the storage service's own
state reconstruction. No other service reads these stream events directly.
2. **irpc service calls are synchronous and internal.** They never cross node
boundaries. They are request-response, not events. They should not be used
as a substitute for integration events.
3. **Call protocol events are the only events that cross node boundaries.**
`EventEnvelope` frames are the integration boundary. When a domain event
needs to be communicated to another node, it must be projected into a call
protocol event.
4. **Projection from domain events to integration events is required when
crossing boundaries.** A service that owns a Honker stream must project
relevant state changes into `EventEnvelope` frames before they leave the
node. The projection strips internal details and produces a versioned,
stable integration event.
This discipline applies at three levels:
```
Call Protocol (Layer 3, external, JSON)
  └── irpc Service (Layer 3, internal, postcard)
        └── Honker Streams (Domain events, within service boundary)
```
A call protocol handler MAY call an irpc service internally (e.g.,
`/head/auth/verify` calls `AuthProtocol::VerifyPubkey`). The irpc service MAY
use Honker streams for its own state management. But domain events never
propagate beyond the service boundary without projection.
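
The projection step can be sketched as a plain function from a domain event to an integration event. All shapes here are hypothetical: the fields of `NodeCreated`, the fields of `EventEnvelope`, and the event type string `storage.node.created` are invented for illustration and are not taken from the call protocol specification.

```rust
use std::collections::BTreeMap;

/// Hypothetical domain event as the storage service might record it internally.
#[derive(Debug, Clone)]
pub struct NodeCreated {
    pub node_id: String,
    pub kind: String,
    pub row_id: i64,              // internal detail: database row id
    pub internal_schema_rev: u32, // internal detail: storage schema revision
}

/// Hypothetical shape of the integration event that leaves the node.
#[derive(Debug, Clone, PartialEq)]
pub struct EventEnvelope {
    pub event_type: String,
    pub version: u32,
    pub payload: BTreeMap<String, String>,
}

/// Projects a domain event into a versioned integration event,
/// copying only the fields that belong to the public contract.
pub fn project(event: &NodeCreated) -> EventEnvelope {
    let mut payload = BTreeMap::new();
    payload.insert("node_id".to_string(), event.node_id.clone());
    payload.insert("kind".to_string(), event.kind.clone());
    // `row_id` and `internal_schema_rev` are intentionally not copied.
    EventEnvelope {
        event_type: "storage.node.created".to_string(),
        version: 1,
        payload,
    }
}
```

Since the envelope is built field by field, a change to the internal event only affects consumers if the projection is edited on purpose.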
## Consequences
- **Positive**: Prevents leaky event stores. Services are independently
deployable and their internal schemas can evolve without breaking consumers.
- **Positive**: Honker and irpc are implementation details, not cross-boundary
contracts. The call protocol's `EventEnvelope` is the only stable, versioned
contract that other nodes depend on.
- **Positive**: Clear ownership. Each service owns its Honker streams and can
change them freely. Integration events are a deliberate, reviewed contract.
- **Positive**: Makes testing easier. Services can be tested in isolation with
mock domain events. Integration events are tested against the `EventEnvelope`
schema.
- **Negative**: Projection code is required. Every domain event that needs to
cross a boundary must be explicitly projected. This is deliberate — the
overhead ensures the integration contract is intentional.
- **Negative**: Developers must resist the temptation to subscribe directly to
Honker streams across services. Code review should catch this pattern.
## References
- [research/services.md](../../research/services.md) — Event boundary discipline section
- [research/storage.md](../../research/storage.md) — Honker integration, event boundaries
- [research/integration-plan.md](../../research/integration-plan.md) — ADR 032 entry
- [event_source_types.md](../../research/event-sourcing/event_source_types.md) — Event-driven architecture patterns


@@ -1,132 +0,0 @@
# ADR-033: OperationEnv as Universal Composition Mechanism
## Status
Accepted
## Context
The `@alkdev/operations` TypeScript package defines `OperationEnv` as a
universal composition mechanism. A handler receives `context.env[namespace][op](input)`
and can invoke any registered operation regardless of whether it runs locally, in
an irpc service on the same cluster, or on a remote node via call protocol.
The research documents define three dispatch paths:
1. **Local dispatch** — direct function call through the operation registry
2. **Service dispatch** — irpc protocol call to a service backend
3. **Remote dispatch** — call protocol `EventEnvelope` to a remote node
Without a formal decision, irpc services could be seen as a replacement for
OperationEnv or for the call protocol. They are not — irpc is one dispatch
backend for OperationEnv, not a replacement for anything. The call protocol is
another dispatch backend. OperationEnv unifies them from the handler's
perspective.
The three communication patterns in the system (ADR-032) are:
- Domain events (Honker streams) — internal to the owning service
- irpc service calls — synchronous, in-cluster
- Call protocol events — asynchronous, cross-node
irpc services and call protocol operations serve different scopes but must
compose cleanly through OperationEnv.
## Decision
**OperationEnv is the universal composition mechanism that all operation
handlers receive. It provides namespace + operation name → invoke with input,
return output, regardless of dispatch path.**
### OperationEnv Behavioral Contract
```rust
// The behavioral contract: given a namespace and operation name, invoke the
// operation with the given input and return the output. The handler neither
// knows nor cares whether the dispatch is local, via irpc, or via call protocol.
pub trait OperationEnv: Send + Sync {
    fn invoke(&self, namespace: &str, operation: &str, input: Value) -> ResponseEnvelope;
}
```
The Rust implementation may use typed method dispatch or a registry behind the
scenes, but the handler-facing API must preserve this contract.
### Three Dispatch Paths
OperationEnv resolves each call to one of three dispatch backends:
| Path | Mechanism | Serialization | Scope |
|------|-----------|---------------|-------|
| Local | Direct function call through registry | None (in-process) | Same process |
| Service | irpc protocol enum dispatch | postcard (binary) | Same cluster |
| Remote | Call protocol `EventEnvelope` | JSON | Cross-node |
All three produce the same `ResponseEnvelope`. The handler always calls
`context.env.invoke("secrets", "derive", input)` and gets a `ResponseEnvelope`
back.
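
The routing idea can be sketched with closures standing in for the three backends. This is an illustration under stated assumptions: `RoutedEnv`, `Backend`, `Dispatch`, and `echo_backend` are hypothetical names, the input is a `&str` rather than a JSON `Value` to keep the example std-only, and `ResponseEnvelope` is reduced to two fields.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct ResponseEnvelope {
    pub ok: bool,
    pub body: String,
}

pub trait OperationEnv: Send + Sync {
    fn invoke(&self, namespace: &str, operation: &str, input: &str) -> ResponseEnvelope;
}

/// A backend is anything that maps (operation, input) to a response.
pub type Dispatch = Box<dyn Fn(&str, &str) -> ResponseEnvelope + Send + Sync>;

/// The variant records which path a namespace uses; callers never see it.
pub enum Backend {
    Local(Dispatch),   // direct function call
    Service(Dispatch), // would wrap an irpc client
    Remote(Dispatch),  // would wrap a call protocol connection
}

pub struct RoutedEnv {
    routes: HashMap<String, Backend>,
}

impl RoutedEnv {
    pub fn new() -> Self {
        RoutedEnv { routes: HashMap::new() }
    }

    pub fn route(mut self, namespace: &str, backend: Backend) -> Self {
        self.routes.insert(namespace.to_string(), backend);
        self
    }
}

impl OperationEnv for RoutedEnv {
    fn invoke(&self, namespace: &str, operation: &str, input: &str) -> ResponseEnvelope {
        match self.routes.get(namespace) {
            Some(Backend::Local(call))
            | Some(Backend::Service(call))
            | Some(Backend::Remote(call)) => call(operation, input),
            None => ResponseEnvelope {
                ok: false,
                body: format!("unknown namespace: {}", namespace),
            },
        }
    }
}

/// Demo backend that echoes its label, the operation, and the input.
pub fn echo_backend(label: &'static str) -> Dispatch {
    Box::new(move |operation: &str, input: &str| ResponseEnvelope {
        ok: true,
        body: format!("{}:{}:{}", label, operation, input),
    })
}
```

The handler-facing call is identical for all three variants, which is the property the table above describes.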
### Service Assembly
The deployment topology determines which dispatch path each operation uses:
```rust
// Minimal deployment (single node, all local)
let env = OperationEnv::local(local_registry);

// Production deployment (mix of local and remote)
let env = OperationEnv::new()
    .local("auth", auth_registry)            // Auth runs locally
    .local("config", config_registry)        // Config runs locally
    .service("secrets", secret_irpc_client)  // Secret service via irpc
    .remote("worker-1", call_protocol_conn); // Worker-1 operations via call protocol
```
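One way to picture the assembly step is a routing table that records, per namespace, which of the three dispatch paths applies. The sketch below is illustrative only: `Topology`, `DispatchPath`, and `resolve` are hypothetical names, and the real builder would also carry the registry or client for each route.

```rust
use std::collections::HashMap;

/// The three dispatch paths from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DispatchPath {
    Local,
    Service,
    Remote,
}

/// Hypothetical routing table: namespace -> dispatch path.
#[derive(Debug, Default)]
pub struct Topology {
    routes: HashMap<String, DispatchPath>,
}

impl Topology {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn local(self, namespace: &str) -> Self {
        self.route(namespace, DispatchPath::Local)
    }

    pub fn service(self, namespace: &str) -> Self {
        self.route(namespace, DispatchPath::Service)
    }

    pub fn remote(self, namespace: &str) -> Self {
        self.route(namespace, DispatchPath::Remote)
    }

    // Shared builder step: record the route and hand the table back.
    fn route(mut self, namespace: &str, path: DispatchPath) -> Self {
        self.routes.insert(namespace.to_owned(), path);
        self
    }

    /// Returns the configured path, or `None` for an unknown namespace.
    pub fn resolve(&self, namespace: &str) -> Option<DispatchPath> {
        self.routes.get(namespace).copied()
    }
}
```

Because the table is plain data, moving a namespace from `Local` to `Service` is a deployment change and leaves handler code untouched, which is the property this section relies on.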
### irpc Services Are One Dispatch Backend
irpc services (`AuthProtocol`, `SecretProtocol`, `ConfigProtocol`) define the
wire format for in-cluster communication. They are Rust-to-Rust, type-safe,
and efficient. But they are not a replacement for OperationEnv or for the call
protocol. They are one dispatch backend.
An irpc service can be exposed as a call protocol operation:
`/head/auth/verify` receives a call protocol event and internally calls
`AuthProtocol::VerifyPubkey` via irpc. The layers compose:
```
Call Protocol (Layer 3, external, JSON)
  └── irpc Service (Layer 3, internal, postcard)
        └── Honker Streams (Domain events, within service boundary)
```
### Adapters Map to OperationEnv
HTTP (`POST /v1/{namespace}/{op}`), MCP (`tools/call`), DNS
(`{op}.{namespace}.alk.dev TXT?`), and call protocol
(`/call.requested`) all resolve through OperationEnv. This is what makes
operations universally composable across all interfaces.
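To illustrate how two of these adapters could reduce their own addressing to the same `(namespace, operation)` pair, here is a small sketch. The function names `from_http_path` and `from_dns_name` are hypothetical, and the validation rules (single-segment operation, single-label namespace) are assumptions rather than documented behavior.

```rust
/// Parses the HTTP adapter's path shape, `/v1/{namespace}/{op}`,
/// into `(namespace, op)`.
pub fn from_http_path(path: &str) -> Option<(&str, &str)> {
    let rest = path.strip_prefix("/v1/")?;
    let (namespace, op) = rest.split_once('/')?;
    // Assumption: the operation is a single path segment.
    if namespace.is_empty() || op.is_empty() || op.contains('/') {
        return None;
    }
    Some((namespace, op))
}

/// Parses the DNS adapter's query-name shape, `{op}.{namespace}.alk.dev`,
/// into `(namespace, op)`. A trailing root dot is tolerated.
pub fn from_dns_name(name: &str) -> Option<(&str, &str)> {
    let labels = name.trim_end_matches('.').strip_suffix(".alk.dev")?;
    let (op, namespace) = labels.split_once('.')?;
    // Assumption: the namespace is a single DNS label.
    if op.is_empty() || namespace.is_empty() || namespace.contains('.') {
        return None;
    }
    Some((namespace, op))
}
```

Both functions return the pair in the same order, so the caller can pass it straight to `invoke(namespace, op, input)` regardless of which interface the request arrived on.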
## Consequences
- **Positive**: Handlers compose through a single interface. Adding a new
dispatch path (e.g., a new irpc service) doesn't change handler code.
- **Positive**: irpc and call protocol coexist naturally. The handler doesn't
know which path was taken.
- **Positive**: Adapters (MCP, HTTP, DNS) map to operations through the same
OperationEnv interface. One handler, multiple dispatch paths.
- **Positive**: Deployment topology determines dispatch, not code. Same handler
works locally, in-cluster, or cross-node.
- **Negative**: OperationEnv is a new abstraction that must coexist with the
existing call protocol handler pattern. The registry currently maps paths to
handlers; OperationEnv adds namespace-aware composition on top.
- **Negative**: The `@alkdev/operations` TypeScript `HashMap<String,
HashMap<String, fn>>` model needs idiomatic Rust translation. The behavioral
contract must match, but the implementation can differ.
## References
- [research/services.md](../../research/services.md) — OperationContext, OperationEnv
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.5, OperationEnv wiring
- [ADR-026](026-transport-interface-separation.md) — Three-layer model (OperationEnv is Layer 3)
- [ADR-028](028-auth-irpc-service.md) — Auth as irpc service (one dispatch backend)
- [ADR-032](032-event-boundary-discipline.md) — Event boundary discipline
- [ADR-024](024-bidirectional-call-protocol.md) — Bidirectional call protocol
- [ADR-025](025-handler-spec-separation.md) — Handler/spec separation

@@ -1,55 +0,0 @@
# ADR-034: Head/Worker Terminology
## Status
Accepted
## Context
The project previously used hub/spoke terminology for describing node
relationships: a hub node that coordinates connections and spokes that connect to
it. This terminology implies a strict star topology where the hub is
fundamentally different from spokes.
In practice, a coordinating node can also execute operations (run services,
forward traffic). Any node can become a coordinator. The architecture supports
mesh topologies where nodes coordinate in a peer-to-peer fashion.
The research documents (`core.md`, `services.md`) and updated architecture
specs (`call-protocol.md`, `auth.md`, `napi-and-pubsub.md`, `open-questions.md`)
already use head/worker consistently. Existing ADRs (024, 025) retain their
original hub/spoke language because ADRs are historical records.
## Decision
**Use head/worker terminology throughout the project.**
- **Head node**: A node that coordinates — accepts connections, routes
operations, manages cluster state. A head is also a worker (it can execute
operations).
- **Worker node**: A node that connects to a head, registers its services, and
executes operations. Any worker can become a head.
- **Node**: Any participant in the network. Every node has an Ed25519 identity.
The terms hub and spoke are deprecated in all new specs, code, and
documentation. Existing ADRs retain their original language as historical
records — ADRs document what was decided at the time, not what the current
terminology is.
## Consequences
- **Positive**: Natural mesh formation. A head that is also a worker enables
multi-hop routing, redundancy, and distributed topologies without a
centralized authority.
- **Positive**: Consistency with integration plan and research documents.
- **Positive**: The terminology better reflects the architecture — there is no
single "hub" that's fundamentally different from "spokes."
- **Neutral**: Existing ADRs (024, 025) retain hub/spoke in their text. This is
intentional — ADRs are historical records.
## References
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 0 ADR 034 entry, inconsistencies section
- [ADR-024](024-bidirectional-call-protocol.md) — Uses hub/spoke historically
- [ADR-025](025-handler-spec-separation.md) — Uses hub/spoke historically
- [research/core.md](../../research/core.md) — Head/worker terminology

@@ -1,65 +0,0 @@
# ADR-035: StreamInterface and MessageInterface Split
## Status
Accepted
## Context
The `Interface` trait (ADR-026) assumes a persistent byte stream from a `Transport`. It produces a `Session` that yields `InterfaceEvent` frames. This works for SSH and raw framing — both run over duplex streams.
However, HTTP and DNS do not fit this model. They handle individual request/response pairs, not persistent sessions. HTTP runs over a TLS connection after byte-peek protocol detection (extending the existing stealth mode pattern). DNS runs its own server on port 53. Both are stateless per-request, not session-oriented.
The three-layer model (Transport, Interface, Protocol) remains correct. The issue is that Layer 2 has two distinct patterns: stream-based (SSH, raw framing) where the transport provides a continuous byte stream, and message-based (HTTP, DNS) where the interface manages its own transport and handles discrete requests.
## Decision
Split the `Interface` trait into two independent traits:
1. **`StreamInterface`** — consumes a `TransportStream`, produces a long-lived `Session` that yields `InterfaceEvent` frames. Existing `SshInterface` and `RawFramingInterface` become `StreamInterface` implementations.
2. **`MessageInterface`** — handles individual `InterfaceRequest` → `InterfaceResponse` pairs. Manages its own transport (HTTP server, DNS server). `HttpInterface` and `DnsInterface` are `MessageInterface` implementations.
The traits are independent. They have different signatures (`accept(stream)` vs `handle_request(req)`), different lifecycles (long-lived session vs stateless per-request), and different transport ownership (provided by caller vs self-managed).
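The difference in shape between the two traits can be sketched as follows. This is a simplified, synchronous illustration: the real traits are presumably async, `TransportStream` is reduced here to a boxed reader, and `RawFraming`, `RawSession`, `EchoMessages`, the one-byte length prefix, and the struct fields are all assumptions made for the example.

```rust
use std::io::Read;

/// Simplification: the real `TransportStream` is a duplex async stream.
pub type TransportStream = Box<dyn Read + Send>;

#[derive(Debug, PartialEq)]
pub enum InterfaceEvent {
    Frame(Vec<u8>),
    Closed,
}

pub struct InterfaceRequest {
    pub operation: String,
    pub body: Vec<u8>,
}

#[derive(Debug, PartialEq)]
pub struct InterfaceResponse {
    pub status: u16,
    pub body: Vec<u8>,
}

/// Long-lived: the caller provides the stream, the interface yields events.
pub trait StreamInterface {
    type Session: Iterator<Item = InterfaceEvent>;
    fn accept(&self, stream: TransportStream) -> Self::Session;
}

/// Stateless: one request in, one response out.
pub trait MessageInterface {
    fn handle_request(&self, req: InterfaceRequest) -> InterfaceResponse;
}

/// Hypothetical framing: each frame is a one-byte length followed by the payload.
pub struct RawFraming;

pub struct RawSession {
    stream: TransportStream,
    closed: bool,
}

impl StreamInterface for RawFraming {
    type Session = RawSession;
    fn accept(&self, stream: TransportStream) -> RawSession {
        RawSession { stream, closed: false }
    }
}

impl Iterator for RawSession {
    type Item = InterfaceEvent;
    fn next(&mut self) -> Option<InterfaceEvent> {
        if self.closed {
            return None;
        }
        let mut len = [0u8; 1];
        if self.stream.read_exact(&mut len).is_err() {
            // End of stream (or an I/O error) ends the session.
            self.closed = true;
            return Some(InterfaceEvent::Closed);
        }
        let mut frame = vec![0u8; usize::from(len[0])];
        match self.stream.read_exact(&mut frame) {
            Ok(()) => Some(InterfaceEvent::Frame(frame)),
            Err(_) => {
                self.closed = true;
                Some(InterfaceEvent::Closed)
            }
        }
    }
}

/// Hypothetical message interface that echoes the request body.
pub struct EchoMessages;

impl MessageInterface for EchoMessages {
    fn handle_request(&self, req: InterfaceRequest) -> InterfaceResponse {
        InterfaceResponse { status: 200, body: req.body }
    }
}
```

The stream side owns state across many events, while the message side has no session at all, which is why a shared super-trait would have little to abstract over.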
`ListenerConfig` gains variants for both:
```rust
pub enum ListenerConfig {
    Stream {
        transport: TransportKind,
        interface: StreamInterfaceKind,
    },
    Http {
        bind_addr: SocketAddr,
        tls: bool,
        stealth: bool,
    },
    Dns {
        bind_addr: SocketAddr,
        tls: bool,
    },
}
```
`TransportKind::Dns` is removed. DNS is a `MessageInterface` that manages its own transport (UDP/TCP port 53), not a transport variant.
The call protocol handler (Layer 3) is interface-agnostic: it processes `InterfaceEvent` frames from `StreamInterface` sessions and `InterfaceRequest` → `InterfaceResponse` pairs from `MessageInterface` handlers. The dispatch logic is the same; only the framing differs.
## Consequences
**Positive**: HTTP and DNS are first-class interfaces with proper type signatures. No forcing stateless protocols into a session model. The existing stealth mode byte-peek pattern naturally extends to `HttpInterface`. The `InterfaceRequest` / `InterfaceResponse` types normalize calls across message-based interfaces.
**Positive**: Removing `TransportKind::Dns` prevents a breaking change later — code should never depend on DNS as a transport variant.
**Positive**: `ListenerConfig` correctly models the server's accept loop: stream listeners spawn one accept loop per (transport, interface) pair, while HTTP and DNS listeners each manage their own server.
**Negative**: Two traits where there was one. But they serve fundamentally different purposes. A common super-trait would add complexity (`accept_stream` + `handle_request` + `transport_kind`) without practical benefit — implementations satisfy one trait or the other, never both.
**Negative**: The `accept()` method on the current `Interface` trait needs to be renamed. This is a rename of an existing method signature, not a semantic change — `SshInterface` and `RawFramingInterface` implementations become `StreamInterface` implementations with the same `accept()` logic.
## References
- ADR-026 (transport/interface separation — updated by this ADR)
- [interface.md](../interface.md) — Interface layer spec
- [research/phase2/interface-model.md](../../research/phase2/interface-model.md) — Full analysis
- [research/phase2/tls-transport.md](../../research/phase2/tls-transport.md) — HTTP interface, ListenerConfig

@@ -1,82 +0,0 @@
# ADR-036: CredentialProvider as Core Type
## Status
Accepted
## Context
Alknet's `IdentityProvider` resolves **inbound** authentication: given a
credential (fingerprint or token), produce an `Identity`. But there is no
corresponding abstraction for **outbound** credentials: how does alknet
authenticate _to_ external services (vast.ai, rustfs, gitea)?
Without `CredentialProvider`, each service wrapper would independently solve
credential retrieval, caching, and lifecycle management. This leads to
duplicated effort and inconsistent security practices across service wrappers.
The pattern mirrors the existing `IdentityProvider` pattern: trait in core,
default impl using simple storage, production impl using the secret service
and database.
## Decision
Define `CredentialProvider` trait and `CredentialSet` enum in
`alknet_core::credentials`.
```rust
use std::collections::HashMap;

pub trait CredentialProvider: Send + Sync + 'static {
    fn get_credentials(&self, service: &str) -> Option<CredentialSet>;
    fn refresh_credentials(&self, service: &str) -> Option<CredentialSet>;
}

pub enum CredentialSet {
    ApiKey { header_name: String, token: String },
    Basic { username: String, password: String },
    Bearer { token: String },
    S3AccessKey { access_key: String, secret_key: String, session_token: Option<String> },
    OidcToken { access_token: String, refresh_token: Option<String>, expires_at: Option<u64> },
    Custom { scheme: String, params: HashMap<String, String> },
}
```
The trait is intentionally narrow. It returns credentials for a named service.
It does not try to abstract the auth mechanism itself — that stays with the
service wrapper that knows the protocol (S3 signing, OAuth2 refresh, etc.).
Phase 1 provides `SecretStoreCredentialProvider` (reads from
`SecretProtocol::Decrypt`, holds in RAM). Phase 2+ adds
`ManagedCredentialProvider` (with `CredentialManager` for lifecycle management:
refresh, expiration, provisioning).
`CredentialProvider` does not depend on `IdentityProvider`, though
`ManagedCredentialProvider` may use `Identity.id` for identity-bound credential
lookups.
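A minimal sketch of how a service wrapper might consume the trait is shown below. `StaticCredentialProvider` and `auth_header` are hypothetical names, the enum is trimmed to three variants for brevity, and a real Phase 1 provider would read from the secret service rather than a fixed map.

```rust
use std::collections::HashMap;

// Trimmed copy of the enum above (assumption: three variants suffice here).
#[derive(Debug, Clone, PartialEq)]
pub enum CredentialSet {
    ApiKey { header_name: String, token: String },
    Basic { username: String, password: String },
    Bearer { token: String },
}

pub trait CredentialProvider: Send + Sync + 'static {
    fn get_credentials(&self, service: &str) -> Option<CredentialSet>;
    fn refresh_credentials(&self, service: &str) -> Option<CredentialSet>;
}

/// Hypothetical provider backed by a fixed in-memory map.
pub struct StaticCredentialProvider {
    entries: HashMap<String, CredentialSet>,
}

impl StaticCredentialProvider {
    pub fn new(entries: HashMap<String, CredentialSet>) -> Self {
        Self { entries }
    }
}

impl CredentialProvider for StaticCredentialProvider {
    fn get_credentials(&self, service: &str) -> Option<CredentialSet> {
        self.entries.get(service).cloned()
    }

    fn refresh_credentials(&self, service: &str) -> Option<CredentialSet> {
        // Static entries have nothing to refresh; return the stored value.
        self.get_credentials(service)
    }
}

/// The service wrapper, not the provider, decides how a credential is presented.
/// Returns an HTTP header `(name, value)` pair where the mapping is trivial.
pub fn auth_header(creds: &CredentialSet) -> Option<(String, String)> {
    match creds {
        CredentialSet::ApiKey { header_name, token } => {
            Some((header_name.clone(), token.clone()))
        }
        CredentialSet::Bearer { token } => {
            Some(("Authorization".to_string(), format!("Bearer {token}")))
        }
        // Basic auth needs base64 encoding, which is left out of this sketch.
        CredentialSet::Basic { .. } => None,
    }
}
```

Keeping `auth_header` outside the trait mirrors the narrow contract: the provider answers "which credentials for this service", and protocol-specific presentation stays with the wrapper.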
## Consequences
**Positive**: Outbound auth has a unified abstraction, just as inbound auth
has `IdentityProvider`. Service wrappers retrieve credentials through one
interface. `OperationEnv` can expose credentials through `context.env`.
**Positive**: The `CredentialSet` enum covers all identified credential types
(API keys, bearer tokens, S3 access keys, OIDC tokens, basic auth, custom).
This is sufficient for Phases A-C. Phase D (alknet as OIDC provider) is additive.
**Positive**: The trait in core, impl in service crate pattern is consistent
with `IdentityProvider` (trait in core, `ConfigIdentityProvider` in core,
`StorageIdentityProvider` in alknet-storage).
**Negative**: Adds a new core type and a new module (`credentials`). But this
is the same pattern as `IdentityProvider` and `auth` — a small, narrow trait
with a clear contract.
**Negative**: `ManagedCredentialProvider` and `CredentialManager` are Phase C
concepts. The spec should define them as future extensions, not implement them
now.
## References
- ADR-029 (Identity as core type — same pattern)
- [credentials.md](../credentials.md) — CredentialProvider spec
- [research/phase2/credential-provider.md](../../research/phase2/credential-provider.md) — Full analysis
- [identity.md](../identity.md) — IdentityProvider (inbound, opposite direction)

@@ -1,83 +0,0 @@
# ADR-037: API Keys as DynamicConfig Auth
## Status
Accepted
## Context
Alknet's token auth uses Ed25519-signed `AuthToken`s — the same key material
used for SSH auth. This is appropriate for interactive clients (browsers, CLI)
that can generate and sign Ed25519 key pairs.
But for service accounts, automation, and simple integrations, Ed25519 key
pairs are inconvenient. A dashboard backend, a CI/CD pipeline, or a monitoring
script needs a simple bearer token that can be stored in an environment variable
or config file without managing cryptographic key pairs.
The HTTP interface (Phase 2+) requires bearer token auth for `Authorization:
Bearer <token>` headers. `AuthToken` works but requires client-side Ed25519
signing. API keys offer a simpler alternative: short bearer tokens verified by
SHA-256 hash lookup, with optional scope restrictions and TTL.
## Decision
Add `[[auth.api_keys]]` section to `DynamicConfig`:
```toml
[[auth.api_keys]]
prefix = "alk_"
hash = "sha256:abc..."
scopes = ["relay:connect", "secrets:derive"]
description = "dashboard service account"
ttl = "30d" # optional
```
`ConfigIdentityProvider::resolve_from_token()` handles both token types:
- If the input starts with the configured prefix (default `alk_`), treat it as
an API key: hash it with SHA-256 and look up the hash in the `api_keys` table.
- Otherwise, treat it as an `AuthToken`: decode, verify Ed25519 signature,
check timestamp, resolve from `authorized_keys`.
Both paths produce the same `Identity` result. In database-backed deployments,
both resolve to the same account UUID.
API keys are stored as SHA-256 hashes (like password hashing — the cleartext
key is never stored, only its hash). The prefix enables O(1) routing between
AuthToken and API key verification without trying both paths.
The full key is provided to the client exactly once (at creation time). Subsequent
verifications only compare hashes.
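The prefix routing and hash lookup described above can be sketched roughly as follows. The hash function is passed in as a parameter so the example stays free of external crates; a real deployment would plug in SHA-256. `classify`, `TokenKind`, `ApiKeyTable`, and `ApiKeyEntry` are hypothetical names, not the project's API.

```rust
use std::collections::HashMap;

/// Which verification path a presented token takes.
#[derive(Debug, PartialEq)]
pub enum TokenKind<'a> {
    ApiKey(&'a str),
    AuthToken(&'a str),
}

/// Routes on the configured prefix without attempting both verifications.
pub fn classify<'a>(token: &'a str, prefix: &str) -> TokenKind<'a> {
    if token.starts_with(prefix) {
        TokenKind::ApiKey(token)
    } else {
        TokenKind::AuthToken(token)
    }
}

pub struct ApiKeyEntry {
    pub scopes: Vec<String>,
    pub description: String,
}

/// Hypothetical table keyed by the hash of the key; cleartext is never stored.
pub struct ApiKeyTable<H> {
    hash: H,
    by_hash: HashMap<String, ApiKeyEntry>,
}

impl<H: Fn(&[u8]) -> String> ApiKeyTable<H> {
    /// `hash` stands in for SHA-256 (assumption: any deterministic digest works
    /// for the purpose of this sketch).
    pub fn new(hash: H) -> Self {
        Self { hash, by_hash: HashMap::new() }
    }

    /// Called once at creation time; only the digest is retained.
    pub fn insert(&mut self, cleartext_key: &str, entry: ApiKeyEntry) {
        let digest = (self.hash)(cleartext_key.as_bytes());
        self.by_hash.insert(digest, entry);
    }

    /// Hashes the presented key and looks the digest up.
    pub fn lookup(&self, presented: &str) -> Option<&ApiKeyEntry> {
        let digest = (self.hash)(presented.as_bytes());
        self.by_hash.get(&digest)
    }
}
```

Since the table only ever sees digests after `insert`, losing the cleartext key means issuing a new one, which matches the "provided exactly once" rule.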
## Consequences
**Positive**: Simple bearer token auth for HTTP and other non-SSH interfaces.
No cryptographic key management for service accounts. Consistent with industry
practice (Stripe, GitHub, AWS all use prefixed API keys).
**Positive**: Both AuthTokens and API keys go through `resolve_from_token()`.
The caller doesn't need to know which type they're using. This keeps the
authentication layer unified.
**Positive**: Scoped API keys enable fine-grained access control for service
accounts. A monitoring tool gets `["monitoring:read"]`, not full access.
**Negative**: API keys are bearer tokens — anyone who obtains the key has the
associated permissions. The hash storage and optional TTL mitigate but do not
eliminate this risk. Ed25519 AuthTokens remain the preferred auth method for
interactive clients.
**Negative**: API key rotation requires updating `DynamicConfig` (or the
`api_keys` database table). The `ConfigReloadHandle` / `ConfigService` reload
mechanism handles this, but it's a deliberate operation, not automatic.
**Negative**: No rate limiting on API key verification is built into this ADR.
Rate limiting on the HTTP interface is a separate concern.
## References
- ADR-023 (unified auth, shared key material)
- ADR-029 (Identity as core type)
- ADR-030 (static/dynamic config split)
- [auth.md](../auth.md) — Token auth, AuthPolicy, API keys
- [configuration.md](../configuration.md) — DynamicConfig, AuthPolicy
- [research/phase2/interface-model.md](../../research/phase2/interface-model.md) — API keys in config

@@ -1,137 +0,0 @@
# ADR-038: Seed Lifecycle and Memory Security
## Status
Accepted
## Context
The alknet-secret crate holds the master BIP39 seed phrase in RAM. This seed is
the root of trust for all derived keys (identity, encryption, signing). If the
seed is leaked — through memory dumps, swap files, or core dumps — an attacker
can derive every key in the system.
Security-conscious key management systems typically employ three defenses:
1. **Zeroize**: Overwrite sensitive memory before deallocating. Prevents
stale-data reads from freed memory.
2. **Memory locking** (`mlock`/`VirtualLock`): Prevent the OS from paging
sensitive RAM to disk. Prevents swap-file leakage.
3. **Constant-time comparison**: Prevent timing side-channels when comparing
keys or tokens.
The question is: which of these should alknet-secret adopt in v1, and which
should be deferred?
## Decision
**Phase 3 (v1): Zeroize only. Defer mlock and constant-time comparison to
Phase B.**
- All sensitive types (seed bytes, derived private keys, passphrase strings)
derive `Zeroize` and implement `Drop` to call `zeroize()` before deallocation.
- The `Lock` operation calls `zeroize()` on the seed and all cached derived
keys, then drops them.
- `mlock`/`VirtualLock` and constant-time comparison are not included in v1.
### Rationale for deferring mlock
1. **Complexity**: `mlock` requires root/CAP_IPC_LOCK on Linux or
`SeLockMemory` on Windows. The crate should work in unprivileged contexts
(development, testing, single-user nodes) without requiring system
configuration changes.
2. **Performance**: `mlock` locks physical pages, which are typically 4 KB.
Locking many small buffers wastes physical memory. The seed (64 bytes) and
derived keys (32 to 64 bytes each) are tiny; the real risk is swap-file
leakage, which `zeroize` partially mitigates by wiping before free.
3. **Deployment flexibility**: Production head nodes running as root or with
`CAP_IPC_LOCK` can add `mlock` in Phase B. Development and CLI nodes
shouldn't need it.
4. **Audit surface**: `mlock` introduces platform-specific code paths (Linux
vs macOS vs Windows) that should be audited together, not bolted on
incrementally.
### Rationale for deferring constant-time comparison
The `SecretProtocol` service receives requests over irpc (local mpsc or remote
QUIC). Comparison timing is not observable by callers — they send a message and
wait for a response. The comparison that matters (auth token verification) is
in alknet-core's `IdentityProvider`, not in alknet-secret. Key derivation
results (DerivedKey) are not compared against attacker-controlled input within
this crate.
### Zeroize implementation
```rust
use zeroize::Zeroize;
#[derive(Zeroize)]
#[zeroize(drop)]
struct SeedHolder {
seed: Vec<u8>,
}
#[derive(Zeroize)]
#[zeroize(drop)]
struct DerivedKeyCache {
keys: HashMap<String, Vec<u8>>,
}
```
`#[zeroize(drop)]` ensures that `Drop` calls `zeroize()` on all fields,
overwriting memory before deallocation. This is a compile-time guarantee —
forgetting to zeroize a field is a compile error.
### Lock lifecycle
```
Unlock(passphrase)
  → validate mnemonic (if restoring) or generate new
  → derive master key from seed
  → store seed in SeedHolder (Zeroize-protected)
  → cache empty (keys derived on demand)

DeriveEd25519/DeriveEncryptionKey/Encrypt/Decrypt
  → require unlocked state (error if locked)
  → derive key, return result
  → optionally cache derived key

Lock
  → zeroize all cached derived keys
  → zeroize seed
  → drop all sensitive material
  → service returns to locked state
```
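The lifecycle above maps naturally onto a two-state machine. The sketch below is an illustration under several assumptions: `SecretService`, `SecretError`, and `toy_derive` are hypothetical, the derivation is a placeholder XOR rather than real key derivation, and `wipe` is a standard-library stand-in for what the `zeroize` crate would do.

```rust
use std::collections::HashMap;
use std::sync::atomic::{compiler_fence, Ordering};

/// Stand-in for `zeroize`: volatile writes so the wipe is not optimized away.
fn wipe(buf: &mut [u8]) {
    for byte in buf.iter_mut() {
        // SAFETY: `byte` is a valid, aligned, exclusive reference to a `u8`.
        unsafe { std::ptr::write_volatile(byte, 0) };
    }
    compiler_fence(Ordering::SeqCst);
}

/// Placeholder derivation (NOT cryptography): XOR seed bytes with path bytes.
fn toy_derive(seed: &[u8], path: &str) -> Vec<u8> {
    seed.iter().zip(path.bytes().cycle()).map(|(s, p)| s ^ p).collect()
}

#[derive(Debug, PartialEq)]
pub enum SecretError {
    Locked,
}

enum State {
    Locked,
    Unlocked { seed: Vec<u8>, cache: HashMap<String, Vec<u8>> },
}

pub struct SecretService {
    state: State,
}

impl SecretService {
    pub fn new() -> Self {
        Self { state: State::Locked }
    }

    /// Unlock: store the seed and start with an empty cache.
    pub fn unlock(&mut self, seed: Vec<u8>) {
        self.lock(); // wipe any previous material first
        self.state = State::Unlocked { seed, cache: HashMap::new() };
    }

    /// Derive: requires the unlocked state, caches the derived key.
    pub fn derive(&mut self, path: &str) -> Result<Vec<u8>, SecretError> {
        match &mut self.state {
            State::Locked => Err(SecretError::Locked),
            State::Unlocked { seed, cache } => {
                let key = cache
                    .entry(path.to_owned())
                    .or_insert_with(|| toy_derive(seed.as_slice(), path));
                Ok(key.clone())
            }
        }
    }

    /// Lock: wipe cached keys, wipe the seed, then drop everything.
    pub fn lock(&mut self) {
        if let State::Unlocked { seed, cache } = &mut self.state {
            for key in cache.values_mut() {
                wipe(key);
            }
            wipe(seed);
        }
        self.state = State::Locked;
    }

    pub fn is_locked(&self) -> bool {
        matches!(self.state, State::Locked)
    }
}

impl Drop for SecretService {
    fn drop(&mut self) {
        self.lock();
    }
}
```

Encoding the seed inside the `Unlocked` variant means there is no code path that can reach key material while the service is locked; the type system enforces the "error if locked" step.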
## Consequences
- **Positive**: Zeroize has negligible runtime cost, is a minimal dependency
(the `zeroize` crate is ~500 lines of pure Rust with no FFI or inline
assembly), and provides meaningful protection against stale-memory reads.
- **Positive**: Lock effectively purges all sensitive material. After Lock,
the process memory contains no useful secret data.
- **Positive**: No platform-specific code paths in v1. The crate compiles and
runs everywhere without privilege requirements.
- **Negative**: Without `mlock`, the OS can page the seed to swap before
zeroization occurs. This is a window of vulnerability that Phase B closes.
The risk is acceptable for v1 because swap-file extraction requires root
access or physical access to the machine — the same threat model as reading
process memory directly.
- **Negative**: Without constant-time comparison, timing side-channels exist
in theory. In practice, no comparison in alknet-secret operates on
attacker-controlled input, so the risk is nil within this crate.
- **Negative**: `zeroize` adds a dependency. The `zeroize` crate is widely
used in Rust crypto (ring, ed25519-dalek, x25519-dalek) and is a de facto
standard.
## References
- [secret-service.md](../secret-service.md) — Security model, Lock/Unlock lifecycle
- [ADR-027](027-crate-decomposition.md) — Crate decomposition (alknet-secret is independent)
- [credentials.md](../credentials.md) — SecretStoreCredentialProvider integration
- `zeroize` crate — https://crates.io/crates/zeroize