docs(architecture): spec alknet-core with per-crate subdocs, ADR-010/011

Add alknet-core architecture specs in docs/architecture/crates/core/ with focused subdocuments for core types, endpoint, auth, and config. Write ADR-010 (ALPN Router and Endpoint) defining AlknetEndpoint, HandlerRegistry, accept loop, and graceful shutdown. Write ADR-011 (AuthContext Structure) defining AuthContext fields, immutability in handle(), and IdentityProvider injection pattern. Resolve OQ-04 (static registration), OQ-12 (file paths only for v1). Add OQ-11 (auth observability). Fix remaining alknet-secret references to alknet-vault across ADRs 003/004/005/009.
2026-06-16 12:07:17 +00:00
parent 80128a56e5
commit 90d5f4eaf9
13 changed files with 1151 additions and 18 deletions
--- a/docs/architecture/decisions/003-crate-decomposition.md
+++ b/docs/architecture/decisions/003-crate-decomposition.md
@@ -12,7 +12,7 @@ The new ALPN dispatch model eliminates the need for a shared interface layer. Ea

 Key constraints:
 - Protocol crates must depend on alknet-core for auth/identity/config — but not on each other
- alknet-secret is already standalone (no alknet-core dependency) and must remain so (renamed to alknet-vault — see ADR-008)
+- alknet-vault is already standalone (no alknet-core dependency) and must remain so (see ADR-008)
 - The CLI binary assembles everything — it's the only crate that depends on all handler crates
 - Some handlers (SFTP, call protocol) need to compile to WASM for browser/client use
 - irpc is the foundation for the call protocol — it provides the operation registry, framing, and pub/sub patterns
--- a/docs/architecture/decisions/004-auth-as-shared-core.md
+++ b/docs/architecture/decisions/004-auth-as-shared-core.md
@@ -42,7 +42,7 @@ The `AuthContext` passed to `handle()` may be partial — containing only transp

 The `CredentialProvider` concept from the previous architecture is simplified: there is no phase progression (A–D). The `IdentityProvider` has two resolution paths — fingerprint and token — and a `ConfigIdentityProvider` implementation that draws from static and dynamic config.

-`alknet-secret` remains independent. It does not depend on `alknet-core` or `IdentityProvider`. The secret service provides derived keys on request; identity resolution is a separate concern.
+`alknet-vault` stays standalone. It does not depend on `alknet-core` or `IdentityProvider`. The vault provides derived keys on request; identity resolution is a separate concern.

 ## Consequences

--- a/docs/architecture/decisions/005-irpc-as-call-protocol-foundation.md
+++ b/docs/architecture/decisions/005-irpc-as-call-protocol-foundation.md
@@ -30,7 +30,7 @@ This means:
 - The TypeScript "operations" and "pub/sub" patterns that can import OpenAPI schemas and expose MCP tools are supported at the protocol level
 - Future NAPI and WASM clients speak the same wire format

-The `SecretProtocol` in alknet-secret also uses irpc as its service protocol. This is consistent — alknet-secret's irpc service is an independent service that happens to use the same framing, not a dependency on alknet-call.
+The `VaultProtocol` in alknet-vault also uses irpc as its service protocol. This is consistent — alknet-vault's irpc service is an independent service that happens to use the same framing, not a dependency on alknet-call.

 ## Consequences

@@ -39,7 +39,7 @@ The `SecretProtocol` in alknet-secret also uses irpc as its service protocol. Th
 - JSON Schema compatible — OpenAPI import, MCP tool exposure, cross-language client generation
 - No need to design a custom RPC wire format — irpc's is already battle-tested
 - The call protocol inherits irpc's streaming and subscription patterns
- Consistency with alknet-secret's service model — both use irpc
+- Consistency with alknet-vault's service model — both use irpc

 **Negative:**
 - alknet-call depends on irpc — if irpc has limitations or bugs, we're affected (mitigated: irpc is lightweight and we can fork if needed)
--- a/docs/architecture/decisions/009-one-way-door-decision-framework.md
+++ b/docs/architecture/decisions/009-one-way-door-decision-framework.md
@@ -8,7 +8,7 @@ Accepted

 Not all architectural decisions carry the same reversal cost. Some decisions are easy to change later — if you pick the wrong data structure, you refactor. Other decisions are nearly impossible to reverse — if you build a type hierarchy that forecloses WASM compatibility, every handler written against that hierarchy must be rewritten.

-This distinction matters especially during Phase 0 (exploration) and early Phase 1 (architecture). The project is post-pivot with foundational ADRs in place but no implementation code yet (except alknet-secret). Decisions made now shape the API surface that every handler depends on.
+This distinction matters especially during Phase 0 (exploration) and early Phase 1 (architecture). The project is post-pivot with foundational ADRs in place but no implementation code yet (except alknet-vault). Decisions made now shape the API surface that every handler depends on.

 Without an explicit framework, one-way doors can be treated as casually as two-way doors, leading to costly rework. Or conversely, two-way doors can be over-analyzed, blocking progress on decisions that are cheap to reverse.

--- a/docs/architecture/decisions/010-alpn-router-and-endpoint.md
+++ b/docs/architecture/decisions/010-alpn-router-and-endpoint.md
@@ -0,0 +1,141 @@
+# ADR-010: ALPN Router and Endpoint
+
+## Status
+
+Proposed
+
+## Context
+
+ADR-001 establishes ALPN-based protocol dispatch: a single QUIC+TLS endpoint accepts connections, and the ALPN negotiated during the TLS handshake routes each connection to the correct `ProtocolHandler`. ADR-002 defines the `ProtocolHandler` trait. ADR-006 establishes one ALPN per connection. ADR-007 defines `Connection` and `BiStream`.
+
+The question now is: **how does the endpoint work?** What accepts QUIC connections, negotiates ALPN, and hands connections to handlers? This is the central runtime piece of alknet-core — every handler depends on it.
+
+The reference implementation (`alknet-main`) uses a `Server` struct that binds a `TransportAcceptor`, runs an accept loop, and dispatches to a `ServerHandler` based on transport type and interface kind. This has three problems that the ALPN model solves:
+
+1. **Multiple listener types**: `ListenerConfig` has three variants (Stream, Http, Dns) with per-variant configuration and validation. ALPN eliminates this — one endpoint, one listener, ALPN does the routing.
+2. **Protocol detection by byte-peeking**: The `stealth` module reads the first bytes to detect SSH vs HTTP. ALPN negotiation makes this unnecessary — the TLS handshake tells you the protocol before any application bytes are read.
+3. **SSH-centric accept loop**: The current `handle_connection` immediately enters `russh::server::run_stream`. In the new model, the accept loop is ALPN-agnostic — it doesn't know or care what protocol the handler speaks.
+
+### iroh's pattern
+
+iroh's `Router` registers `ProtocolHandler` instances with ALPN strings, then calls `endpoint.accept()` in a loop. For each incoming connection, it reads the negotiated ALPN, looks up the handler, and calls `handler.accept(connection)`. This is clean and proven.
+
+### Key design questions
+
+1. **Handler registration**: Static (at startup) or dynamic (at runtime)?
+2. **TLS certificate management**: How does the endpoint get TLS certs? Where does ACME fit?
+3. **Connection lifecycle**: Who owns the `quinn::Endpoint`? How does graceful shutdown work?
+4. **Error handling**: What happens when a handler panics? When ALPN negotiation fails?
+
+## Decision
+
+### Endpoint owns the QUIC endpoint
+
+`alknet-core` owns the `quinn::Endpoint` directly. The endpoint binds to a single address, configures TLS with a `rustls::ServerConfig` that includes the ALPN strings from all registered handlers, and accepts connections in a loop.
+
+```rust
+pub struct AlknetEndpoint {
+    endpoint: quinn::Endpoint,
+    handlers: Arc<HandlerRegistry>,
+    dynamic: Arc<ArcSwap<DynamicConfig>>,
+    identity_provider: Arc<dyn IdentityProvider>,
+    shutdown: watch::Receiver<bool>,
+}
+```
+
+There is no `TransportAcceptor` trait, no `TransportKind` enum, no `ListenerConfig` enum. QUIC+TLS+ALPN replaces all of that.
+
+### HandlerRegistry maps ALPN strings to ProtocolHandler instances
+
+```rust
+pub struct HandlerRegistry {
+    handlers: HashMap<&'static [u8], Arc<dyn ProtocolHandler>>,
+}
+```
+
+Registration is static at startup. The CLI binary constructs a `HandlerRegistry` by inserting handlers for each ALPN, then passes it to `AlknetEndpoint::new()`. The ALPN strings in the TLS `ServerConfig` are derived from the registry's keys.
+
+This is a two-way door (OQ-04): starting static is simple. If dynamic registration is needed later, the registry can be wrapped in `ArcSwap<HandlerRegistry>` and the TLS `ServerConfig` can be regenerated. But ALPN negotiation happens during the TLS handshake, so a handler added at runtime is reachable only by connections established after the `ServerConfig` is regenerated, and clients must already know its ALPN string to request it. Dynamic registration therefore has limited value for v1.
+
+### Accept loop: connect, dispatch, spawn
+
+```
+loop {
+    incoming = endpoint.accept().await
+    connection = incoming.await  // TLS handshake + ALPN negotiation
+    alpn = connection.alpn()
+    handler = registry.get(alpn)
+    
+    match handler {
+        Some(h) => {
+            auth = resolve_endpoint_auth(connection)  // TLS client cert, etc.
+            tokio::spawn(async move { h.handle(connection, &auth).await })  // task owns connection and auth
+        }
+        None => connection.close()
+    }
+}
+```
+
+Key behaviors:
+- **ALPN mismatch**: The TLS handshake fails. This is correct — the client and server have no protocol in common.
+- **Handler not found**: Should not happen — the `ServerConfig` only advertises ALPNs that have registered handlers. If somehow a connection negotiates an ALPN with no handler, the connection is closed with an error log.
+- **Handler panic**: The handler runs in a spawned tokio task. If it panics, tokio catches the panic at the task boundary (observable as a `JoinError` if the task's `JoinHandle` is awaited). The connection is dropped. Other connections are unaffected.
+- **Graceful shutdown**: A `watch::Sender<bool>` signals the accept loop to stop accepting new connections. Existing connections are given a drain timeout (2 seconds default), then forcefully closed.
+
+### TLS certificate configuration
+
+TLS certs come from `StaticConfig`:
+- File paths (`tls_cert`, `tls_key`) for manual provisioning
+- Self-signed for development
+
+The `rustls::ServerConfig` is built from the cert + key + ALPN list at startup. The ALPN list is derived from `HandlerRegistry::alpn_strings()`.
+
+ACME auto-provisioning (Let's Encrypt) is not in scope for v1. It will be added as a feature later (see OQ-12).
+
+### Error taxonomy
+
+```rust
+pub enum EndpointError {
+    BindFailed(io::Error),
+    TlsConfig(io::Error),
+    HandlerNotFound(Vec<u8>),  // ALPN string with no registered handler
+}
+
+pub enum HandlerError {
+    ConnectionClosed,
+    StreamError(io::Error),
+    AuthRequired,
+    Internal(Box<dyn std::error::Error + Send + Sync>),
+}
+```
+
+- `EndpointError`: Problems starting or running the endpoint. Fatal — the endpoint cannot accept connections.
+- `HandlerError`: Problems within a handler's `handle()` method. Non-fatal — the connection is closed, but the endpoint keeps running.
+
+## Consequences
+
+**Positive:**
+- Single accept loop replaces multiple listener types and byte-peeking
+- ALPN negotiation happens at the TLS layer — no application-level protocol detection
+- Adding a handler is registering an ALPN string — no endpoint code changes
+- Handler panics are isolated — one bad handler can't take down the endpoint
+- `quinn::Endpoint` is the only transport — no TransportAcceptor trait needed for v1
+- The endpoint is testable: give it mock handlers and a test ALPN, verify dispatch
+
+**Negative:**
+- Direct quinn dependency in alknet-core — WASM targets can't use quinn (mitigated: WASM clients don't run endpoints, they connect to them; the WASM door is for client-side handlers, not the endpoint itself)
+- No runtime handler registration without regenerating the TLS config (mitigated: two-way door, start static, add ArcSwap later if needed)
+- TLS cert provisioning is manual (file paths) for v1 — ACME auto-provisioning is a future feature (OQ-12)
+- One address per endpoint — if you need to listen on multiple addresses, run multiple endpoints (acceptable for v1)
+
+## References
+
+- ADR-001: ALPN-based protocol dispatch
+- ADR-002: ProtocolHandler trait
+- ADR-006: ALPN string convention and connection model
+- ADR-007: BiStream type definition (Connection, SendStream, RecvStream)
+- ADR-009: One-way door decision framework
+- OQ-04: Dynamic handler registration (two-way door, start static)
+- OQ-05: Multi-transport endpoint (two-way door, start with quinn)
+- iroh Router pattern: `docs/research/references/iroh/`
+- Reference implementation: `alknet-main/crates/alknet-core/src/server/serve.rs`
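
The `HandlerRegistry` lookup described in ADR-010 above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the alknet-core implementation: the `ProtocolHandler` trait here is a synchronous stand-in for the async trait in ADR-002; `EchoHandler`, `register`, `get`, and the ALPN string `alknet/echo` are hypothetical; and the return type of `alpn_strings()` is assumed to be `Vec<Vec<u8>>`, which matches the shape of rustls's `ServerConfig::alpn_protocols` field.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for the ADR-002 trait (assumption: the real trait is async and
/// receives a `Connection` and `&AuthContext`).
pub trait ProtocolHandler: Send + Sync {
    fn name(&self) -> &'static str;
}

/// Hypothetical handler used only to exercise the registry.
pub struct EchoHandler;

impl ProtocolHandler for EchoHandler {
    fn name(&self) -> &'static str {
        "echo"
    }
}

pub struct HandlerRegistry {
    handlers: HashMap<&'static [u8], Arc<dyn ProtocolHandler>>,
}

impl HandlerRegistry {
    pub fn new() -> Self {
        Self { handlers: HashMap::new() }
    }

    /// Static registration at startup. Returns the previous handler if the
    /// ALPN was already registered, so callers can detect collisions.
    pub fn register(
        &mut self,
        alpn: &'static [u8],
        handler: Arc<dyn ProtocolHandler>,
    ) -> Option<Arc<dyn ProtocolHandler>> {
        self.handlers.insert(alpn, handler)
    }

    /// Lookup step of the accept loop: negotiated ALPN bytes to handler.
    pub fn get(&self, alpn: &[u8]) -> Option<Arc<dyn ProtocolHandler>> {
        self.handlers.get(alpn).cloned()
    }

    /// ALPN list for the TLS config, sorted so the output is deterministic.
    pub fn alpn_strings(&self) -> Vec<Vec<u8>> {
        let mut alpns: Vec<Vec<u8>> =
            self.handlers.keys().map(|k| k.to_vec()).collect();
        alpns.sort();
        alpns
    }
}
```

Because the keys are `&'static [u8]`, a lookup accepts any borrowed `&[u8]`, so the negotiated ALPN bytes can be passed to `get` without allocating. Deriving the TLS ALPN list from the same map is what keeps the "handler not found" case from arising in normal operation.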
--- a/docs/architecture/decisions/011-authcontext-structure.md
+++ b/docs/architecture/decisions/011-authcontext-structure.md
@@ -0,0 +1,156 @@
+# ADR-011: AuthContext Structure and Resolution Flow
+
+## Status
+
+Proposed
+
+## Context
+
+ADR-004 establishes the hybrid auth model: the endpoint resolves what it can (TLS client certificate fingerprint), handlers resolve what they must (AuthToken in the first frame, Bearer header, SSH key fingerprint). The `AuthContext` passed to `handle()` may be partial.
+
+The reference implementation's `Identity` struct is:
+
+```rust
+pub struct Identity {
+    pub id: String,
+    pub scopes: Vec<String>,
+    pub resources: HashMap<String, Vec<String>>,
+}
+```
+
+And `ConfigIdentityProvider` resolves fingerprints and API keys to `Identity`. This works well and carries forward.
+
+But the reference implementation has no `AuthContext` type — auth resolution happens inside the SSH handler before calling `IdentityProvider`. The new model needs a type that represents "what the endpoint knows about this connection's identity before the handler starts," plus a way for handlers to enrich it.
+
+This is a one-way door: once handlers depend on `AuthContext`'s structure, changing it affects every handler. The structure must be right.
+
+### Design considerations
+
+1. **Handlers need identity information to make authorization decisions.** A handler that requires authentication needs to know: is the peer authenticated? Who are they? What scopes do they have?
+
+2. **The endpoint may have zero, partial, or complete identity information.** A plain QUIC connection with no TLS client cert gives the endpoint nothing. A TLS connection with a client cert gives the endpoint a fingerprint that may resolve to an Identity. A handler that extracts an AuthToken from the first frame can complete the resolution.
+
+3. **AuthContext must not be SSH-specific.** The reference implementation's auth types are tangled with russh (SSH key fingerprints, certificate authorities). The new model needs to be ALPN-agnostic.
+
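+4. **AuthContext is constructed by the endpoint and enriched by handlers.** The endpoint creates it from TLS-level information. The handler supplements it with protocol-level information; whether that means mutating the context or keeping the result in handler-local state is settled in the Decision below.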
+
+5. **AuthContext must be cheap to construct.** Every incoming connection gets one, even if authentication ultimately fails.
+
+## Decision
+
+### AuthContext is a struct with optional fields
+
+```rust
+pub struct AuthContext {
+    /// The peer's authenticated identity, if resolved.
+    /// None means the endpoint has no identity information for this connection.
+    /// Some(Identity) means the endpoint resolved the peer's identity.
+    pub identity: Option<Identity>,
+
+    /// The negotiated ALPN for this connection.
+    /// Always present — the endpoint sets this from the TLS handshake.
+    pub alpn: Vec<u8>,
+
+    /// The peer's remote address, if available.
+    pub remote_addr: Option<SocketAddr>,
+
+    /// TLS client certificate fingerprint, if the client presented a certificate.
+    /// Set by the endpoint during TLS handshake. Handlers may use this for
+    /// SSH host key verification or other fingerprint-based auth.
+    pub tls_client_fingerprint: Option<String>,
+}
+```
+
+Key design points:
+
+- `identity: Option<Identity>` — not `Identity` with optional fields, not a separate `PartialAuthContext`. The endpoint sets it to `None` if it has no identity information, or `Some(identity)` if it resolved one. Handlers that need to complete auth call `IdentityProvider` themselves and store the resolved identity in a local variable — they do NOT mutate AuthContext (see immutability section below).
+- `alpn` is always present — every connection has a negotiated ALPN.
+- `remote_addr` is informational. It's available from the QUIC connection and useful for logging and rate limiting, but it's not authoritative (clients can be behind NATs/proxies).
+- `tls_client_fingerprint` captures the TLS-level credential. If present, it's the SHA-256 fingerprint of the client's TLS certificate. This is separate from `identity` because a handler might need the fingerprint even when `IdentityProvider::resolve_from_fingerprint()` returns `None` (e.g., unknown cert, but the handler wants to log it).
+
+### AuthContext is Clone
+
+`AuthContext` derives `Clone`. Handlers can clone it for per-stream or per-channel contexts within a connection. The `Identity` inside is also `Clone`.
+
+### Handler-level auth enrichment pattern
+
+Handlers that need to complete authentication do so inside `handle()`:
+
+```rust
+async fn handle(&self, connection: Connection, auth: &AuthContext) -> Result<(), HandlerError> {
+    let identity = if let Some(id) = &auth.identity {
+        id.clone()  // Endpoint already resolved identity
+    } else {
+        // Extract credentials from the protocol, resolve via IdentityProvider
+        let token = self.extract_auth_token(&connection).await?;
+        self.identity_provider.resolve_from_token(&token)
+            .ok_or(HandlerError::AuthRequired)?
+    };
+    // ... proceed with authenticated identity
+}
+```
+
+Handlers that don't need authentication (e.g., DNS resolver, health check) can ignore `auth.identity` entirely.
+
+### Identity carries over from reference implementation
+
+```rust
+pub struct Identity {
+    pub id: String,
+    pub scopes: Vec<String>,
+    pub resources: HashMap<String, Vec<String>>,
+}
+```
+
+This is the same structure from the reference implementation, minus the russh dependency. It's ALPN-agnostic:
+- `id`: A unique identifier string. For SSH key auth, this is the SHA-256 fingerprint. For API key auth, this is the key prefix. For certificate auth, this is the principal name.
+- `scopes`: Authorization scopes. `["relay:connect", "secrets:derive"]` etc.
+- `resources`: Named resource lists. `{"service": ["gitea", "registry"]}` etc.
+
+### AuthToken carries raw bytes
+
+```rust
+pub struct AuthToken {
+    pub raw: Vec<u8>,
+}
+```
+
+Unchanged from the reference implementation. Opaque bytes — the handler that extracted it knows its encoding.
+
+### IdentityProvider carries over with minor adaptation
+
+```rust
+pub trait IdentityProvider: Send + Sync + 'static {
+    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
+    fn resolve_from_token(&self, token: &AuthToken) -> Option<Identity>;
+}
+```
+
+The implementation (`ConfigIdentityProvider`) changes from the reference: it no longer depends on russh types for key storage. Instead, it stores fingerprint strings and API key entries, drawing from `DynamicConfig` via `ArcSwap`.
+
+### AuthContext is NOT mutable inside handle()
+
+The `handle()` signature passes `&AuthContext` (immutable reference). Handlers that resolve identity create a local variable with the resolved identity — they don't mutate the AuthContext. This prevents accidental cross-contamination between streams on the same connection.
+
+## Consequences
+
+**Positive:**
+- `AuthContext` is a value type — cheap to construct, clone, and pass around
+- Handlers that don't need auth can ignore it entirely
+- The endpoint provides what it can for free (TLS client cert fingerprint), handlers complete what they need
+- No russh dependency in AuthContext — it's ALPN-agnostic
+- `Option<Identity>` is explicit — there's no "partially authenticated" state that handlers have to interpret
+- Handlers that need to enrich auth hold the resolved identity in local variables instead of mutating `AuthContext`, which keeps the data flow clean
+
+**Negative:**
+- Handlers that need auth must call `IdentityProvider` themselves — this is intentional (ADR-004 hybrid model) but means each handler has its own auth extraction logic
+- `tls_client_fingerprint` is separate from `identity` — a handler might wonder "why do I have a fingerprint but no identity?" This happens when the client presents a cert that's not in the authorized keys. The handler can log the fingerprint for debugging.
+- `AuthContext` doesn't carry protocol-specific auth state (e.g., SSH auth method, HTTP auth scheme). This is by design — protocol-specific details belong inside the handler, not in the shared auth context.
+
+## References
+
+- ADR-002: ProtocolHandler trait
+- ADR-004: Auth as shared core (IdentityProvider, hybrid auth model)
+- ADR-007: BiStream type definition (Connection parameter)
+- ADR-010: ALPN router and endpoint (where AuthContext is created)
+- Reference implementation: `alknet-main/crates/alknet-core/src/auth/identity.rs`
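
The two-stage resolution flow from ADR-011 above (the endpoint resolves from the TLS fingerprint, the handler falls back to a token without mutating the context) can be sketched as follows. This is a synchronous, standard-library-only illustration under stated assumptions: `MapProvider`, `ident`, `endpoint_auth`, and `effective_identity` are hypothetical names, and the in-memory maps stand in for whatever `ConfigIdentityProvider` reads from `DynamicConfig`. The struct and trait shapes are copied from the ADR.

```rust
use std::collections::HashMap;
use std::net::SocketAddr;

#[derive(Clone, Debug, PartialEq)]
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
    pub resources: HashMap<String, Vec<String>>,
}

pub struct AuthToken {
    pub raw: Vec<u8>,
}

pub trait IdentityProvider: Send + Sync + 'static {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
    fn resolve_from_token(&self, token: &AuthToken) -> Option<Identity>;
}

#[derive(Clone, Debug)]
pub struct AuthContext {
    pub identity: Option<Identity>,
    pub alpn: Vec<u8>,
    pub remote_addr: Option<SocketAddr>,
    pub tls_client_fingerprint: Option<String>,
}

/// Hypothetical in-memory provider, for illustration only.
#[derive(Default)]
pub struct MapProvider {
    pub by_fingerprint: HashMap<String, Identity>,
    pub by_token: HashMap<Vec<u8>, Identity>,
}

impl IdentityProvider for MapProvider {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity> {
        self.by_fingerprint.get(fingerprint).cloned()
    }
    fn resolve_from_token(&self, token: &AuthToken) -> Option<Identity> {
        self.by_token.get(&token.raw).cloned()
    }
}

/// Hypothetical helper: an identity with no scopes or resources.
pub fn ident(id: &str) -> Identity {
    Identity { id: id.to_string(), scopes: Vec::new(), resources: HashMap::new() }
}

/// Endpoint side: build the context from TLS-level facts. The fingerprint is
/// kept even when it does not resolve, so handlers can still log it.
pub fn endpoint_auth(
    provider: &dyn IdentityProvider,
    alpn: &[u8],
    remote_addr: Option<SocketAddr>,
    fingerprint: Option<String>,
) -> AuthContext {
    let identity = fingerprint
        .as_deref()
        .and_then(|fp| provider.resolve_from_fingerprint(fp));
    AuthContext { identity, alpn: alpn.to_vec(), remote_addr, tls_client_fingerprint: fingerprint }
}

/// Handler side: prefer the endpoint's identity, else try a protocol-level
/// token. The context is only read; the result lives in a local value.
pub fn effective_identity(
    auth: &AuthContext,
    provider: &dyn IdentityProvider,
    token: Option<&AuthToken>,
) -> Option<Identity> {
    auth.identity
        .clone()
        .or_else(|| token.and_then(|t| provider.resolve_from_token(t)))
}
```

A handler would map a `None` result to `HandlerError::AuthRequired`. An unknown client certificate yields a context with `tls_client_fingerprint` set and `identity` empty, which is the "fingerprint but no identity" case listed under the negative consequences.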