docs: refactor hub/spoke to head/worker, add service layer and HD key derivation

- Replace hub/spoke terminology with head/worker throughout all research docs
- Add irpc service layer architecture (AuthProtocol, SecretProtocol,
  ConfigProtocol, StorageProtocol)
- Add BIP39/SLIP-0010 HD key derivation for secrets management
- Add event boundary discipline (domain events vs integration events)
- Add application services layer (Docker, Node, Wallet, Proxy, Compute)
- New docs/research/services.md defining irpc service protocols
- Update core.md with service layer section and head/worker model
- Update configuration.md to delegate auth to AuthService (irpc)
- Update storage.md with secrets/key derivation and event boundaries
- Update flow.md with event boundary decision and cross-references
2026-06-06 15:33:35 +00:00
parent 2315a211ff
commit d291a485f0
5 changed files with 1007 additions and 49 deletions


@@ -6,13 +6,25 @@ phase: exploration
# Configuration Architecture
## Terminology Change: Head/Worker
This document previously used **hub/spoke** terminology. It has been updated to **head/worker**:
- **Head node**: The coordinating node (formerly "hub"). A head can also be a worker.
- **Worker node**: A node that connects to a head and registers services (formerly "spoke").
- **Node**: Any participant in the network. Every node has an identity.
The new terms better reflect that a head can also act as a worker, which is what makes mesh topologies possible.
## Problem
Alknet's configuration is loaded once at startup and never changes. This has
three specific failures:
1. **No hot reload of authentication credentials.** Adding or removing an
authorized key requires restarting the server process. In a hub/spoke
authorized key requires restarting the server process. In a head/worker
deployment where keys are managed via a database (see
`@alkdev/storage`'s `peer_credentials` table), the alknet process must be
restarted every time a key is added, revoked, or rotated. This is
@@ -38,7 +50,7 @@ three specific failures:
data sources plug in from outside.
- This does not propose file-watching (potential attack vector, unnecessary
complexity). CLI usage loads config once at startup. Programmatic usage
(NAPI, hub) calls reload explicitly.
(NAPI, head node) calls reload explicitly.
- This does not replace the existing `ServeOptions` builder pattern. It
generalizes it.
@@ -62,6 +74,19 @@ atomically without disrupting existing connections.
The split is clean: anything that affects the SSH handshake or socket binding
is static. Anything that's checked per-connection or per-channel is dynamic.
### Auth Reload: Service Approach
The original design held all authorized keys in memory via `ArcSwap<DynamicConfig>`. For small deployments this works, but for nodes serving many users it requires loading every key into RAM and atomic-swapping the entire set on each reload.
The improved approach is to make auth an **irpc service** (see [core.md](core.md) and [services.md](services.md)). Auth verification becomes a service call: `VerifyPubkey { fingerprint, key_data }` → `oneshot::Sender<AuthResult>`. The service can:
- Query SQLite on demand (no need to hold all keys in memory)
- Maintain an LRU cache for hot keys
- Subscribe to honker streams for key invalidation
- Run locally (in-process mpsc) or remotely (QUIC stream)
`ArcSwap<DynamicConfig>` remains as a fallback for minimal deployments (CLI usage, single-node setups) where SQLite overhead isn't warranted. The service approach is the primary path for production deployments.
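
The request/reply shape described above can be sketched with the standard library alone. This is a minimal illustration, not the irpc API: `std::sync::mpsc` channels stand in for both the in-process request channel and the `oneshot` reply, a `HashMap` stands in for the SQLite-backed store, and the `AuthResult` variants, `KeyStore`, `spawn_auth_service`, and `verify_pubkey` names are assumptions made for the example.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, Sender};
use std::thread;

/// Outcome of a verification request (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub enum AuthResult {
    Accepted { principal: String },
    Rejected,
}

/// Requests understood by the sketch service. Each verification request
/// carries its own reply channel, mirroring `oneshot::Sender<AuthResult>`.
pub enum AuthRequest {
    VerifyPubkey {
        fingerprint: String,
        key_data: Vec<u8>,
        reply: Sender<AuthResult>,
    },
    Shutdown,
}

/// Stand-in for the database: fingerprint -> (key bytes, principal).
pub type KeyStore = HashMap<String, (Vec<u8>, String)>;

/// Spawns the service loop on its own thread and returns the request handle.
pub fn spawn_auth_service(store: KeyStore) -> Sender<AuthRequest> {
    let (tx, rx) = mpsc::channel::<AuthRequest>();
    thread::spawn(move || {
        for req in rx {
            match req {
                AuthRequest::VerifyPubkey { fingerprint, key_data, reply } => {
                    // Look the key up on demand instead of holding a full
                    // in-memory key set on the caller's side.
                    let result = match store.get(&fingerprint) {
                        Some((stored, principal)) if *stored == key_data => {
                            AuthResult::Accepted { principal: principal.clone() }
                        }
                        _ => AuthResult::Rejected,
                    };
                    // The caller may have gone away; ignore a failed send.
                    let _ = reply.send(result);
                }
                AuthRequest::Shutdown => break,
            }
        }
    });
    tx
}

/// Client-side helper: send one request and block for the reply.
pub fn verify_pubkey(svc: &Sender<AuthRequest>, fingerprint: &str, key_data: &[u8]) -> AuthResult {
    let (reply, rx) = mpsc::channel();
    let req = AuthRequest::VerifyPubkey {
        fingerprint: fingerprint.to_string(),
        key_data: key_data.to_vec(),
        reply,
    };
    if svc.send(req).is_err() {
        return AuthResult::Rejected; // service is gone: fail closed
    }
    rx.recv().unwrap_or(AuthResult::Rejected)
}
```

Because callers only hold a `Sender<AuthRequest>`, the same call site would work whether the service loop runs in-process or behind a remote transport, which is the property the service approach relies on.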
### Current Architecture
```
@@ -83,7 +108,7 @@ path to update it.
### Proposed Architecture
Replace `Arc<ServerAuthConfig>` with a reloadable provider:
Replace `Arc<ServerAuthConfig>` with a service-based approach:
```
StaticConfig (Arc, loaded once)
@@ -92,15 +117,24 @@ StaticConfig (Arc, loaded once)
├─ host key
└─ max_auth_attempts, max_connections_per_ip
AuthService (irpc service, local or remote)
├─ VerifyPubkey(fingerprint, key_data) → AuthResult
├─ VerifyToken(token_bytes) → AuthResult
└─ ReloadKeys() → ()
Backed by: SQLite (peer_credentials, api_keys)
Optional: ArcSwap<DynamicConfig> for minimal deployments
ConfigService (irpc service, always local)
├─ ReloadDynamicConfig(DynamicConfig)
└─ GetForwardingPolicy() → ForwardingPolicy
DynamicConfig (Arc<ArcSwap<DynamicConfig>>, reloadable)
├─ auth: ServerAuthConfig
├─ forwarding: ForwardingPolicy
└─ rate_limits: RateLimitConfig
ConfigReloadHandle (exposed to NAPI)
└─ reload(DynamicConfig)
```
For production: auth verification goes through the auth service, which queries SQLite. The `DynamicConfig` only holds forwarding policy and rate limits — not the full key set. For minimal deployments: auth falls back to `ArcSwap<DynamicConfig>` with all keys in memory, wrapped by the same service interface.
`ArcSwap` provides lock-free reads on the hot path. Every `auth_publickey()`
and `channel_open_direct_tcpip()` call performs one atomic load and an `Arc`
dereference, which adds negligible overhead compared to the current approach. Writes are atomic: `store()` swaps the
@@ -138,7 +172,7 @@ pub enum TargetPattern {
Rule evaluation: first match wins, default applies if no rule matches. This
model maps to OpenSSH's `AllowTcpForwarding` + `PermitOpen` but is more
expressive. It also maps to `peer_credentials.metadata.scopes` in `@alkdev/storage`
— the hub can generate forwarding rules from stored scopes.
— the head node can generate forwarding rules from stored scopes.
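
First-match-wins evaluation with a default can be sketched as follows. The variants of `TargetPattern` shown here, along with `Action`, `ForwardingRule`, and the fields of `ForwardingPolicy`, are assumptions for illustration; the real definitions in the document may differ.

```rust
/// Hypothetical target patterns; the real `TargetPattern` enum may differ.
#[derive(Debug, Clone)]
pub enum TargetPattern {
    Any,
    Host(String),
    HostPort(String, u16),
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    Allow,
    Deny,
}

pub struct ForwardingRule {
    pub action: Action,
    pub target: TargetPattern,
}

pub struct ForwardingPolicy {
    /// Applied when no rule matches.
    pub default: Action,
    /// Evaluated in order; the first matching rule decides.
    pub rules: Vec<ForwardingRule>,
}

impl TargetPattern {
    fn matches(&self, host: &str, port: u16) -> bool {
        match self {
            TargetPattern::Any => true,
            TargetPattern::Host(h) => h.as_str() == host,
            TargetPattern::HostPort(h, p) => h.as_str() == host && *p == port,
        }
    }
}

impl ForwardingPolicy {
    /// First match wins; fall back to the default if nothing matches.
    pub fn evaluate(&self, host: &str, port: u16) -> Action {
        self.rules
            .iter()
            .find(|rule| rule.target.matches(host, port))
            .map(|rule| rule.action)
            .unwrap_or(self.default)
    }
}
```

With a specific deny rule placed ahead of a broader allow rule for the same host, the denied port is rejected while other ports on that host are allowed, and any unmatched target falls through to `default`.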
Rule ordering matters. A deny-then-allow pattern gives blocklist semantics. An
allow-then-deny pattern gives allowlist semantics. Both are useful. The
@@ -220,7 +254,7 @@ interface ForwardingRuleConfig {
}
```
The hub calls `server.reloadAuth(...)` after writing to `peer_credentials`.
The head node calls `server.reloadAuth(...)` after writing to `peer_credentials`.
The NAPI layer parses the key data and constructs a new `DynamicConfig`, then
calls the `ConfigReloadHandle`.
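
On the Rust side, the reload step might look roughly like the sketch below. It uses only the standard library, so `RwLock<Arc<DynamicConfig>>` stands in for `ArcSwap<DynamicConfig>` (the real type gives lock-free reads; this stand-in does not). The single `authorized_fingerprints` field and the `current()` method are assumptions for the example.

```rust
use std::sync::{Arc, RwLock};

/// Simplified dynamic config; the field is a hypothetical placeholder.
#[derive(Debug, Clone, PartialEq)]
pub struct DynamicConfig {
    pub authorized_fingerprints: Vec<String>,
}

/// Cloneable handle around a shared config slot.
/// `RwLock<Arc<_>>` is a std-only stand-in for `arc_swap::ArcSwap`.
#[derive(Clone)]
pub struct ConfigReloadHandle {
    slot: Arc<RwLock<Arc<DynamicConfig>>>,
}

impl ConfigReloadHandle {
    pub fn new(initial: DynamicConfig) -> Self {
        Self { slot: Arc::new(RwLock::new(Arc::new(initial))) }
    }

    /// Returns a snapshot. Holders keep seeing this snapshot even if a
    /// reload happens afterwards.
    pub fn current(&self) -> Arc<DynamicConfig> {
        let guard = self.slot.read().expect("config lock poisoned");
        Arc::clone(&*guard)
    }

    /// Replaces the whole config in one step; no reader sees a partial update.
    pub fn reload(&self, next: DynamicConfig) {
        let mut guard = self.slot.write().expect("config lock poisoned");
        *guard = Arc::new(next);
    }
}
```

A connection that took a snapshot before `reload()` continues with the old config, while new connections observe the new one. That is the behavior needed to update credentials without disrupting existing connections.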
@@ -235,7 +269,7 @@ A config file for client connections could define named profiles:
```toml
[profiles.production]
server = "hub.alk.dev:443"
server = "head.alk.dev:443"
transport = "tls"
identity = "/home/user/.ssh/id_ed25519"
@@ -252,16 +286,17 @@ This is a convenience layer on top of `ConnectOptions`, not a replacement.
| Interface | Static config | Dynamic config | Reload mechanism |
|---|---|---|---|
| CLI | Flags + optional `--config` file | Loaded at startup from `--authorized-keys` | None (restart to change) |
| Core Rust | `StaticConfig` struct | `ArcSwap<DynamicConfig>` | `ConfigReloadHandle::reload()` |
| NAPI | `serve()` options | Same `ArcSwap` | `server.reloadAuth()`, `server.reloadForwarding()` |
| Core Rust | `StaticConfig` struct | `AuthService` (irpc) or `ArcSwap<DynamicConfig>` (minimal) | `ConfigService::reload()` or `ConfigReloadHandle::reload()` |
| NAPI | `serve()` options | Same | `server.reloadAuth()`, `server.reloadForwarding()` |
The CLI doesn't need a reload mechanism. When you're running alknet from the
command line, restarting is fine. The reload mechanism exists for programmatic
consumers that manage credentials in a database.
consumers and for the auth service pattern where keys are queried on demand from
a database.
### Multi-Transport Listeners
A host may want to accept connections on multiple transports simultaneously:
A head node may want to accept connections on multiple transports simultaneously:
- TCP on port 22 (simple, direct SSH)
- TLS on port 443 (stealth mode, corporate firewalls)
@@ -458,7 +493,7 @@ compat via accepting both `transport: string` (single) and
Global rules with principal matching is simpler and covers most cases. Per-user
scope derived from certificates is more granular but requires the server to
maintain a mapping from key fingerprint to scope. This mapping comes from the
hub's database, not from the SSH protocol. Phase 2 starts with global rules;
head node's database, not from the SSH protocol. Phase 2 starts with global rules;
per-user scope can be added as an extension.
- **OQ-CFG-02**: Should the config file watch for changes and auto-reload?
@@ -553,15 +588,34 @@ compat via accepting both `transport: string` (single) and
presents an Ed25519-signed timestamp token. Verification produces the same
`Identity` type via the `IdentityProvider` trait. One `reloadAuth()` call
updates both. See [auth.md](../architecture/auth.md) and
[ADR-023](../architecture/decisions/023-unified-auth-shared-key-material.md).
- **OQ-CFG-07**: Should auth and secret services share a single irpc endpoint
or be separate services?
Separate services are better. Auth (verify credentials) and Secret (derive/store
keys) have different security boundaries. The secret service holds the master
seed; the auth service only needs public key fingerprints. They may run on
different machines. See [services.md](services.md) for protocol definitions.
- **OQ-CFG-08**: How do external credentials (API keys, OAuth tokens) relate
to the secret service's HD key derivation?
HD-derived keys (from SLIP-0010/BIP39) cover self-generated secrets (identity
keys, encryption keys, SSH keys). External credentials (third-party API keys,
OAuth tokens) can't be derived — they must be stored encrypted. The secret
service handles both: derived keys are regenerated on demand; stored secrets
are encrypted with a key that is itself derived from the seed. See
[services.md](services.md) for the `SecretProtocol` definition.
## Decisions Required
These decisions will be extracted into ADRs when the architecture is finalized:
1. **ADR-020**: Static/dynamic config split, `ArcSwap<DynamicConfig>` for
hot-reloadable auth and forwarding policy. Supersedes ADR-011's "no config
file" — adds optional config file while preserving programmatic-first API.
1. **ADR-020**: Static/dynamic config split. Auth delegated to `AuthService` (irpc)
for production; `ArcSwap<DynamicConfig>` for minimal deployments. Supersedes
ADR-011's "no config file" — adds optional config file while preserving
programmatic-first API.
2. **ADR-021**: Forwarding policy with rule-based allow/deny. Default-allow
preserves current behavior during migration; default-deny for production
@@ -571,6 +625,13 @@ These decisions will be extracted into ADRs when the architecture is finalized:
loops sharing auth config, session state, and shutdown. Replaces single
`ServeTransportMode` with `Vec<ListenerConfig>`.
4. **ADR-026**: Head/worker terminology. Replace hub/spoke with head/worker
throughout all documentation and APIs. A head is also a worker.
5. **ADR-028**: Auth as service. Auth verification via irpc `AuthProtocol`
service, not in-memory key set. Enables SQLite-backed auth for production,
`ArcSwap` fallback for minimal deployments.
## References
- [ADR-011](../architecture/decisions/011-no-ssh-config-programmatic-api.md) — Programmatic-first API (superseded by ADR-020)
@@ -585,4 +646,6 @@ These decisions will be extracted into ADRs when the architecture is finalized:
- [arc-swap crate](https://docs.rs/arc-swap) — Lock-free read, atomic write for shared state
- [ADR-023](../architecture/decisions/023-unified-auth-shared-key-material.md) — Unified auth with shared key material
- [auth.md](../architecture/auth.md) — Unified auth architecture spec
- [call-protocol.md](../architecture/call-protocol.md) — Bidirectional call protocol spec
- [services.md](services.md) — Service layer architecture (irpc services)
- [core.md](core.md) — Core overview, head/worker terminology, service layer