docs(architecture): add alknet-vault spec, ADR-018, ADR-019, OQ-20/21/22

Spec the vault crate from its existing implementation. The vault is stable
(implementation exists); this spec documents what IS so the
implementation-sync agent can reconcile source drift.

New spec documents (crates/vault/):
- README.md — crate index, security constraints, public API
- mnemonic-derivation.md — BIP39, SLIP-0010, BIP-0032, derivation paths
- encryption.md — AES-256-GCM, EncryptedData, key versioning, salt
- service.md — VaultServiceHandle lifecycle, actor dispatch, cache
- protocol.md — VaultProtocol irpc messages, DerivedKey redaction

New ADRs:
- ADR-018: Vault as standalone crate (zero alknet deps; own types/errors)
- ADR-019: Vault assembly-layer-only access (CLI is sole caller)

New open questions:
- OQ-20: Salt/KDF Phase B (open, low priority — salt field reserved)
- OQ-21: Remote vault administration (deferred — needs ADR if ever needed)
- OQ-22: Key rotation mechanism (open, low priority — workflow not specced)

Spec-vs-source drift explicitly flagged (for the sync agent):
- rand::random() used for IVs instead of OsRng (security-critical)
- unwrap() on every RwLock acquisition (must use unwrap_or_else)
- ADR-038 / OQ-SVC-03 references in source comments are stale (old numbering)
- VaultServiceActor::spawn returns a non-functional second actor (source bug)
- KeyVersionMismatch error variant is defined but unused in v1
2026-06-19 09:23:47 +00:00
parent 40f6468e18
commit dd1ca1de70
10 changed files with 1564 additions and 8 deletions
--- a/docs/architecture/crates/vault/service.md
+++ b/docs/architecture/crates/vault/service.md
@@ -0,0 +1,361 @@
+---
+status: draft
+last_updated: 2026-06-19
+---
+
+# Service
+
+The `VaultServiceHandle` runtime API: unlock/lock lifecycle, key
+derivation, encryption, caching, and the actor dispatch path.
+
+## What
+
+The service layer wraps the vault's cryptographic primitives in a
+stateful runtime with a clear lifecycle. It holds the master seed in
+`Zeroize`-protected memory and provides methods for the unlock/lock
+lifecycle, key derivation, and encryption/decryption.
+
+This is the API the assembly layer (CLI binary) calls. No other component
+calls these methods directly (ADR-019).
+
+## VaultServiceHandle
+
+The primary API for local (in-process) use. Thread-safe via
+`Arc<RwLock<VaultServiceInner>>`.
+
+```rust
+#[derive(Clone)]
+pub struct VaultServiceHandle {
+    inner: Arc<RwLock<VaultServiceInner>>,
+}
+
+struct VaultServiceInner {
+    mnemonic: Option<Mnemonic>,  // None if locked
+    seed: Option<Seed>,         // None if locked
+    unlocked: bool,
+    cache: KeyCache,            // TTL + LRU, see Cache section
+}
+```
+
+`VaultServiceHandle` is `Clone` — cloning shares the underlying state via
+`Arc`. This is how the actor and the assembly layer share the same vault.
+
+## Lifecycle
+
+```
+Locked (initial state)
+  │
+  │ unlock(phrase, passphrase) / unlock_new(word_count)
+  ▼
+Unlocked — derive, encrypt, decrypt available
+  │
+  │ lock()
+  ▼
+Locked — seed and cache purged
+```
+
+### unlock(phrase, passphrase)
+
+```rust
+pub fn unlock(&self, phrase: &str, passphrase: Option<&str>) -> Result<(), VaultServiceError>;
+```
+
+Unlock with an existing mnemonic phrase. Validates the phrase against the
+BIP39 word list, derives the seed, and stores both in `VaultServiceInner`.
+Returns `AlreadyUnlocked` if the vault is already unlocked.
+
+The passphrase is the BIP39 password extension (the "25th word"). `None`
+means no passphrase (equivalent to empty string). Different passphrases
+produce different seeds.
+
+### unlock_new(word_count) → phrase
+
+```rust
+pub fn unlock_new(&self, word_count: usize) -> Result<String, VaultServiceError>;
+```
+
+Generate a new random mnemonic, unlock with it, and return the phrase.
+Store the returned phrase securely — it is the root of trust. Supported
+word counts: 12, 15, 18, 21, 24.
+
+This is the "first run" path — a new node generates its mnemonic, writes
+it down, and the vault is unlocked for the process lifetime.
+
+### lock()
+
+```rust
+pub fn lock(&self);
+```
+
+Purge the seed, mnemonic, and all cached derived keys. Calls `zeroize()`
+on all sensitive material. After locking, no derive/encrypt/decrypt
+operations are possible until `unlock` is called again.
+
+`lock()` on an already-locked service is a no-op (not an error).
+
+### is_unlocked()
+
+```rust
+pub fn is_unlocked(&self) -> bool;
+```
+
+Check whether the vault is currently unlocked. Cheap (read lock only).
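+
+The lifecycle above can be sketched with the standard library alone. This
+is a minimal illustration, not the crate's implementation: `SketchHandle`,
+`SketchError`, and `seed_len` are hypothetical names, the seed is a plain
+`Vec<u8>`, and the manual overwrite in `lock` only stands in for the
+`zeroize` crate (a plain overwrite may be optimized away by the compiler).
+
+```rust
+use std::sync::{Arc, RwLock};
+
+#[derive(Debug, PartialEq)]
+enum SketchError {
+    VaultLocked,
+    AlreadyUnlocked,
+}
+
+#[derive(Default)]
+struct Inner {
+    seed: Option<Vec<u8>>, // None while locked
+}
+
+// Cloning the handle shares the same state through the Arc.
+#[derive(Clone, Default)]
+struct SketchHandle {
+    inner: Arc<RwLock<Inner>>,
+}
+
+impl SketchHandle {
+    fn unlock(&self, seed: Vec<u8>) -> Result<(), SketchError> {
+        let mut guard = self.inner.write().unwrap_or_else(|e| e.into_inner());
+        if guard.seed.is_some() {
+            return Err(SketchError::AlreadyUnlocked);
+        }
+        guard.seed = Some(seed);
+        Ok(())
+    }
+
+    // Locking an already-locked handle is a no-op.
+    fn lock(&self) {
+        let mut guard = self.inner.write().unwrap_or_else(|e| e.into_inner());
+        if let Some(mut seed) = guard.seed.take() {
+            // Stand-in for zeroize(): overwrite the bytes before dropping.
+            seed.iter_mut().for_each(|b| *b = 0);
+        }
+    }
+
+    fn is_unlocked(&self) -> bool {
+        let guard = self.inner.read().unwrap_or_else(|e| e.into_inner());
+        guard.seed.is_some()
+    }
+
+    // Any operation that needs the seed fails while locked.
+    fn seed_len(&self) -> Result<usize, SketchError> {
+        let guard = self.inner.read().unwrap_or_else(|e| e.into_inner());
+        guard.seed.as_ref().map(|s| s.len()).ok_or(SketchError::VaultLocked)
+    }
+}
+```
+
+Because the handle is `Clone` over an `Arc`, locking through one clone is
+immediately visible through every other clone.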
+
+## Derive Methods
+
+All derive methods require an unlocked vault and return
+`VaultServiceError::VaultLocked` if called while locked.
+
+### derive_ed25519(path) → DerivedKey
+
+```rust
+pub fn derive_ed25519(&self, path: &str) -> Result<DerivedKey, VaultServiceError>;
+```
+
+Derive an Ed25519 keypair at the given SLIP-0010 path. Checks the cache
+first; on a miss, derives from the seed and caches the result. Returns a
+`DerivedKey` with `KeyType::Ed25519`.
+
+### derive_encryption_key(path) → DerivedKey
+
+```rust
+pub fn derive_encryption_key(&self, path: &str) -> Result<DerivedKey, VaultServiceError>;
+```
+
+Derive an AES-256-GCM encryption key at the given path. Same cache
+behavior as `derive_ed25519`. Returns a `DerivedKey` with
+`KeyType::Aes256Gcm`.
+
+### derive_ethereum_key(path) → DerivedKey (feature-gated)
+
+```rust
+pub fn derive_ethereum_key(&self, path: &str) -> Result<DerivedKey, VaultServiceError>;
+```
+
+Derive a secp256k1 keypair at the given BIP-0032 path. Returns
+`UnsupportedKeyType` when the `secp256k1` feature is disabled. Returns a
+`DerivedKey` with `KeyType::Secp256k1` (33-byte compressed public key).
+
+### derive_password(path, length) → Vec<u8>
+
+```rust
+pub fn derive_password(&self, path: &str, length: usize) -> Result<Vec<u8>, VaultServiceError>;
+pub fn derive_password_string(&self, path: &str, length: usize) -> Result<String, VaultServiceError>;
+```
+
+Derive deterministic password bytes at the given path, truncated to
+`length`. This is **not cached** — password derivation is cheap and
+passwords are typically one-shot (derive, use, discard). The string
+variant base64url-encodes the bytes (URL-safe, no padding).
+
+`derive_password` is the mechanism for per-site deterministic passwords:
+the same seed + path always produces the same password. The path includes
+a site hash (`site_password_path(site_hash)`) so different sites get
+different passwords.
+
+## Encrypt and Decrypt
+
+### encrypt(plaintext, key_version) → EncryptedData
+
+```rust
+pub fn encrypt(&self, plaintext: &str, key_version: u32) -> Result<EncryptedData, VaultServiceError>;
+```
+
+Encrypt plaintext using the encryption key derived at `PATHS::ENCRYPTION`.
+Derives (and caches) the encryption key on first call, then uses the cache
+for subsequent calls. See [encryption.md](encryption.md) for the
+cryptographic details.
+
+### decrypt(encrypted) → String
+
+```rust
+pub fn decrypt(&self, encrypted: &EncryptedData) -> Result<String, VaultServiceError>;
+```
+
+Decrypt an `EncryptedData` blob. Derives (and caches) the encryption key at
+`PATHS::ENCRYPTION` if not already cached. The `encrypted.key_version` is
+stamped onto the `EncryptionKey` for forward compatibility but **does not
+select a different derivation path in v1** — the same key (at
+`m/74'/2'/0'/0'`) decrypts any version. Path-per-version routing is a Phase
+B concern (OQ-22). See [encryption.md](encryption.md).
+
+## Cache
+
+Derived keys are cached for performance — HD derivation involves HMAC
+operations that are not free. The cache is keyed by derivation path and
+has TTL-based expiry and LRU eviction.
+
+```rust
+pub struct KeyCache {
+    entries: HashMap<String, CachedKey>,
+    order: Vec<String>,         // LRU ordering
+    config: CacheConfig,
+}
+
+pub struct CacheConfig {
+    pub ttl: Duration,          // default: 1 hour
+    pub max_entries: usize,     // default: 64
+}
+```
+
+- **TTL**: entries expire after `ttl` (default 1 hour). Expired entries are
+  evicted lazily on access (`get` checks expiry) or via `evict_expired()`.
+- **LRU**: when the cache exceeds `max_entries` (default 64), the least
+  recently used entry is evicted. Access (`get`) updates the LRU order.
+- **Zeroized**: `CachedKey` derives `Zeroize` and `ZeroizeOnDrop`. Evicted
+  and cleared entries are zeroized — derived private keys do not linger in
+  freed heap memory.
+- **Cleared on lock**: `lock()` calls `cache.clear()`, which removes and
+  zeroizes all entries.
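+
+The four properties above can be illustrated with a small std-only cache.
+This is a sketch under stated assumptions, not the contents of `cache.rs`:
+`SketchCache` and `SketchKey` are hypothetical names, keys are raw byte
+vectors, and the hand-written `Drop` impl is only a stand-in for
+`ZeroizeOnDrop` from the `zeroize` crate.
+
+```rust
+use std::collections::HashMap;
+use std::time::{Duration, Instant};
+
+struct SketchKey {
+    bytes: Vec<u8>,
+    inserted: Instant,
+}
+
+impl Drop for SketchKey {
+    fn drop(&mut self) {
+        // Stand-in for ZeroizeOnDrop: wipe the bytes when the entry is dropped.
+        for b in self.bytes.iter_mut() {
+            *b = 0;
+        }
+    }
+}
+
+struct SketchCache {
+    entries: HashMap<String, SketchKey>,
+    order: Vec<String>, // front = least recently used
+    ttl: Duration,
+    max_entries: usize,
+}
+
+impl SketchCache {
+    fn new(ttl: Duration, max_entries: usize) -> Self {
+        Self { entries: HashMap::new(), order: Vec::new(), ttl, max_entries }
+    }
+
+    // Lazy TTL check on access; a hit moves the path to the back of `order`.
+    fn get(&mut self, path: &str) -> Option<&[u8]> {
+        let expired = match self.entries.get(path) {
+            Some(key) => key.inserted.elapsed() >= self.ttl,
+            None => return None,
+        };
+        if expired {
+            self.remove(path);
+            return None;
+        }
+        self.order.retain(|p| p.as_str() != path);
+        self.order.push(path.to_string());
+        self.entries.get(path).map(|key| key.bytes.as_slice())
+    }
+
+    // Evict least recently used entries until there is room, then insert.
+    fn insert(&mut self, path: &str, bytes: Vec<u8>) {
+        self.remove(path);
+        while self.entries.len() >= self.max_entries && !self.order.is_empty() {
+            let lru = self.order.remove(0);
+            self.entries.remove(&lru); // dropping the value wipes it
+        }
+        self.entries.insert(path.to_string(), SketchKey { bytes, inserted: Instant::now() });
+        self.order.push(path.to_string());
+    }
+
+    fn remove(&mut self, path: &str) {
+        self.entries.remove(path);
+        self.order.retain(|p| p.as_str() != path);
+    }
+
+    // Called on lock: every value is dropped, and therefore wiped.
+    fn clear(&mut self) {
+        self.entries.clear();
+        self.order.clear();
+    }
+}
+```
+
+The `Vec`-based ordering makes each access linear in the number of
+entries, which is a reasonable trade at the default limit of 64 entries.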
+
+### What is and isn't cached
+
+| Operation | Cached? | Why |
+|-----------|---------|-----|
+| `derive_ed25519` | Yes | Derivation is expensive; keys are reused |
+| `derive_encryption_key` | Yes | Same — encryption key reused across calls |
+| `derive_ethereum_key` | Yes | Same |
+| `derive_password` | No | Cheap derivation; passwords are one-shot |
+| `encrypt` / `decrypt` | Key cached | The encryption key (at `PATHS::ENCRYPTION`) is cached; the plaintext is not |
+
+`derive_password` does not cache because it's a truncation of derived
+bytes, not a keypair that's reused. Caching it would grow the cache with
+unique paths (one per site hash) for no reuse benefit.
+
+## Actor Dispatch
+
+The `VaultServiceActor` processes `VaultMessage` variants from an mpsc
+channel and dispatches to `VaultServiceHandle` methods. This is the irpc
+dispatch mechanism (ADR-005) — the in-process actor pattern that irpc
+services use.
+
+```rust
+pub struct VaultServiceActor {
+    handle: VaultServiceHandle,
+}
+
+impl VaultServiceActor {
+    pub fn new(handle: VaultServiceHandle) -> Self;
+    pub async fn run(mut self, mut rx: mpsc::Receiver<VaultMessage>);
+    pub fn spawn(handle: VaultServiceHandle) -> (Client<VaultProtocol>, VaultServiceActor);
+}
+```
+
+- `run(rx)`: Message loop. Each `VaultMessage` variant is dispatched to the
+  corresponding handle method, and the response is sent through the oneshot
+  channel embedded in the message. Consumes `self`.
+- `spawn(handle)`: Spawn the actor as a `tokio::task` and return a
+  `Client<VaultProtocol>` for sending messages. **Source bug: the current
+  `spawn` implementation returns a fresh, unspawned `VaultServiceActor` as
+  the second tuple element (the spawned actor is consumed by `run`). The
+  returned actor has no channel and is non-functional. This should be
+  corrected during implementation sync — either drop the second return
+  value (return only `Client<VaultProtocol>`) or restructure the API so
+  the returned actor is the one that was spawned.**
+
+For local in-process use, prefer `VaultServiceHandle` directly: there is
+no channel and no serialization. The actor exists only for irpc service
+dispatch, which is also in-process (the actor and the handle share state
+via `Arc`).
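+
+The dispatch pattern can be sketched without irpc or tokio. The example
+below is an assumption-laden illustration: it swaps in `std::thread` and
+`std::sync::mpsc` for the tokio task and the mpsc/oneshot channels, and
+`SketchMessage`, `SketchClient`, `SketchHandle`, and `spawn_actor` are
+hypothetical names. It follows the first correction suggested for
+`spawn`, returning only the client.
+
+```rust
+use std::sync::mpsc;
+use std::sync::{Arc, RwLock};
+use std::thread;
+
+// Each message carries its own reply channel (a std stand-in for oneshot).
+enum SketchMessage {
+    IsUnlocked { reply: mpsc::Sender<bool> },
+    Lock { reply: mpsc::Sender<()> },
+}
+
+#[derive(Clone)]
+struct SketchHandle {
+    unlocked: Arc<RwLock<bool>>,
+}
+
+struct SketchClient {
+    tx: mpsc::Sender<SketchMessage>,
+}
+
+impl SketchClient {
+    // Returns None if the actor has gone away.
+    fn is_unlocked(&self) -> Option<bool> {
+        let (reply, rx) = mpsc::channel();
+        self.tx.send(SketchMessage::IsUnlocked { reply }).ok()?;
+        rx.recv().ok()
+    }
+
+    fn lock(&self) -> Option<()> {
+        let (reply, rx) = mpsc::channel();
+        self.tx.send(SketchMessage::Lock { reply }).ok()?;
+        rx.recv().ok()
+    }
+}
+
+// The message loop owns the receiver; callers get only the client.
+fn spawn_actor(handle: SketchHandle) -> SketchClient {
+    let (tx, rx) = mpsc::channel();
+    thread::spawn(move || {
+        // The loop ends once every sender has been dropped.
+        for msg in rx {
+            match msg {
+                SketchMessage::IsUnlocked { reply } => {
+                    let state = *handle.unlocked.read().unwrap_or_else(|e| e.into_inner());
+                    let _ = reply.send(state);
+                }
+                SketchMessage::Lock { reply } => {
+                    *handle.unlocked.write().unwrap_or_else(|e| e.into_inner()) = false;
+                    let _ = reply.send(());
+                }
+            }
+        }
+    });
+    SketchClient { tx }
+}
+```
+
+A caller that keeps its own clone of the handle observes every change
+made through the actor, because both sides share the same `Arc`.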
+
+### Dispatch paths
+
+| Path | Type | Serialization | Use case |
+|------|------|---------------|----------|
+| Direct (in-process) | `VaultServiceHandle` method calls | None | CLI binary at startup (the supported path) |
+| Actor (in-process) | `VaultMessage` over mpsc | None (channel) | irpc service dispatch (in-process) |
+
+Remote (in-cluster) vault dispatch — where the vault runs as a sidecar
+and other processes send `VaultMessage` over a network — is **not
+supported** (ADR-019, OQ-21). The irpc `RemoteService` trait infrastructure
+exists in the library, but exposing the vault over the network would
+require its own ADR with an explicit threat model (the master seed must
+never cross the network). The dispatch table above lists only the
+supported paths.
+
+The assembly layer (CLI binary) uses the direct path. The actor path
+exists for in-process irpc dispatch but is not used by the assembly layer
+— it's available for test harnesses and future in-process service
+patterns. Neither path is on the alknet call protocol (ADR-008, ADR-014).
+
+## Errors
+
+```rust
+#[derive(Debug, thiserror::Error, Serialize, Deserialize)]
+pub enum VaultServiceError {
+    VaultLocked,          // called derive/encrypt/decrypt while locked
+    AlreadyUnlocked,      // called unlock while already unlocked
+    Mnemonic(String),     // mnemonic generation/validation failed
+    Derivation(String),   // HD derivation failed (bad path, HMAC error)
+    Encryption(String),   // AES-GCM encrypt/decrypt failed
+    InvalidPath(String),  // derivation path is malformed
+    UnsupportedKeyType,   // secp256k1 called without the feature
+}
+```
+
+`VaultServiceError` is `Serialize`/`Deserialize` (for irpc dispatch) and
+wraps sub-errors as strings. It does not implement `From` for alknet-core
+error types — the CLI binary converts at the assembly boundary (ADR-018).
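+
+One way the assembly-boundary conversion could look is sketched below.
+Everything here is hypothetical: `CliError` and `startup` are illustrative
+names, and `SketchVaultError` mirrors only two of the variants above. The
+point is that the `From` impl lives on the CLI side, so the vault crate
+never names an outside type.
+
+```rust
+use std::fmt;
+
+// Stand-in for the vault crate's error type (two variants only).
+#[derive(Debug)]
+pub enum SketchVaultError {
+    VaultLocked,
+    InvalidPath(String),
+}
+
+// Hypothetical error type owned by the CLI binary.
+#[derive(Debug, PartialEq)]
+pub enum CliError {
+    Vault(String),
+}
+
+impl fmt::Display for CliError {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        match self {
+            CliError::Vault(msg) => write!(f, "vault error: {msg}"),
+        }
+    }
+}
+
+// Defined in the CLI crate, not in the vault crate.
+impl From<SketchVaultError> for CliError {
+    fn from(e: SketchVaultError) -> Self {
+        match e {
+            SketchVaultError::VaultLocked => CliError::Vault("vault is locked".to_string()),
+            SketchVaultError::InvalidPath(p) => CliError::Vault(format!("invalid path: {p}")),
+        }
+    }
+}
+
+// `?` applies the From impl at the assembly boundary.
+pub fn startup(result: Result<(), SketchVaultError>) -> Result<(), CliError> {
+    result?;
+    Ok(())
+}
+```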
+
+## Design Decisions
+
+| Decision | ADR | Summary |
+|----------|-----|---------|
+| Assembly layer is the sole caller | [ADR-019](../../decisions/019-vault-assembly-layer-only.md) | Handlers never hold a vault reference |
+| RwLock for thread safety | — | Multiple readers (derive), exclusive writer (unlock/lock) |
+| TTL + LRU cache | — | Bounded memory, fresh keys, zeroized eviction |
+| Actor for in-process dispatch | [ADR-005](../../decisions/005-irpc-as-call-protocol-foundation.md) | irpc message dispatch; not on the call protocol |
+| `derive_password` not cached | — | One-shot; caching grows cache with no reuse |
+
+## Open Questions
+
+See [open-questions.md](../../open-questions.md) for full details.
+
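+- **OQ-21** (deferred): Remote vault administration. Network unlock is not
+  supported; needs an ADR if ever needed.
+- **OQ-22** (open, low priority): Key rotation mechanism. `decrypt` does
+  not route by `key_version` in v1; the rotation workflow is not specced.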
+
+## Security Constraints
+
+These are security-critical implementation requirements, not
+architectural decisions. They are documented here so implementation agents
+don't miss them.
+
+- **OsRng for IVs**: AES-GCM IVs and any cryptographic nonces must use
+  `OsRng` (or equivalent CSPRNG), not `rand::random()`. IV reuse under the
+  same key is catastrophic for GCM (authenticity breaks, two-time-pad on
+  plaintext). **The current source uses `rand::random()` for IV generation
+  in `encryption::encrypt()` — this is a known drift and must be corrected
+  during implementation sync.**
+- **Zeroized drop**: `Seed`, `Mnemonic`, `CachedKey`, `EncryptionKey`,
+  `ExtendedPrivKey`, `Secp256k1ExtendedPrivKey`, and `DerivedKey` all
+  derive `Zeroize` and `ZeroizeOnDrop`. The cache must clear on drop, not
+  just on explicit `lock()`. **The current `KeyCache::clear()` removes
+  entries and relies on `CachedKey`'s `Drop` impl for zeroization. This is
+  sound because `HashMap::clear()` drops every value it removes, but the
+  behavior is worth pinning with a test.**
+- **No `unwrap()` or `expect()` outside tests**: poisoned lock recovery
+  uses `unwrap_or_else(|e| e.into_inner())` or explicit error propagation.
+  A panic in one vault operation must not brick the vault for all other
+  operations. **The current source uses `unwrap()` on every `RwLock`
+  acquisition in `VaultServiceHandle` (lines 142, 161, 182, 191, 196, 227,
+  264, 307, 340, 367). This is a known drift and must be corrected during
+  implementation sync: recover a poisoned lock with
+  `unwrap_or_else(|e| e.into_inner())` instead of panicking.**
+- **`DerivedKey` is move-only, not `Clone`**: `DerivedKey` deliberately
+  does not derive `Clone`. Consumers receive it by value, and it is
+  zeroized when dropped (handled by `#[zeroize(drop)]`). This prevents
+  accidental duplication of secret material. **The current source does
+  not derive `Clone` on `DerivedKey`, which is correct.**
+- **Cache eviction zeroizes**: when the cache evicts an entry (LRU or
+  TTL), the `CachedKey` is dropped, which triggers `ZeroizeOnDrop`. Do not
+  replace `CachedKey` with a type that doesn't zeroize.
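+
+The poisoned-lock requirement can be demonstrated with the standard
+library alone. The sketch below is illustrative: `poison` and
+`read_recovering` are hypothetical helpers, and a `u32` stands in for
+`VaultServiceInner`. A thread panics while holding the write guard, and a
+later reader still gets through by recovering the guard from the
+`PoisonError`.
+
+```rust
+use std::sync::{Arc, RwLock};
+use std::thread;
+
+// Panic while holding the write guard, which poisons the lock.
+fn poison(lock: &Arc<RwLock<u32>>) {
+    let shared = Arc::clone(lock);
+    let _ = thread::spawn(move || {
+        let mut guard = shared.write().unwrap_or_else(|e| e.into_inner());
+        *guard += 1;
+        panic!("simulated failure inside a vault operation");
+    })
+    .join(); // join returns Err because the thread panicked; ignore it
+}
+
+// Recover the guard instead of propagating the panic to this caller.
+fn read_recovering(lock: &RwLock<u32>) -> u32 {
+    *lock.read().unwrap_or_else(|e| e.into_inner())
+}
+```
+
+Recovery hands back whatever state the panicking thread left behind, so
+it is only appropriate when each write leaves the inner value consistent
+at every step. Where that cannot be shown, explicit error propagation is
+the safer of the two options.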
+
+## References
+
+- Implementation: `crates/alknet-vault/src/service.rs`,
+  `crates/alknet-vault/src/cache.rs`
+- Tests: `crates/alknet-vault/tests/service_tests.rs`,
+  `crates/alknet-vault/src/service.rs` (unit tests),
+  `crates/alknet-vault/src/cache.rs` (unit tests)
+- [protocol.md](protocol.md) — `VaultMessage` and `DerivedKey`
+- [encryption.md](encryption.md) — `encrypt` / `decrypt` cryptographic details