docs: complete Phase 0 architecture — spec updates, review fixes, and link portability

Update four existing specs (overview, server, napi-and-pubsub, call-protocol) to
reflect Phase 0 decisions: three-layer model, IdentityProvider, ForwardingPolicy,
OperationEnv, static/dynamic config split. Review all 9 Phase 0a ADRs (026-034)
for consistency. Fix 4 critical issues from architecture review: missing OQ-SVC-05
in open-questions.md, deprecated hub terminology, undefined AuthService and noq
terms. Replace inline OQ text with cross-references per format rules. Add
ConfigServiceImpl definition to configuration.md. Port absolute workspace paths
to project-relative links by copying referenced docs (feasibility, certbot,
fail2ban, event_source_types) into docs/research/.
2026-06-07 11:27:52 +00:00
parent 835724d087
commit d3633b7839
22 changed files with 1508 additions and 115 deletions

@@ -7,25 +7,26 @@ last_updated: 2026-06-07
## Current State ## Current State
Architecture specification in active development. Phase 0 foundation ADRs Architecture specification in active development. Phase 0 foundation complete:
completed (026–034). New spec documents created for identity, services, ADRs 001–034 accepted, new spec documents created for all components, existing
interface, configuration, storage, flowgraph, and secret service. Existing specs updated for the three-layer model, crate decomposition, unified identity,
specs updated for the three-layer model, crate decomposition, and unified OperationEnv, and forwarding policy. Remaining open questions: OQ-15 (QUIC
identity. See [open-questions.md](open-questions.md) for remaining open coexistence), OQ-19 (WebTransport TLS), OQ-20 (worker registration), OQ-IF-01
questions. (Interface session/EventEnvelope), OQ-IF-02 (ForwardingPolicy placement). See
[open-questions.md](open-questions.md).
## Architecture Documents ## Architecture Documents
| Document | Status | Description | | Document | Status | Description |
|----------|--------|-------------| |----------|--------|-------------|
| [overview.md](overview.md) | reviewed | Package purpose, exports, dependencies | | [overview.md](overview.md) | reviewed | Package purpose, crate structure, three-layer model, exports, dependencies |
| [transport.md](transport.md) | reviewed | Transport abstraction: TCP, TLS, iroh | | [transport.md](transport.md) | reviewed | Transport abstraction: TCP, TLS, iroh |
| [auth.md](auth.md) | draft | Unified auth: SSH + token, IdentityProvider trait | | [auth.md](auth.md) | draft | Unified auth: SSH + token, IdentityProvider trait |
| [call-protocol.md](call-protocol.md) | draft | Bidirectional call/event protocol, operation registry | | [call-protocol.md](call-protocol.md) | draft | Bidirectional call/event protocol, OperationEnv, three dispatch paths |
| [client.md](client.md) | reviewed | Client connection, SOCKS5, port forwarding | | [client.md](client.md) | reviewed | Client connection, SOCKS5, port forwarding |
| [server.md](server.md) | reviewed | Server acceptance, channel handling, proxy | | [server.md](server.md) | reviewed | Server acceptance, IdentityProvider, ForwardingPolicy, channel handling |
| [tun-shim.md](tun-shim.md) | deprecated | TUN interface wrapper — **deferred**, use tun2proxy | | [tun-shim.md](tun-shim.md) | deprecated | TUN interface wrapper — **deferred**, use tun2proxy |
| [napi-and-pubsub.md](napi-and-pubsub.md) | reviewed | NAPI wrapper and pubsub event target adapter | | [napi-and-pubsub.md](napi-and-pubsub.md) | reviewed | NAPI wrapper, reload API, pubsub event target adapter |
| [identity.md](identity.md) | draft | Identity type, IdentityProvider trait, auth flows | | [identity.md](identity.md) | draft | Identity type, IdentityProvider trait, auth flows |
| [services.md](services.md) | draft | irpc service layer, OperationEnv, three dispatch paths | | [services.md](services.md) | draft | irpc service layer, OperationEnv, three dispatch paths |
| [interface.md](interface.md) | draft | Layer 2: Interface trait, SshInterface, RawFramingInterface | | [interface.md](interface.md) | draft | Layer 2: Interface trait, SshInterface, RawFramingInterface |
@@ -44,6 +45,9 @@ questions.
| [storage.md](../research/storage.md) | draft | Metagraph, identity, ACL, secrets, honker | | [storage.md](../research/storage.md) | draft | Metagraph, identity, ACL, secrets, honker |
| [flow.md](../research/flow.md) | draft | FlowGraph, operation graph, call graph, petgraph mapping | | [flow.md](../research/flow.md) | draft | FlowGraph, operation graph, call graph, petgraph mapping |
| [integration-plan.md](../research/integration-plan.md) | draft | Phased integration plan for services, pubsub, and operations | | [integration-plan.md](../research/integration-plan.md) | draft | Phased integration plan for services, pubsub, and operations |
| [feasibility/](../research/feasibility/) | — | SSH tunnel feasibility assessment and related analyses |
| [event-sourcing/](../research/event-sourcing/) | — | Event sourcing patterns and event-driven architecture reference |
| [ops/](../research/ops/) | — | Production ops reference: certbot, fail2ban |
## ADR Table ## ADR Table
@@ -81,6 +85,9 @@ questions.
| [033](decisions/033-operationenv-irpc-call-protocol.md) | OperationEnv as universal composition mechanism | Accepted | | [033](decisions/033-operationenv-irpc-call-protocol.md) | OperationEnv as universal composition mechanism | Accepted |
| [034](decisions/034-head-worker-terminology.md) | Head/worker terminology replacing hub/spoke | Accepted | | [034](decisions/034-head-worker-terminology.md) | Head/worker terminology replacing hub/spoke | Accepted |
> ADR numbers 020–022 were allocated to proposals that were withdrawn before
> acceptance and are not listed.
## Open Questions ## Open Questions
See [open-questions.md](open-questions.md) for all open and resolved questions. See [open-questions.md](open-questions.md) for all open and resolved questions.

@@ -13,6 +13,11 @@ subscriptions, and unidirectional events — all using the same wire format. The
protocol is defined as a spec + handler + registry; downstream consumers (NAPI, protocol is defined as a spec + handler + registry; downstream consumers (NAPI,
Python, head/worker) register their own operations without modifying core. Python, head/worker) register their own operations without modifying core.
OperationEnv extends the call protocol with a universal composition mechanism
that unifies local dispatch, irpc service dispatch, and remote dispatch. A
handler receives `context.env.invoke(namespace, op, input)` and doesn't know
whether the operation runs locally, in-cluster, or on a remote node.
## Why ## Why
The current control channel (ADR-018) is unidirectional (client → server) and The current control channel (ADR-018) is unidirectional (client → server) and
@@ -21,6 +26,10 @@ The call protocol generalizes it to support bidirectional calls (ADR-024) and
downstream service registration (ADR-025), enabling the head/worker model where downstream service registration (ADR-025), enabling the head/worker model where
workers expose operations the head invokes. workers expose operations the head invokes.
Without OperationEnv, handlers calling other operations would need to know
whether the target is local, in-cluster, or on a remote node. OperationEnv
abstracts this away — one handler-facing API, three dispatch backends (ADR-033).
## Architecture ## Architecture
### Operation Paths ### Operation Paths
@@ -316,6 +325,101 @@ that carries `EventEnvelope` frames:
The framing is always: 4-byte BE length prefix + JSON. The envelope shape is The framing is always: 4-byte BE length prefix + JSON. The envelope shape is
the same regardless of transport. the same regardless of transport.
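The framing rule above can be sketched with the standard library alone. This is a minimal illustration, not the crate's actual codec: the function names and the `MAX_FRAME_LEN` limit are assumptions (the spec states no size limit), and the payload is treated as opaque bytes rather than parsed JSON.

```rust
use std::io::{self, Read, Write};

/// Upper bound on accepted frame size (hypothetical; the spec does not state one).
const MAX_FRAME_LEN: usize = 1024 * 1024;

/// Write one frame: 4-byte big-endian length prefix followed by the JSON bytes.
pub fn write_frame<W: Write>(writer: &mut W, json: &[u8]) -> io::Result<()> {
    if json.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    let len = json.len() as u32;
    writer.write_all(&len.to_be_bytes())?;
    writer.write_all(json)
}

/// Read one frame and return its payload bytes (still-encoded JSON).
pub fn read_frame<R: Read>(reader: &mut R) -> io::Result<Vec<u8>> {
    let mut prefix = [0u8; 4];
    reader.read_exact(&mut prefix)?;
    let len = u32::from_be_bytes(prefix) as usize;
    // Reject oversized frames before allocating the buffer.
    if len > MAX_FRAME_LEN {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame exceeds limit"));
    }
    let mut payload = vec![0u8; len];
    reader.read_exact(&mut payload)?;
    Ok(payload)
}
```

Because the functions are generic over `Read` and `Write`, the same sketch would apply to any byte stream that carries `EventEnvelope` frames.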
### OperationEnv — Universal Composition Mechanism
OperationEnv provides the handler-facing API for composing operations. A handler
receives `context.env.invoke(namespace, operation, input)` and gets back a
`ResponseEnvelope` — regardless of which dispatch path the operation takes
(ADR-033).
Three dispatch paths, one API:
| Path | Mechanism | Serialization | Scope |
|------|-----------|---------------|-------|
| **Local** | Direct function call through registry | None (in-process) | Same process |
| **Service** | irpc protocol enum dispatch | postcard (binary) | Same cluster |
| **Remote** | Call protocol `EventEnvelope` | JSON | Cross-node |
All three produce the same `ResponseEnvelope`. Service assembly determines
which path each operation uses:
```rust
// Minimal deployment (Phase 1: single node, all local)
let env = OperationEnv::local(local_registry);
// Production deployment (Phase 2+: mix of local and remote)
let env = OperationEnv::new()
.local("auth", auth_registry)
.local("config", config_registry)
.service("secrets", secret_irpc_client)
.remote("worker-1", call_protocol_conn);
```
**Phase boundary**: Phase 1 ships with local dispatch only (direct function
calls through the operation registry). The irpc service dispatch and remote
dispatch paths are contracted here but not built yet. irpc service protocols
(`AuthProtocol`, `SecretProtocol`, etc.) are defined in the specs but the
implementations are Phase 2+ work.
**irpc is one dispatch backend for OperationEnv, not a replacement for the
call protocol or for OperationEnv.** A call protocol handler can call an irpc
service internally (e.g., `/head/auth/verify` calls
`AuthProtocol::VerifyPubkey`) — the layers compose. irpc is behind a feature
flag in alknet-core. See [services.md](services.md) for full OperationEnv and
irpc service details.
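To make the "three paths, one API" idea concrete, here is a simplified sketch of namespace-based dispatch. It is not the alknet-core implementation: payloads are plain strings instead of `Value`, errors are strings instead of `CallError`, `invoke` takes an explicit `request_id`, and the `upper` handler and `text` namespace are hypothetical. Only the local path does real work, mirroring the Phase 1 boundary.

```rust
use std::collections::HashMap;

/// Simplified stand-in for the spec's `ResponseEnvelope`.
#[derive(Debug, PartialEq)]
pub struct ResponseEnvelope {
    pub request_id: String,
    pub result: Result<String, String>,
}

pub type LocalHandler = fn(&str) -> Result<String, String>;

/// One backend per namespace. Only `Local` is functional in this sketch.
pub enum Backend {
    Local(HashMap<String, LocalHandler>),
    Service, // irpc dispatch: Phase 2+, not modelled here
    Remote,  // call-protocol dispatch: Phase 2+, not modelled here
}

#[derive(Default)]
pub struct OperationEnv {
    backends: HashMap<String, Backend>,
}

impl OperationEnv {
    pub fn new() -> Self {
        Self::default()
    }

    /// Register a namespace that is served by in-process handlers.
    pub fn local(mut self, namespace: &str, ops: HashMap<String, LocalHandler>) -> Self {
        self.backends.insert(namespace.to_string(), Backend::Local(ops));
        self
    }

    /// Route by namespace; the caller sees the same envelope type on every path.
    pub fn invoke(&self, request_id: &str, namespace: &str, op: &str, input: &str) -> ResponseEnvelope {
        let result = match self.backends.get(namespace) {
            Some(Backend::Local(ops)) => match ops.get(op) {
                Some(handler) => handler(input),
                None => Err(format!("unknown operation {}/{}", namespace, op)),
            },
            Some(Backend::Service) | Some(Backend::Remote) => {
                Err("dispatch path not built in Phase 1".to_string())
            }
            None => Err(format!("unknown namespace {}", namespace)),
        };
        ResponseEnvelope { request_id: request_id.to_string(), result }
    }
}

/// Hypothetical handler used only for illustration.
fn upper(input: &str) -> Result<String, String> {
    Ok(input.to_uppercase())
}

/// Builds an env with a single hypothetical `text/upper` operation.
pub fn example_env() -> OperationEnv {
    let mut ops: HashMap<String, LocalHandler> = HashMap::new();
    ops.insert("upper".to_string(), upper);
    OperationEnv::new().local("text", ops)
}
```

The point of the design is visible in `invoke`: a handler that calls it never branches on where the target runs, so moving a namespace from local to service or remote would be an assembly-time change only.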
### OperationContext
Every handler receives an `OperationContext`:
```rust
pub struct OperationContext {
pub request_id: String,
pub parent_request_id: Option<String>,
pub identity: Option<Identity>,
pub metadata: HashMap<String, Value>,
pub env: OperationEnv,
pub trusted: bool, // set by buildEnv(), not by callers
}
```
- **`identity`**: The authenticated identity making the call. Populated by
`IdentityProvider` from the interface layer ([identity.md](identity.md)).
- **`env`**: The operation environment — namespaced access to other operations.
- **`trusted`**: When a handler calls another operation through `env`, the
nested call is `trusted` (skips ACL checks). This prevents double-checking:
if `/head/agent/chat` is allowed, and it internally calls
`/head/auth/verify`, the auth check is trusted.
Handler signature:
```rust
fn handle(input: Value, context: OperationContext) -> ResponseEnvelope;
```
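The `trusted` rule can be illustrated with a reduced context type. This is a sketch under assumptions: `identity` is a plain string rather than the spec's `Identity`, `metadata` and `env` are omitted, and the `root`, `nested`, and `is_allowed` helpers are hypothetical stand-ins for whatever `buildEnv()` and the ACL layer actually do.

```rust
/// Reduced stand-in for the spec's `OperationContext`.
#[derive(Debug, Clone)]
pub struct OperationContext {
    pub request_id: String,
    pub parent_request_id: Option<String>,
    pub identity: Option<String>,
    pub trusted: bool,
}

impl OperationContext {
    /// Context for a call arriving from an interface: never trusted.
    pub fn root(request_id: &str, identity: Option<String>) -> Self {
        Self {
            request_id: request_id.to_string(),
            parent_request_id: None,
            identity,
            trusted: false,
        }
    }

    /// Context for a call made through `env` by a handler that already passed ACL.
    /// The flag is set here, by the environment, never by the calling handler.
    pub fn nested(&self, child_request_id: &str) -> Self {
        Self {
            request_id: child_request_id.to_string(),
            parent_request_id: Some(self.request_id.clone()),
            identity: self.identity.clone(),
            trusted: true,
        }
    }
}

/// ACL gate: trusted nested calls skip the check, all others are evaluated.
pub fn is_allowed(ctx: &OperationContext, acl_allows: impl Fn(Option<&str>) -> bool) -> bool {
    ctx.trusted || acl_allows(ctx.identity.as_deref())
}
```

In this reading, `/head/agent/chat` would run with a root context, and its internal call to `/head/auth/verify` would run with the nested one, linked back through `parent_request_id`.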
### ResponseEnvelope
The universal return type from all three dispatch paths:
```rust
pub struct ResponseEnvelope {
pub request_id: String,
pub result: Result<Value, CallError>,
}
pub struct CallError {
pub code: String,
pub message: String,
pub retryable: bool,
}
```
Local dispatch produces `ResponseEnvelope` with no serialization. irpc service
dispatch produces postcard-encoded results that are decoded into
`ResponseEnvelope`. Remote dispatch receives `call.responded` EventEnvelope
frames and maps them to `ResponseEnvelope`. The handler always gets the same
type back.
### Relationship to @alkdev/pubsub and @alkdev/operations ### Relationship to @alkdev/pubsub and @alkdev/operations
The call protocol in core is a Rust reimplementation of the same protocol The call protocol in core is a Rust reimplementation of the same protocol
@@ -335,11 +439,11 @@ through core, out over SSH channel, into a JavaScript pubsub adapter, and
be dispatched through `@alkdev/operations`'s call handler** — with zero be dispatched through `@alkdev/operations`'s call handler** — with zero
translation at the wire level. translation at the wire level.
### Agent Service Pattern (Future) ### Agent Service Pattern (Downstream Application Concern)
An agent service — coordinating between LLM providers and tool calls — is a An agent service — coordinating between LLM providers and tool calls — is a
primary use case for the call protocol. It would be just another set of primary downstream use case for the call protocol. It would be just another set
registered operations with no special treatment: of registered operations with no special treatment:
- `/head/agent/chat` — send a message, get a completion. Routes to the - `/head/agent/chat` — send a message, get a completion. Routes to the
appropriate LLM provider based on available workers and configuration. appropriate LLM provider based on available workers and configuration.
@@ -348,12 +452,10 @@ registered operations with no special treatment:
durable storage). durable storage).
- `/head/sessions/history` — retrieve a specific session's message history. - `/head/sessions/history` — retrieve a specific session's message history.
The agent service would use the same call protocol to invoke tools on workers The agent service uses OperationEnv to invoke tools on workers. **This is a
(e.g., `/dev1/fs/readFile` for file access, `/dev1/bash/exec` for shell downstream application concern, not a core requirement.** The call protocol
commands). This is a **downstream application concern**, not a core enables it by providing the universal composition mechanism (ADR-033), but the
requirement. The call protocol enables it by providing the universal composition agent service itself is built on top, not into the core.
mechanism (OperationEnv, ADR-033), but the agent service itself is built on
top, not into the core.
## Constraints ## Constraints
@@ -370,6 +472,16 @@ top, not into the core.
boundary. ACL is enforced at the `AccessControl` level, not by path prefix boundary. ACL is enforced at the `AccessControl` level, not by path prefix
alone. A worker that exposes `/dev1/bash/exec` can restrict access via alone. A worker that exposes `/dev1/bash/exec` can restrict access via
`required_scopes` — not every authenticated identity should have shell access. `required_scopes` — not every authenticated identity should have shell access.
- **OperationEnv composition model matches the `@alkdev/operations` behavioral
contract**: namespace + operation name → invoke with input, return output.
The Rust implementation may differ in structure but must preserve this
contract (ADR-033).
- **irpc is explicitly positioned as one dispatch backend for OperationEnv**
(ADR-033, ADR-028). It is not a replacement for the call protocol or for
OperationEnv.
- **Phase 1 is local dispatch only.** irpc service dispatch and remote dispatch
are contracted in this spec but not built yet. The `OperationEnv::local()`
path is the Phase 1 implementation.
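As an illustration of the `required_scopes` constraint above, a scope check might look like the following. This is an assumption-heavy sketch: the spec names `required_scopes` but does not define its type or matching rule, so the struct layout, the "all scopes must be granted" rule, and the scope strings used in any example are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical reduced operation spec: a path plus the scopes it demands.
pub struct OperationSpec {
    pub path: String,
    pub required_scopes: Vec<String>,
}

/// Returns true only when every required scope is present in the granted set.
/// An operation with no required scopes is allowed for any authenticated caller.
pub fn has_required_scopes(spec: &OperationSpec, granted: &HashSet<String>) -> bool {
    spec.required_scopes.iter().all(|scope| granted.contains(scope))
}
```

Under this rule, a worker exposing `/dev1/bash/exec` could list a shell-specific scope so that identities holding only file-access scopes are rejected.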
## Open Questions ## Open Questions
@@ -378,9 +490,13 @@ top, not into the core.
disconnect, or heartbeat-based discovery? See disconnect, or heartbeat-based discovery? See
[open-questions.md](open-questions.md). [open-questions.md](open-questions.md).
- **OQ-22**: Should the call protocol support streaming inputs (client streaming - **OQ-22**: ~~Should the call protocol support streaming inputs (client streaming
in gRPC terms), or is client→server always a single request payload with in gRPC terms)?~~ Resolved — deferred. Current model covers all identified use
streaming only server→client? See [open-questions.md](open-questions.md). cases. See [open-questions.md](open-questions.md).
- **OQ-IF-01**: How does the `Interface` session type relate to the call
protocol's `EventEnvelope` stream? This needs design during Phase 1.8
implementation. See [open-questions.md](open-questions.md).
## Design Decisions ## Design Decisions
@@ -389,6 +505,8 @@ top, not into the core.
| [018](decisions/018-control-channel-for-pubsub.md) | Control channel for pubsub | Reserved destination for event bus | | [018](decisions/018-control-channel-for-pubsub.md) | Control channel for pubsub | Reserved destination for event bus |
| [024](decisions/024-bidirectional-call-protocol.md) | Bidirectional call protocol | Generalizes ADR-018, both sides can call | | [024](decisions/024-bidirectional-call-protocol.md) | Bidirectional call protocol | Generalizes ADR-018, both sides can call |
| [025](decisions/025-handler-spec-separation.md) | Handler/spec separation | Downstream registers operations without modifying core | | [025](decisions/025-handler-spec-separation.md) | Handler/spec separation | Downstream registers operations without modifying core |
| [028](decisions/028-auth-irpc-service.md) | Auth as irpc service | irpc is one dispatch backend for OperationEnv |
| [033](decisions/033-operationenv-irpc-call-protocol.md) | OperationEnv | Universal composition with three dispatch paths |
## References ## References
@@ -396,7 +514,10 @@ top, not into the core.
- [napi-and-pubsub.md](napi-and-pubsub.md) — NAPI wrapper and pubsub adapter - [napi-and-pubsub.md](napi-and-pubsub.md) — NAPI wrapper and pubsub adapter
- [server.md](server.md) — Channel handling and control channel routing - [server.md](server.md) — Channel handling and control channel routing
- [transport.md](transport.md) — Transport abstraction - [transport.md](transport.md) — Transport abstraction
- [configuration.md](../research/configuration.md) — ForwardingPolicy, service metadata - [identity.md](identity.md) — Identity struct, IdentityProvider trait
- [interface.md](interface.md) — Interface layer, EventEnvelope stream from interfaces
- [configuration.md](configuration.md) — ForwardingPolicy, service metadata
- [services.md](services.md) — OperationEnv, OperationContext, irpc service layer
- `@alkdev/pubsub` — TypeScript event target adapters and `EventEnvelope` - `@alkdev/pubsub` — TypeScript event target adapters and `EventEnvelope`
- `@alkdev/operations` — TypeScript call protocol, `OperationSpec`, registry - `@alkdev/operations` — TypeScript call protocol, `OperationSpec`, registry
- `@alkdev/storage` – `peer_credentials` table, ACL graph, `Identity`

@@ -69,6 +69,39 @@ impl ConfigReloadHandle {
Obtained from `Server::run()`. Passed to NAPI or CLI for explicit reload. Obtained from `Server::run()`. Passed to NAPI or CLI for explicit reload.
### ConfigServiceImpl
The Phase 1 implementation of config service logic, backed by
`ArcSwap<DynamicConfig>`. Where `ConfigIdentityProvider` wraps the auth section
of `DynamicConfig`, `ConfigServiceImpl` wraps the forwarding and rate-limit
sections. Both are ArcSwap-backed and share the same `DynamicConfig` instance.
```rust
pub struct ConfigServiceImpl {
dynamic: Arc<ArcSwap<DynamicConfig>>,
}
impl ConfigServiceImpl {
pub fn forwarding_policy(&self) -> Arc<ForwardingPolicy> {
self.dynamic.load().forwarding.clone()
}
pub fn rate_limits(&self) -> Arc<RateLimitConfig> {
self.dynamic.load().rate_limits.clone()
}
pub fn reload(&self, new_config: DynamicConfig) {
self.dynamic.store(Arc::new(new_config));
}
}
```
Phase 1 deploys `ConfigServiceImpl` directly — no irpc service boundary. The
`ConfigProtocol` irpc service (behind feature flag) wraps `ConfigServiceImpl`
for production deployments that use the service layer. This mirrors the
`ConfigIdentityProvider` / `AuthProtocol` pattern from [identity.md](identity.md)
and ADR-028.
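The snapshot behaviour that makes this pattern safe for readers can be shown without the `arc-swap` crate. In the sketch below, `RwLock<Arc<_>>` stands in for `ArcSwap<_>` purely so the example needs no external dependency; the `ConfigServiceSketch` name and the `default_allow` field are hypothetical.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical reduced policy with a single field.
#[derive(Debug, PartialEq)]
pub struct ForwardingPolicy {
    pub default_allow: bool,
}

pub struct DynamicConfig {
    pub forwarding: Arc<ForwardingPolicy>,
}

/// `RwLock<Arc<_>>` is a std-only stand-in for `ArcSwap<_>`.
pub struct ConfigServiceSketch {
    dynamic: RwLock<Arc<DynamicConfig>>,
}

impl ConfigServiceSketch {
    pub fn new(initial: DynamicConfig) -> Self {
        Self { dynamic: RwLock::new(Arc::new(initial)) }
    }

    /// Hand out a cheap `Arc` clone; the caller keeps this snapshot
    /// even if a reload happens afterwards.
    pub fn forwarding_policy(&self) -> Arc<ForwardingPolicy> {
        self.dynamic.read().unwrap().forwarding.clone()
    }

    /// Replace the whole config; existing snapshots are unaffected.
    pub fn reload(&self, new_config: DynamicConfig) {
        *self.dynamic.write().unwrap() = Arc::new(new_config);
    }
}
```

A connection that fetched its policy before a reload keeps evaluating against the old `Arc`, while new lookups see the replacement. That is the property the ArcSwap-backed design relies on.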
### ConfigService irpc Service ### ConfigService irpc Service
```rust ```rust
@@ -155,7 +188,7 @@ iroh_relay = "https://relay.alk.dev"
| Interface | Static config | Dynamic config | Reload mechanism | | Interface | Static config | Dynamic config | Reload mechanism |
|-----------|--------------|----------------|------------------| |-----------|--------------|----------------|------------------|
| CLI | Flags + optional `--config` file | Loaded at startup from `--authorized-keys` | None (restart to change) | | CLI | Flags + optional `--config` file | Loaded at startup from `--authorized-keys` | None (restart to change) |
| Core Rust | `StaticConfig` struct | `AuthService` (irpc) or `ArcSwap<DynamicConfig>` (minimal) | `ConfigService::reload()` or `ConfigReloadHandle::reload()` | | Core Rust | `StaticConfig` struct | `AuthProtocol` (irpc) or `ConfigIdentityProvider` (ArcSwap) | `ConfigProtocol::ReloadDynamicConfig` or `ConfigReloadHandle::reload()` |
| NAPI | `serve()` options | Same | `server.reloadAuth()`, `server.reloadForwarding()` | | NAPI | `serve()` options | Same | `server.reloadAuth()`, `server.reloadForwarding()` |
## Constraints ## Constraints

@@ -23,4 +23,4 @@ This makes adding a new transport (e.g., WebSocket, QUIC directly) a matter of i
## References ## References
- [transport.md](../transport.md) - [transport.md](../transport.md)
- [Feasibility assessment §3](../../../../conversations/research/ssh-tunnel-vpn-alternative-feasibility.md) - [Feasibility assessment §3](../../research/feasibility/ssh-tunnel-vpn-alternative-feasibility.md)

@@ -28,4 +28,4 @@ Option 3 was rejected because it would require modifying russh to understand iro
## References ## References
- [transport.md](../transport.md) - [transport.md](../transport.md)
- [Feasibility assessment §11](../../../../conversations/research/ssh-tunnel-vpn-alternative-feasibility.md) - [Feasibility assessment §11](../../research/feasibility/ssh-tunnel-vpn-alternative-feasibility.md)

@@ -25,4 +25,4 @@ This is directly enabled by russh's `connect_stream()` and `run_stream()` APIs,
## References ## References
- [transport.md](../transport.md) - [transport.md](../transport.md)
- [Feasibility assessment §3.4](../../../../conversations/research/ssh-tunnel-vpn-alternative-feasibility.md) - [Feasibility assessment §3.4](../../research/feasibility/ssh-tunnel-vpn-alternative-feasibility.md)

@@ -4,7 +4,7 @@
Accepted Accepted
## Context ## Context
TLS transport mode requires certificates. Manual certificate management is error-prone — users need to obtain, install, and renew certificates. Our production setup uses certbot with Let's Encrypt (documented in `/workspace/system/dev1/certbot.md`), which automates this via the ACME protocol. TLS transport mode requires certificates. Manual certificate management is error-prone — users need to obtain, install, and renew certificates. Our production setup uses certbot with Let's Encrypt (documented in [certbot.md](../../research/ops/certbot.md)), which automates this via the ACME protocol.
There are two ACME flows: There are two ACME flows:
1. **Domain-based**: Standard flow with DNS-01 or HTTP-01 challenge. Certificate is tied to a domain name, auto-renews via certbot/systemd timer. Requires port 80 or DNS access for challenges. 1. **Domain-based**: Standard flow with DNS-01 or HTTP-01 challenge. Certificate is tied to a domain name, auto-renews via certbot/systemd timer. Requires port 80 or DNS access for challenges.
@@ -35,4 +35,4 @@ The implementation should use the `rustls-acme` crate (or similar pure-Rust ACME
- [server.md](../server.md) - [server.md](../server.md)
- [OQ-01](../open-questions.md) — resolved by this ADR - [OQ-01](../open-questions.md) — resolved by this ADR
- [OQ-07](../open-questions.md) — resolved by this ADR - [OQ-07](../open-questions.md) — resolved by this ADR
- Production certbot setup: `/workspace/system/dev1/certbot.md` - Production certbot setup: [certbot.md](../../research/ops/certbot.md)

@@ -4,7 +4,7 @@
Accepted Accepted
## Context ## Context
The server needs to handle abuse on public-facing deployments. Our production infrastructure uses fail2ban on Linux (documented in `/workspace/system/dev1/fail2ban.md`) with nftables and systemd journal backend. fail2ban needs structured, parseable logs to identify abusive IP addresses. The server needs to handle abuse on public-facing deployments. Our production infrastructure uses fail2ban on Linux (documented in [fail2ban.md](../../research/ops/fail2ban.md)) with nftables and systemd journal backend. fail2ban needs structured, parseable logs to identify abusive IP addresses.
However, fail2ban is Linux-specific. On other platforms (macOS, Windows, BSD), users need a different approach to reject abusive connections. The server should provide enough logging for fail2ban on Linux and enough built-in protection for other platforms. However, fail2ban is Linux-specific. On other platforms (macOS, Windows, BSD), users need a different approach to reject abusive connections. The server should provide enough logging for fail2ban on Linux and enough built-in protection for other platforms.
@@ -36,4 +36,4 @@ This ensures that even without fail2ban, the server rejects obviously abusive co
## References ## References
- [server.md](../server.md) - [server.md](../server.md)
- [OQ-08](../open-questions.md) — resolved by this ADR - [OQ-08](../open-questions.md) — resolved by this ADR
- Production fail2ban setup: `/workspace/system/dev1/fail2ban.md` - Production fail2ban setup: [fail2ban.md](../../research/ops/fail2ban.md)

@@ -64,17 +64,30 @@ format, but not as a crate dependency.
### Dependency Graph ### Dependency Graph
``` ```
alknet-secret alknet-secret alknet-storage alknet-flowgraph
/ \ (standalone) (standalone) (standalone)
/ \
alknet-core ←──── ←── alknet-storage (feature flags (trait impl │ (type compat
\ / in CLI binary) via CLI wire) via JSON)
alknet-flowgraph ▼ ▼
┌─────────────────────┐
alknet-napi │ alknet-core │
alknet (CLI binary — assembles everything) │ (transport, SSH, │
│ call protocol, │
│ Identity, Config) │
└─────────┬───────────┘
┌────────────┼────────────┐
▼ ▼ ▼
alknet-napi alknet (CLI binary — assembles everything)
``` ```
All four library crates (core, secret, storage, flowgraph) are independent of
each other. Dependencies flow **upward** only. The CLI binary sits at the top
and wires concrete implementations together. alknet-storage implements
alknet-core's `IdentityProvider` trait without a crate dependency — the CLI
binary provides the bridge.
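One way to read "implements the trait without a crate dependency" is an adapter that lives in the binary. The sketch below models the three crates as modules; it is an interpretation, not the actual layout. The `identify` method, the `CredentialStore` type, and its `lookup_name` method are hypothetical, since the real `IdentityProvider` signature is defined in identity.md.

```rust
/// Stand-in for alknet-core: owns the trait and the `Identity` type.
pub mod core_crate {
    #[derive(Debug, PartialEq)]
    pub struct Identity {
        pub name: String,
    }

    pub trait IdentityProvider {
        fn identify(&self, credential: &str) -> Option<Identity>;
    }
}

/// Stand-in for alknet-storage: has no reference to `core_crate`.
pub mod storage_crate {
    use std::collections::HashMap;

    pub struct CredentialStore {
        pub by_fingerprint: HashMap<String, String>,
    }

    impl CredentialStore {
        pub fn lookup_name(&self, fingerprint: &str) -> Option<&str> {
            self.by_fingerprint.get(fingerprint).map(String::as_str)
        }
    }
}

/// Stand-in for the CLI binary: the only place that sees both sides.
pub mod cli_bridge {
    use super::core_crate::{Identity, IdentityProvider};
    use super::storage_crate::CredentialStore;

    /// Newtype adapter that makes the storage type usable where core expects the trait.
    pub struct StorageIdentityProvider(pub CredentialStore);

    impl IdentityProvider for StorageIdentityProvider {
        fn identify(&self, credential: &str) -> Option<Identity> {
            self.0
                .lookup_name(credential)
                .map(|name| Identity { name: name.to_string() })
        }
    }
}
```

With this shape, neither library module imports the other, and swapping the storage backend only touches the adapter in the binary.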
### Narrow Interface Points ### Narrow Interface Points
Three types serve as the narrow interface points between crates: Three types serve as the narrow interface points between crates:
@@ -147,4 +160,5 @@ alknet-storage does NOT depend on alknet-secret as a crate. Instead:
- [research/services.md](../../research/services.md) — Service protocols - [research/services.md](../../research/services.md) — Service protocols
- [research/storage.md](../../research/storage.md) — alknet-storage contents - [research/storage.md](../../research/storage.md) — alknet-storage contents
- [research/flow.md](../../research/flow.md) — alknet-flowgraph contents - [research/flow.md](../../research/flow.md) — alknet-flowgraph contents
- [ADR-028](028-auth-irpc-service.md) — Auth as irpc service (service protocol enabled by decomposition)
- [ADR-029](029-identity-core-type.md) — Identity as core type (narrow interface point) - [ADR-029](029-identity-core-type.md) — Identity as core type (narrow interface point)

@@ -93,4 +93,4 @@ propagate beyond the service boundary without projection.
- [research/services.md](../../research/services.md) — Event boundary discipline section - [research/services.md](../../research/services.md) — Event boundary discipline section
- [research/storage.md](../../research/storage.md) — Honker integration, event boundaries - [research/storage.md](../../research/storage.md) — Honker integration, event boundaries
- [research/integration-plan.md](../../research/integration-plan.md) — ADR 032 entry - [research/integration-plan.md](../../research/integration-plan.md) — ADR 032 entry
- [event_source_types.md](/workspace/research/event_sourcing/event_source_types.md) — Event-driven architecture patterns - [event_source_types.md](../../research/event-sourcing/event_source_types.md) — Event-driven architecture patterns

@@ -125,6 +125,8 @@ operations universally composable across all interfaces.
- [research/services.md](../../research/services.md) — OperationContext, OperationEnv - [research/services.md](../../research/services.md) — OperationContext, OperationEnv
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.5, OperationEnv wiring - [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.5, OperationEnv wiring
- [ADR-026](026-transport-interface-separation.md) — Three-layer model (OperationEnv is Layer 3)
- [ADR-028](028-auth-irpc-service.md) — Auth as irpc service (one dispatch backend)
- [ADR-032](032-event-boundary-discipline.md) — Event boundary discipline - [ADR-032](032-event-boundary-discipline.md) — Event boundary discipline
- [ADR-024](024-bidirectional-call-protocol.md) — Bidirectional call protocol - [ADR-024](024-bidirectional-call-protocol.md) — Bidirectional call protocol
- [ADR-025](025-handler-spec-separation.md) — Handler/spec separation - [ADR-025](025-handler-spec-separation.md) — Handler/spec separation

@@ -1,6 +1,6 @@
--- ---
status: reviewed status: reviewed
last_updated: 2026-06-02 last_updated: 2026-06-07
--- ---
# NAPI Wrapper & PubSub Event Target # NAPI Wrapper & PubSub Event Target
@@ -71,11 +71,36 @@ function serve(options: AlknetServeOptions): Promise<AlknetServer>;
interface AlknetServer { interface AlknetServer {
close(): Promise<void>; close(): Promise<void>;
onConnection(callback: (stream: Duplex, info: ConnectionInfo) => void): void; onConnection(callback: (stream: Duplex, info: ConnectionInfo) => void): void;
// Dynamic config reload (ADR-030)
reloadAuth(auth: { authorizedKeys?: Buffer, certAuthority?: Buffer }): void;
reloadForwarding(policy: ForwardingPolicyConfig): void;
reloadAll(config: DynamicConfig): void;
}
interface ForwardingPolicyConfig {
default: 'allow' | 'deny';
rules: ForwardingRuleConfig[];
}
interface ForwardingRuleConfig {
target: string; // "localhost:*", "10.0.0.0/8:80", "alknet-*"
action: 'allow' | 'deny';
principals?: string[]; // default ["*"]
} }
``` ```
The NAPI layer is **transport-agnostic** — it doesn't know about pubsub's `EventEnvelope`. The pubsub adapter wraps the `Duplex` stream to implement `TypedEventTarget`. This separation ensures the NAPI wrapper is reusable for any stream-based protocol, not tied specifically to pubsub. The NAPI layer is **transport-agnostic** — it doesn't know about pubsub's `EventEnvelope`. The pubsub adapter wraps the `Duplex` stream to implement `TypedEventTarget`. This separation ensures the NAPI wrapper is reusable for any stream-based protocol, not tied specifically to pubsub.
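For orientation, the `ForwardingPolicyConfig` shape shown above could be evaluated on the Rust side roughly as follows. The evaluation semantics are not stated in this spec, so everything here is an assumption: first matching rule wins, a trailing `*` is the only wildcard, CIDR targets such as `10.0.0.0/8:80` are compared literally rather than parsed, and `default_action` is a renamed stand-in for the `default` field.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    Allow,
    Deny,
}

pub struct ForwardingRule {
    pub target: String,
    pub action: Action,
    pub principals: Vec<String>, // ["*"] means any principal
}

pub struct ForwardingPolicy {
    pub default_action: Action,
    pub rules: Vec<ForwardingRule>,
}

/// Exact match, or prefix match when the pattern ends in `*`.
fn target_matches(pattern: &str, target: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => target.starts_with(prefix),
        None => pattern == target,
    }
}

fn principal_matches(rule: &ForwardingRule, principal: &str) -> bool {
    rule.principals.iter().any(|p| p == "*" || p == principal)
}

impl ForwardingPolicy {
    /// First rule matching both target and principal decides; otherwise the default applies.
    pub fn decide(&self, principal: &str, target: &str) -> Action {
        self.rules
            .iter()
            .find(|rule| target_matches(&rule.target, target) && principal_matches(rule, principal))
            .map(|rule| rule.action)
            .unwrap_or(self.default_action)
    }
}
```

A deny-by-default policy with a few narrow allow rules is the natural fit for this shape; see configuration.md for the authoritative `ForwardingPolicy` definition.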
### NAPI Call Protocol Integration
NAPI consumers can register operation handlers to participate in the call protocol. The `Duplex` stream from `connect()` or `serve()` carries `EventEnvelope` frames (4-byte BE length prefix + JSON). A TypeScript consumer can implement a call protocol handler that reads these frames and dispatches to registered operations — the same wire protocol used by `@alkdev/operations`.
See [call-protocol.md](call-protocol.md) for the call protocol spec and [services.md](services.md) for OperationEnv and dispatch paths.
### NAPI irpc Service Creation
Behind the `irpc` feature flag, NAPI consumers can create irpc service instances for in-cluster communication. This is a Phase 2+ capability — Phase 1 uses `ConfigIdentityProvider` and direct `ConfigReloadHandle` calls. See [services.md](services.md) for the irpc service layer and ADR-027 for crate decomposition.
### NAPI `connect()` vs CLI `alknet connect` ### NAPI `connect()` vs CLI `alknet connect`
The NAPI `connect()` function and the CLI `alknet connect` command are fundamentally different operations despite sharing the same name: The NAPI `connect()` function and the CLI `alknet connect` command are fundamentally different operations despite sharing the same name:
@@ -155,3 +180,10 @@ None — all resolved.
| [015](decisions/015-napi-rs-for-ffi-bridge.md) | napi-rs for FFI | Standard Node.js native addon tooling |
| [016](decisions/016-napi-expose-connect-and-serve.md) | Both connect() and serve() | NAPI exposes client and server sides from the start |
| [018](decisions/018-control-channel-for-pubsub.md) | Control channel for pubsub | Reserved `alknet-control` destination for event bus |
| [030](decisions/030-static-dynamic-config-split.md) | Static/dynamic config split | NAPI reload methods for auth, forwarding, and all dynamic config |
## References
- [configuration.md](configuration.md) — DynamicConfig, ForwardingPolicy, reload mechanism
- [services.md](services.md) — OperationEnv, irpc service layer
- [call-protocol.md](call-protocol.md) — Call protocol wire format and operation registry


@@ -105,7 +105,7 @@ last_updated: 2026-06-07
- **Origin**: [research/configuration.md](../research/configuration.md)
- **Status**: resolved
- **Priority**: low
- **Resolution**: No file watching. CLI loads once at startup; NAPI/hub reload explicitly. File watching is a potential attack vector and unnecessary complexity for a security tool. - **Resolution**: No file watching. CLI loads once at startup; NAPI/head reload explicitly. File watching is a potential attack vector and unnecessary complexity for a security tool.
- **Cross-references**: configuration.md
### OQ-14: ArcSwap vs RwLock for dynamic config
@@ -221,11 +221,18 @@ last_updated: 2026-06-07
### OQ-SVC-04: Should workers cache derived keys locally?
- **Origin**: [secret-service.md](secret-service.md)
- **Status**: open - **Status**: ~~resolved~~
- **Priority**: low
- **Resolution**: Yes, with a TTL (default: 1 hour). The head can revoke by invalidating the session.
- **Cross-references**: [secret-service.md](secret-service.md)
### OQ-SVC-05: How does the NFT-based ACL smart contract interact with the secret service?
- **Origin**: [storage.md](storage.md)
- **Status**: open
- **Priority**: low
- **Resolution**: The Ethereum signing key (`m/44'/60'/0'/0/0`) is derived from the same seed as the secret service. The smart contract is a separate concern — it reads on-chain ACL state, it doesn't call the secret service.
- **Cross-references**: [storage.md](storage.md), [secret-service.md](secret-service.md)
## Interface
### OQ-IF-01: How does the Interface session type relate to the call protocol's EventEnvelope stream?


@@ -1,6 +1,6 @@
---
status: reviewed
last_updated: 2026-06-02 last_updated: 2026-06-07
---
# Alknet Overview
@@ -16,6 +16,64 @@ Alknet is a self-hostable SSH-based tunnel tool that provides VPN-like functiona
The core insight: SSH tunnels work because SSH is fundamental infrastructure. Blocking it breaks the internet. Alknet makes SSH tunneling accessible through a simple CLI with pluggable transports.
## Crate Structure
Alknet is decomposed into six crates with a strict acyclic dependency graph (ADR-027):
| Crate | Purpose | Exists Now? |
|-------|---------|-------------|
| **alknet-core** | Transport, SSH, call protocol, config, auth types, `OperationSpec`, `Interface` trait | Yes |
| **alknet-napi** | Node.js native addon via napi-rs | Yes |
| **alknet-secret** | BIP39, SLIP-0010 HD key derivation, AES-256-GCM, `SecretProtocol` irpc service | Phase 2+ |
| **alknet-storage** | SQLite-backed metagraph, identity tables, ACL graph, honker, `StorageProtocol` | Phase 2+ |
| **alknet-flowgraph** | `FlowGraph<N,E>` over petgraph, operation graph, call graph | Phase 2+ |
| **alknet** (CLI) | Binary that assembles everything with feature flags | Yes |
The four library crates (core, secret, storage, flowgraph) are independent of each other. Dependencies flow upward only: the CLI binary sits at the top and wires concrete implementations together. alknet-storage implements alknet-core's `IdentityProvider` trait without a crate dependency — the CLI binary provides the bridge.
irpc is behind a feature flag in alknet-core. Nodes that only do SSH tunneling don't need the service layer overhead.
## Three-Layer Model
Alknet uses a three-layer model (ADR-026):
| Layer | Responsibility | Examples |
|-------|---------------|----------|
| **Layer 1: Transport** | Produces byte streams (`AsyncRead + AsyncWrite + Unpin + Send`) | TCP, TLS, iroh, DNS (future), WebTransport (future) |
| **Layer 2: Interface** | Consumes a transport stream and produces call protocol sessions | SSH (handshake + auth + channel multiplexing), raw framing (length-prefix + JSON) |
| **Layer 3: Protocol** | Carries semantics — operation registry, service calls, events | Call protocol, OperationEnv, operation dispatch |
SSH is an interface, not a transport. The three-layer model enables DNS control channels (DNS transport + raw framing), local service mesh (TCP + raw framing), and browser direct call protocol (WebTransport + raw framing) without wrapping SSH inside those transports.
A connection is always a (Transport, Interface) pair. The protocol layer is agnostic to both.
## Service Layer
The irpc service layer decomposes alknet's core responsibilities into independently testable, deployable, and replaceable components (ADR-033, [services.md](services.md)):
- **Auth** (`AuthProtocol`) — verify identities, check credentials
- **Secret** (`SecretProtocol`) — derive keys, encrypt/decrypt
- **Config** (`ConfigProtocol`) — dynamic config reload
- **Storage** (`StorageProtocol`) — graph CRUD, metagraph operations
**OperationEnv** is the universal composition mechanism. A handler receives `context.env.invoke("secrets", "derive", input)` and doesn't know whether the dispatch is local (direct function call), in-cluster (irpc service), or cross-node (call protocol `EventEnvelope`). Three dispatch paths, one handler-facing API.
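The "three dispatch paths, one handler-facing API" idea can be illustrated with a small sketch. Everything here is hypothetical: the `Dispatch` trait, the `Local` and `RemoteStub` types, and the string-typed input and output are stand-ins chosen to keep the example dependency-free. The real `OperationEnv` presumably carries structured values and async calls.

```rust
use std::collections::HashMap;

/// One dispatch path (hypothetical trait). Real paths would be a direct
/// function call, an irpc service, or a call protocol `EventEnvelope`.
pub trait Dispatch {
    fn dispatch(&self, op: &str, input: &str) -> Result<String, String>;
}

/// Local path: wraps a plain function or closure.
pub struct Local<F>(pub F);

impl<F: Fn(&str, &str) -> Result<String, String>> Dispatch for Local<F> {
    fn dispatch(&self, op: &str, input: &str) -> Result<String, String> {
        (self.0)(op, input)
    }
}

/// Stand-in for a remote path; it only echoes what it would have sent.
pub struct RemoteStub {
    pub node: String,
}

impl Dispatch for RemoteStub {
    fn dispatch(&self, op: &str, input: &str) -> Result<String, String> {
        Ok(format!("{}:{}({})", self.node, op, input))
    }
}

/// Routes a namespace to whichever path was registered for it.
#[derive(Default)]
pub struct OperationEnv {
    routes: HashMap<String, Box<dyn Dispatch>>,
}

impl OperationEnv {
    pub fn register(&mut self, namespace: &str, path: Box<dyn Dispatch>) {
        self.routes.insert(namespace.to_string(), path);
    }

    /// The handler-facing call: identical regardless of the path behind it.
    pub fn invoke(&self, namespace: &str, op: &str, input: &str) -> Result<String, String> {
        let path = self
            .routes
            .get(namespace)
            .ok_or_else(|| format!("unknown namespace: {}", namespace))?;
        path.dispatch(op, input)
    }
}
```

The point of the sketch is that the handler only ever sees `invoke`; swapping a `Local` registration for a remote one changes deployment topology without touching handler code.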
**Phase boundary**: Phase 1 ships `ConfigIdentityProvider` (ArcSwap-backed) and `ConfigServiceImpl` (ArcSwap-backed) as the only auth and config implementations. The irpc service protocols (`AuthProtocol`, `SecretProtocol`, etc.) and the production deployment topology (multi-node with `StorageIdentityProvider`) are contracted in the specs but will be implemented in Phase 2+. Application services (DockerService, NodeService, agent services) are downstream concerns that build on top of the call protocol and OperationEnv.
## Identity
`Identity` struct and `IdentityProvider` trait are core types in alknet-core (ADR-029, [identity.md](identity.md)):
```rust
use std::collections::HashMap;

pub struct Identity {
    pub id: String,                              // Fingerprint (config auth) or account UUID (database auth)
    pub scopes: Vec<String>,                     // Authorization scope strings
    pub resources: HashMap<String, Vec<String>>, // Resource-level authorization
}
```
`IdentityProvider` decouples alknet-core from identity storage. Phase 1 ships `ConfigIdentityProvider` (reads from `ArcSwap<DynamicConfig.auth>`). `StorageIdentityProvider` (Phase 2+, backed by SQLite) replaces it for production deployments. Both produce the same `Identity` result.
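As a rough illustration of how a trait can decouple callers from identity storage, the sketch below pairs the `Identity` struct with an in-memory provider. The trait signature is an assumption (synchronous, returning `Option<Identity>`); `InMemoryIdentityProvider` and `authenticate` are hypothetical names that only mimic the "every authorized key gets a default scope set" behavior described for `ConfigIdentityProvider`.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
    pub resources: HashMap<String, Vec<String>>,
}

/// Assumed synchronous shape of the trait; the real signature may differ.
pub trait IdentityProvider {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
}

/// Hypothetical stand-in for a config-backed provider.
pub struct InMemoryIdentityProvider {
    pub authorized_fingerprints: Vec<String>,
    pub default_scopes: Vec<String>,
}

impl IdentityProvider for InMemoryIdentityProvider {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity> {
        let known = self
            .authorized_fingerprints
            .iter()
            .any(|f| f.as_str() == fingerprint);
        if !known {
            return None;
        }
        // Every authorized key receives the same default scope set.
        Some(Identity {
            id: fingerprint.to_string(),
            scopes: self.default_scopes.clone(),
            resources: HashMap::new(),
        })
    }
}

/// Caller code depends only on the trait, not on the backing store.
pub fn authenticate(provider: &dyn IdentityProvider, fingerprint: &str) -> Result<Identity, String> {
    provider
        .resolve_from_fingerprint(fingerprint)
        .ok_or_else(|| "reject".to_string())
}
```

A storage-backed provider would implement the same trait and return the same `Identity` shape, so `authenticate` would not need to change.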
## Exports
### Binary: `alknet`
@@ -35,24 +93,40 @@ The `alknet-core` crate exports the pluggable components for embedding or progra
- `TcpTransport` — direct TCP connection
- `TlsTransport` — TCP + tokio-rustls TLS
- `IrohTransport` — iroh QUIC P2P connection
- `Interface` trait — consumes transport stream, produces call protocol session
- `Socks5Server` — local SOCKS5 proxy that forwards through SSH channels
- `PortForwarder` — manages local/remote port forwards
- `ServerHandler` — russh server handler with configurable auth and channel policies
- `ConnectOptions` / `ServeOptions` — programmatic configuration structs (no file parsing) - `Identity` / `IdentityProvider` — core identity types (ADR-029)
- `OperationSpec` — operation registration for call protocol (ADR-025)
- `ConnectOptions` / `ServeOptions` — programmatic configuration structs
- `StaticConfig` / `DynamicConfig` — static/immutable vs. hot-reloadable config (ADR-030)
- `ConfigReloadHandle` — programmatic reload of dynamic config
## Dependencies
| Dependency | Purpose | Feature-gated | | Dependency | Purpose | Crate | Feature-gated |
|------------|---------|---------------| |------------|---------|-------|---------------|
| `russh` | SSH client & server | No (core) | | `russh` | SSH client & server | core | No (core) |
| `tokio` | Async runtime | No (core) | | `tokio` | Async runtime | core | No (core) |
| `tokio-rustls` | TLS wrapping | Yes (`tls`) | | `tokio-rustls` | TLS wrapping | core | Yes (`tls`) |
| `rustls` | TLS implementation | Yes (`tls`) | | `rustls` | TLS implementation | core | Yes (`tls`) |
| `rustls-acme` | ACME/Let's Encrypt auto-cert | Yes (`acme`) | | `rustls-acme` | ACME/Let's Encrypt auto-cert | core | Yes (`acme`) |
| `iroh` | P2P QUIC transport | Yes (`iroh`) | | `iroh` | P2P QUIC transport | core | Yes (`iroh`) |
| `clap` | CLI argument parsing | No (core) | | `irpc` | Streaming RPC service layer | core | Yes (`irpc`) |
| `tracing` | Structured logging | No (core) | | `arc-swap` | Lock-free dynamic config | core | No (core) |
| `anyhow` / `thiserror` | Error handling | No (core) | | `serde` | Serialization | core | No (core) |
| `clap` | CLI argument parsing | CLI | No (CLI) |
| `toml` | TOML config file | CLI | No (CLI) |
| `tracing` | Structured logging | core | No (core) |
| `anyhow` / `thiserror` | Error handling | core | No (core) |
| `bip39` | Mnemonic generation | secret | No (secret) |
| `ed25519-bip32` | HD key derivation | secret | No (secret) |
| `aes-gcm` | AES-256-GCM encryption | secret | No (secret) |
| `rusqlite` | SQLite (via honker) | storage | No (storage) |
| `honker` | Event-sourced storage | storage | No (storage) |
| `petgraph` | Graph data structure | storage, flowgraph | No |
| `jsonschema` | JSON Schema validation | storage, flowgraph | No |
> Note: `tun-rs` is no longer a dependency. TUN support is deferred in favor of the external `tun2proxy` tool (ADR-014).
@@ -60,19 +134,29 @@ The `alknet-core` crate exports the pluggable components for embedding or progra
1. **SSH runs over transport, not alongside** — The transport layer produces a single `AsyncRead+AsyncWrite+Unpin+Send` stream. SSH runs over that stream via `russh::client::connect_stream()` / `russh::server::run_stream()`. The SSH layer never knows what transport it's on. (ADR-001, ADR-004)
2. **SOCKS5 is the primary client interface** — Port forwarding is built on top of SOCKS5-like channel management. For VPN-like "route all traffic" behavior, users run `tun2proxy` alongside alknet's SOCKS5 proxy. TUN is not in the project scope. (ADR-005, ADR-014) 2. **Three-layer model: Transport, Interface, Protocol** — SSH is an interface (Layer 2), not a transport (Layer 1). A connection is always a (Transport, Interface) pair. The call protocol (Layer 3) is agnostic to both. This enables DNS control channels, raw framing, and WebTransport direct call protocol without wrapping SSH inside those transports. (ADR-026)
3. **No logging of tunnel destinations** — The server logs auth attempts and connections (for fail2ban) but does not log `channel_open_direct_tcpip` destinations, DNS lookups, or bytes transferred. (ADR-006, ADR-013) 3. **SOCKS5 is the primary client interface** — Port forwarding is built on top of SOCKS5-like channel management. For VPN-like "route all traffic" behavior, users run `tun2proxy` alongside alknet's SOCKS5 proxy. TUN is not in the project scope. (ADR-005, ADR-014)
4. **Programmatic-first API** — Configuration via CLI flags, library API structs (`ConnectOptions`, `ServeOptions`), and environment variables. No `~/.ssh/config` parsing, no custom config files. (ADR-011) 4. **No logging of tunnel destinations** — The server logs auth attempts and connections (for fail2ban) but does not log `channel_open_direct_tcpip` destinations, DNS lookups, or bytes transferred. (ADR-006, ADR-013)
5. **Feature flags control transport inclusion** — `tls`, `iroh`, `acme` are feature-gated so the base install is lean. Users opt in to heavier dependencies. 5. **Programmatic-first API** — Configuration via CLI flags, library API structs (`ConnectOptions`, `ServeOptions`), and environment variables. No `~/.ssh/config` parsing. Optional `--config` TOML file for reproducible deployments. (ADR-011, ADR-030)
6. **Authentication is key-based** — Ed25519 public key (default) and OpenSSH certificate authority. No password authentication over SSH. (ADR-012) 6. **Feature flags control transport inclusion** — `tls`, `iroh`, `acme`, `irpc` are feature-gated so the base install is lean. Users opt in to heavier dependencies.
7. **NAPI exposes both connect() and serve()** — The napi-rs wrapper provides client and server functionality, using napi-rs as the FFI bridge. The NAPI layer is transport-agnostic and not tied to pubsub. (ADR-015, ADR-016) 7. **Authentication is key-based and unified** — Ed25519 public key (default) and OpenSSH certificate authority. Same key material for SSH and token auth. Identity resolves through `IdentityProvider` trait, decoupling core from identity storage. (ADR-012, ADR-023, ADR-029)
8. **Error handling follows a consistent layered pattern** — Transport and auth errors cause reconnection (client, with exponential backoff) or connection rejection (server). Channel-level errors (target unreachable, proxy failure) close the individual channel without killing the session. Library API errors propagate via `anyhow::Result` / `thiserror` types. CLI reports errors to stderr with appropriate exit codes. NAPI errors are marshalled as JavaScript exceptions. 8. **NAPI exposes both connect() and serve()** — The napi-rs wrapper provides client and server functionality, using napi-rs as the FFI bridge. The NAPI layer is transport-agnostic and not tied to pubsub. (ADR-015, ADR-016)
9. **Static/dynamic config split** — Transport-level settings (listen address, TLS certs) are immutable after startup. Auth, forwarding policy, and rate limits are hot-reloadable via `ArcSwap<DynamicConfig>`. (ADR-030)
10. **Forwarding policy enforced before proxy spawn** — Each `channel_open_direct_tcpip` is checked against `ForwardingPolicy` before a TCP connection is made. Default-allow preserves current behavior. (ADR-031)
11. **OperationEnv as universal composition mechanism** — Handlers call `context.env.invoke(namespace, op, input)` regardless of dispatch path (local, irpc service, remote call protocol). (ADR-033)
12. **Event boundary discipline** — Domain events (Honker streams) stay within the owning service. irpc calls are synchronous and in-cluster. Call protocol `EventEnvelope` is the only thing that crosses node boundaries. (ADR-032)
13. **Error handling follows a consistent layered pattern** — Transport and auth errors cause reconnection (client, with exponential backoff) or connection rejection (server). Channel-level errors (target unreachable, proxy failure) close the individual channel without killing the session. Library API errors propagate via `anyhow::Result` / `thiserror` types. CLI reports errors to stderr with appropriate exit codes. NAPI errors are marshalled as JavaScript exceptions.
## Design Decisions
@@ -88,7 +172,7 @@ The `alknet-core` crate exports the pluggable components for embedding or progra
| [008](decisions/008-acme-lets-encrypt.md) | ACME/Let's Encrypt | Auto-provision TLS certs, domain and IP paths |
| [009](decisions/009-default-iroh-relay.md) | Default iroh relay | n0 relay by default, `--iroh-relay` override |
| [010](decisions/010-transport-chaining-cli.md) | Transport chaining | `--proxy` works with all transports natively |
| [011](decisions/011-no-ssh-config-programmatic-api.md) | Programmatic-first | No file-based config; options are structs, env vars, CLI flags | | [011](decisions/011-no-ssh-config-programmatic-api.md) | Programmatic-first | No SSH config files; options are structs, env vars, CLI flags (amended by ADR-030 for optional TOML) |
| [012](decisions/012-auth-ed25519-and-cert-authority.md) | Key + cert-authority | Ed25519 keys + OpenSSH CA; no password auth |
| [013](decisions/013-fail2ban-friendly-logging.md) | Fail2ban-friendly | Structured auth logs + built-in rate limiting |
| [014](decisions/014-defer-tun-recommend-socks5-proxy.md) | Defer TUN | Use tun2proxy for VPN-like behavior; no alknet-tun binary |
@@ -97,17 +181,46 @@ The `alknet-core` crate exports the pluggable components for embedding or progra
| [017](decisions/017-stealth-mode-protocol-multiplexing.md) | Stealth mode | Protocol multiplexing on port 443 |
| [018](decisions/018-control-channel-for-pubsub.md) | Control channel | Reserved `alknet-control` destination for pubsub |
| [019](decisions/019-proxy-dual-semantics.md) | Proxy dual semantics | `--proxy` routes transport on client, data on server |
| [023](decisions/023-unified-auth-shared-key-material.md) | Unified auth | Same key material for SSH and token auth |
| [024](decisions/024-bidirectional-call-protocol.md) | Bidirectional call protocol | Both sides can initiate calls |
| [025](decisions/025-handler-spec-separation.md) | Handler/spec separation | Downstream registers operations without modifying core |
| [026](decisions/026-transport-interface-separation.md) | Three-layer model | SSH is Layer 2, not Layer 1 |
| [027](decisions/027-crate-decomposition.md) | Crate decomposition | Six crates, acyclic deps, feature-gated irpc |
| [028](decisions/028-auth-irpc-service.md) | Auth as irpc service | IdentityProvider is the contract, irpc is one backend |
| [029](decisions/029-identity-core-type.md) | Identity as core type | `Identity` and `IdentityProvider` in alknet-core |
| [030](decisions/030-static-dynamic-config-split.md) | Static/dynamic config | ArcSwap for hot-reloadable auth and forwarding |
| [031](decisions/031-forwarding-policy.md) | Forwarding policy | Per-identity, per-destination, per-transport rules |
| [032](decisions/032-event-boundary-discipline.md) | Event boundary | Domain events never cross service boundaries |
| [033](decisions/033-operationenv-irpc-call-protocol.md) | OperationEnv | Universal composition, three dispatch paths |
| [034](decisions/034-head-worker-terminology.md) | Head/worker | Replaces hub/spoke terminology |
## Open Questions
All open questions have been resolved. See [open-questions.md](open-questions.md) for resolution details. See [open-questions.md](open-questions.md) for all open and resolved questions.
Key open questions: OQ-15 (QUIC coexistence), OQ-19 (WebTransport TLS),
OQ-20 (worker registration), OQ-IF-01 (Interface session / EventEnvelope
relationship).
## References
- [Feasibility Assessment](../../../conversations/research/ssh-tunnel-vpn-alternative-feasibility.md) - [transport.md](transport.md) — Transport abstraction (Layer 1)
- [interface.md](interface.md) — Interface layer (Layer 2)
- [call-protocol.md](call-protocol.md) — Call protocol (Layer 3)
- [auth.md](auth.md) — Unified authentication
- [identity.md](identity.md) — Identity and IdentityProvider
- [configuration.md](configuration.md) — StaticConfig, DynamicConfig, ForwardingPolicy
- [services.md](services.md) — irpc service layer, OperationEnv
- [server.md](server.md) — Server acceptance, channel handling
- [client.md](client.md) — Client connection, SOCKS5, port forwarding
- [napi-and-pubsub.md](napi-and-pubsub.md) — NAPI wrapper and pubsub adapter
- [storage.md](storage.md) — alknet-storage: metagraph, identity, ACL
- [flowgraph.md](flowgraph.md) — alknet-flowgraph: call graph, operation graph
- [secret-service.md](secret-service.md) — alknet-secret: BIP39, SLIP-0010, AES-GCM
- [Feasibility Assessment](../research/feasibility/ssh-tunnel-vpn-alternative-feasibility.md)
- [russh API](/workspace/russh) — SSH client/server library
- [Dispatch](/workspace/@alkdev/dispatch) — Reference implementation of russh port forwarding
- [iroh](/workspace/iroh) — P2P QUIC connections
- [tun2proxy](https://github.com/tun2proxy/tun2proxy) — Recommended external TUN-to-SOCKS5 tool
- [Production certbot setup](/workspace/system/dev1/certbot.md) — Let's Encrypt on our infrastructure - [irpc](/workspace/irpc) — iroh streaming RPC
- [Production fail2ban setup](/workspace/system/dev1/fail2ban.md) — fail2ban with nftables on our infrastructure - [Production certbot setup](../research/ops/certbot.md) — Let's Encrypt on our infrastructure
- [Production fail2ban setup](../research/ops/fail2ban.md) — fail2ban with nftables on our infrastructure


@@ -166,20 +166,16 @@ never leaves the secret service node.
## Open Questions
- **OQ-SVC-01**: Should the secret service support multiple seed phrases (one per - **OQ-SVC-01**: Should the secret service support multiple seed phrases (one
tenant)? See [open-questions.md](open-questions.md). per tenant)? See [open-questions.md](open-questions.md).
- **OQ-SVC-02**: Should service protocols use postcard (binary) or JSON for - **OQ-SVC-02**: Should service protocols use postcard (binary) or JSON for
remote calls? Postcard for irpc (Rust-to-Rust), JSON for call protocol remote calls? See [open-questions.md](open-questions.md).
(cross-language). See [open-questions.md](open-questions.md).
- **OQ-SVC-03**: How does the secret service integrate with the existing - **OQ-SVC-03**: How does the secret service integrate with the existing
`EncryptedDataSchema` from `@alkdev/storage`? The Rust implementation replaces `EncryptedDataSchema` from `@alkdev/storage`? See [open-questions.md](open-questions.md).
PBKDF2 password-based encryption with derived AES-256-GCM keys. The
`EncryptedData` format is a superset.
- **OQ-SVC-04**: Should workers cache derived keys locally? Yes, with a TTL - **OQ-SVC-04**: Should workers cache derived keys locally? See [open-questions.md](open-questions.md).
(default: 1 hour). The head can revoke by invalidating the session.
## Design Decisions


@@ -1,6 +1,6 @@
---
status: reviewed
last_updated: 2026-06-02 last_updated: 2026-06-07
---
# Server
@@ -51,21 +51,30 @@ The server is the tunnel endpoint. It receives SSH channels requesting TCP conne
### Authentication
The server supports Ed25519 public key authentication (default) and OpenSSH certificate authority authentication (ADR-012): The server authenticates connections through the `IdentityProvider` trait (ADR-029, [identity.md](identity.md)). `IdentityProvider` decouples the server from any specific identity storage — the server resolves an identity, it doesn't manage keys.
**Ed25519 public key** (default): **Phase 1 implementation**: `ConfigIdentityProvider` (in alknet-core) reads from `ArcSwap<DynamicConfig.auth>` (ADR-030). Every authorized key gets a default scope set. No database required. This is the default for CLI and single-node deployments.
1. Load authorized keys from a specified path or in-memory data
2. `auth_publickey()` checks the presented key against the authorized set
3. Uses constant-time comparison to prevent timing attacks
**OpenSSH certificate authority** (ADR-012): **Future implementation**: `StorageIdentityProvider` (in alknet-storage, not yet built) backed by SQLite `peer_credentials` and `api_keys` tables plus the ACL graph. The server doesn't need to know which implementation is active — it goes through the trait.
1. Load a trusted CA public key (`--cert-authority <path>`)
2. `auth_publickey()` validates the presented certificate: checks CA signature, expiry, and principal restrictions
3. Supports certificate options: `permit-port-forwarding`, `no-pty`, `source-address`
This enables multi-user deployments where adding one CA line to `authorized_keys` is simpler than managing individual keys for every user. The server supports two auth presentation paths (ADR-023, [auth.md](auth.md)):
**No password authentication over SSH.** Keys and certificates are sufficient and more secure. If a local SOCKS5 proxy needs its own auth layer, that's a separate concern. **SSH public key auth** (SSH transports):
1. `auth_publickey()` callback receives the presented key
2. Delegates to `IdentityProvider::resolve_from_fingerprint()` with the key fingerprint
3. Returns `Accept` (with `Identity` attached) or `Reject`
**Ed25519 + OpenSSH certificate authority** (ADR-012):
1. If no direct key match, validate the presented certificate against trusted cert-authorities
2. Check CA signature, expiry, and principal restrictions
3. Certificate options: `permit-port-forwarding`, `no-pty`, `source-address`
**Token auth** (non-SSH transports, WebTransport):
1. Extract token from URL path or `Authorization` header
2. Delegate to `IdentityProvider::resolve_from_token()`
3. Same verification: same authorized keys set, same `Identity` result (ADR-023)
**No password authentication over SSH channels.** Keys and certificates are sufficient and more secure. If a local SOCKS5 proxy needs its own auth layer, that's a separate concern.
### Key Material Format
@@ -87,7 +96,9 @@ When a client opens a `channel_open_direct_tcpip(host, port, originator_addr, or
**Reserved destination** — If `host` starts with `alknet-` (e.g., `alknet-control`), the server routes the channel internally instead of connecting to a TCP target. The primary reserved destination is `alknet-control:0`, which bridges the channel to the local pubsub event bus (ADR-018).
**Regular destination** — For all other targets: **Forwarding policy check** — Before the proxy task is spawned for any non-reserved destination, the server evaluates `ForwardingPolicy` against the authenticated `Identity` (ADR-031, [configuration.md](configuration.md)). The policy check uses `Identity.id` and `Identity.scopes` from the identity resolved during auth. If the policy denies the destination, the channel open is rejected — no TCP connection is attempted. The default policy (`ForwardingPolicy::allow_all()`) preserves current behavior.
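The policy check described here can be sketched as a first-match rule list with a default decision. Only `ForwardingPolicy::allow_all()` and the use of `Identity.id` / `Identity.scopes` come from the spec; the `Rule` fields, `with_rules`, `check`, and the `Decision` enum are assumptions made for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Reduced view of the authenticated identity (resources omitted).
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
}

/// Hypothetical rule shape: `None` fields match anything, `"*"` matches any host.
pub struct Rule {
    pub scope: Option<String>,
    pub host: String,
    pub port: Option<u16>,
    pub decision: Decision,
}

pub struct ForwardingPolicy {
    rules: Vec<Rule>,
    default: Decision,
}

impl ForwardingPolicy {
    /// Default-allow policy: no rules, every destination permitted.
    pub fn allow_all() -> Self {
        Self { rules: Vec::new(), default: Decision::Allow }
    }

    pub fn with_rules(rules: Vec<Rule>, default: Decision) -> Self {
        Self { rules, default }
    }

    /// First matching rule wins; otherwise fall back to the default.
    /// Intended to run before any outbound TCP connection is attempted.
    pub fn check(&self, identity: &Identity, host: &str, port: u16) -> Decision {
        for rule in &self.rules {
            let scope_ok = rule
                .scope
                .as_ref()
                .map_or(true, |s| identity.scopes.contains(s));
            let host_ok = rule.host == "*" || rule.host == host;
            let port_ok = rule.port.map_or(true, |p| p == port);
            if scope_ok && host_ok && port_ok {
                return rule.decision;
            }
        }
        self.default
    }
}
```

In a handler, a `Deny` result would translate into rejecting the channel open, so the proxy task is never spawned for that destination.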
**Regular destination** — For targets that pass the forwarding policy check:
1. **Connection** — connect to `host:port`, either directly or via the configured outbound proxy
2. **Outbound connection** — connect to the target, either directly or via the configured outbound proxy
@@ -122,17 +133,23 @@ This makes the server appear as an ordinary web server to port scanners and DPI
The server handler implements `russh::server::Handler` with two primary responsibilities:
**Authentication (`auth_publickey`)**:
- Check the presented key against the configured `authorized_keys` set (constant-time comparison) - Delegate to `IdentityProvider::resolve_from_fingerprint()` with the presented key fingerprint
- If no direct match, check whether the key is a certificate signed by a trusted cert-authority - If identity resolved, return `Accept` with the `Identity` attached to the session
- Validate certificate signature, expiry, and principal restrictions (e.g., `permit-port-forwarding`, `no-pty`, `source-address`) - If no identity, check certificate authority: validate CA signature, expiry, principals
- Return `Accept` or `Reject`
**Channel handling (`channel_open_direct_tcpip`)**:
- If the destination host starts with `alknet-`, route internally (control channel, ADR-018)
- Otherwise, connect to `host:port` (directly or via the configured outbound proxy) - Otherwise, evaluate `ForwardingPolicy` against the session's `Identity` (ADR-031)
- If denied, reject the channel open
- If allowed, connect to `host:port` (directly or via the configured outbound proxy)
- Spawn a bidirectional proxy task between the SSH channel and the outbound TCP stream
- Return the channel for data flow
### Interface Abstraction
SSH is one interface at Layer 2 in the three-layer model (ADR-026, [interface.md](interface.md)). The current `ServerHandler` will be refactored into `SshInterface` — it manages SSH session concerns (handshake, auth delegation, channel multiplexing). Forwarding policy, operation routing, and call protocol handling are Layer 3 concerns that live outside the interface. This refactoring is the most invasive code change in Phase 1 (integration-plan, Phase 1.8).
### Logging and Rate Limiting
**Logging** (for fail2ban integration on Linux):
@@ -159,6 +176,25 @@ These provide abuse protection on platforms without fail2ban (macOS, Windows, BS
### CLI Interface ### CLI Interface
Configuration sources (in priority order): CLI flags, environment variables, optional `--config` TOML file (ADR-030). The TOML config file is a convenience input for reproducible deployments; it does not replace `ServeOptions` (ADR-011).
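The priority order can be illustrated with a tiny resolver. This is a sketch under assumptions: `resolve_setting` is a hypothetical helper, and the actual merge into `ServeOptions` may be structured differently.

```rust
/// Resolve one setting using the documented precedence:
/// CLI flag first, then environment variable, then TOML file value, then a default.
fn resolve_setting(
    cli: Option<&str>,
    env: Option<&str>,
    file: Option<&str>,
    default: &str,
) -> String {
    cli.or(env).or(file).unwrap_or(default).to_string()
}
```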
Multi-transport listeners use `[[listeners]]` in the TOML config (ADR-030):
```toml
[[listeners]]
transport = "tls"
listen = "0.0.0.0:443"
[listeners.tls]
cert = "/etc/alknet/tls/cert.pem"
key = "/etc/alknet/tls/key.pem"
[[listeners]]
transport = "iroh"
```
Currently, the server binds to a single transport at a time. Multi-transport via `[[listeners]]` is coming per ADR-030.
```bash ```bash
# Basic server (SSH on port 22) # Basic server (SSH on port 22)
alknet serve --key ~/.ssh/ssh_host_ed25519_key alknet serve --key ~/.ssh/ssh_host_ed25519_key
@@ -230,7 +266,9 @@ No listening port is needed. The server connects outbound to the iroh relay (def
- The server does not log tunnel destinations (ADR-006). Auth events and connection events are logged for fail2ban integration (ADR-013). - The server does not log tunnel destinations (ADR-006). Auth events and connection events are logged for fail2ban integration (ADR-013).
- Destination strings beginning with `alknet-` are reserved for internal use (ADR-018). The server must not attempt TCP connections to `alknet-*` destinations — these are intercepted for control channel routing. - Destination strings beginning with `alknet-` are reserved for internal use (ADR-018). The server must not attempt TCP connections to `alknet-*` destinations — these are intercepted for control channel routing.
- One `ServerHandler` instance per connection. Handler state is not shared between connections (unless explicitly configured via `Arc` shared state for things like connection limits). - One `ServerHandler` instance per connection. Handler state is not shared between connections (unless explicitly configured via `Arc` shared state for things like connection limits).
- The server binds to a single transport at a time. Running multiple transports (e.g., TCP + iroh) simultaneously requires separate processes or a future multiplexing feature. - The server currently binds to a single transport at a time. Multi-transport via `[[listeners]]` is coming per ADR-030.
- Forwarding policy is evaluated before every channel proxy spawn. Denied channels are rejected immediately (ADR-031).
- Auth resolves through `IdentityProvider` (ADR-029). Phase 1 uses `ConfigIdentityProvider` backed by `ArcSwap<DynamicConfig>` (ADR-030). `StorageIdentityProvider` (Phase 2+) replaces it for production deployments with SQLite.
- ACME support requires the `acme` feature flag. Without it, only manual TLS certs are supported. - ACME support requires the `acme` feature flag. Without it, only manual TLS certs are supported.
- No password authentication over SSH channels. Key-based and cert-authority only (ADR-012). - No password authentication over SSH channels. Key-based and cert-authority only (ADR-012).
- Stealth mode (`--stealth`) requires TLS transport. It has no effect on TCP or iroh transports (ADR-017). - Stealth mode (`--stealth`) requires TLS transport. It has no effect on TCP or iroh transports (ADR-017).
@@ -273,3 +311,15 @@ None — all resolved.
| [017](decisions/017-stealth-mode-protocol-multiplexing.md) | Stealth mode | Protocol multiplexing on port 443 | | [017](decisions/017-stealth-mode-protocol-multiplexing.md) | Stealth mode | Protocol multiplexing on port 443 |
| [018](decisions/018-control-channel-for-pubsub.md) | Control channel | Reserved `alknet-control` destination for pubsub | | [018](decisions/018-control-channel-for-pubsub.md) | Control channel | Reserved `alknet-control` destination for pubsub |
| [019](decisions/019-proxy-dual-semantics.md) | Proxy dual semantics | `--proxy` routes transport on client, data on server | | [019](decisions/019-proxy-dual-semantics.md) | Proxy dual semantics | `--proxy` routes transport on client, data on server |
| [026](decisions/026-transport-interface-separation.md) | Three-layer model | SSH is Layer 2 interface, ServerHandler → SshInterface |
| [028](decisions/028-auth-irpc-service.md) | Auth as irpc service | IdentityProvider is the contract; irpc service is one backend |
| [029](decisions/029-identity-core-type.md) | Identity as core type | IdentityProvider trait in alknet-core |
| [030](decisions/030-static-dynamic-config-split.md) | Static/dynamic config split | ArcSwap for dynamic config, ConfigReloadHandle |
| [031](decisions/031-forwarding-policy.md) | Forwarding policy | Evaluated before channel proxy spawn |
## References
- [configuration.md](configuration.md) — DynamicConfig, ForwardingPolicy, ConfigReloadHandle
- [identity.md](identity.md) — IdentityProvider trait, Identity struct
- [auth.md](auth.md) — Unified auth, AuthPolicy, token auth
- [interface.md](interface.md) — Interface trait, SshInterface, three-layer model

View File

@@ -20,8 +20,8 @@ last_updated: 2026-06-07
The irpc service layer decomposes alknet's core responsibilities into The irpc service layer decomposes alknet's core responsibilities into
independently testable, deployable, and replaceable components. Auth, Secret, independently testable, deployable, and replaceable components. Auth, Secret,
Config, and Storage are irpc protocol enums that work both as in-process async Config, and Storage are irpc protocol enums that work both as in-process async
boundaries (tokio channels) and cross-process/cross-network (QUIC streams via boundaries (tokio channels) and cross-process/cross-network (irpc over iroh
noq). OperationEnv is the universal composition mechanism that unifies local QUIC streams). OperationEnv is the universal composition mechanism that unifies local
dispatch, irpc service dispatch, and remote call protocol dispatch. dispatch, irpc service dispatch, and remote call protocol dispatch.
## Why ## Why
@@ -209,13 +209,10 @@ layer to be built — they are Phase 2+ concerns.
## Open Questions ## Open Questions
- **OQ-SVC-01**: Should the secret service support multiple seed phrases (one - **OQ-SVC-01**: Should the secret service support multiple seed phrases (one
per tenant)? Defer for now — one seed per node. Multi-seed can be added per tenant)? See [open-questions.md](open-questions.md).
later by indexing the `Unlock` call with a tenant ID.
- **OQ-SVC-02**: Should service protocols use postcard (binary) or JSON for - **OQ-SVC-02**: Should service protocols use postcard (binary) or JSON for
remote calls? Postcard for irpc (Rust-to-Rust, efficient). JSON for call remote calls? See [open-questions.md](open-questions.md).
protocol (cross-language, universal). The irpc remote path naturally uses
postcard.
## Design Decisions ## Design Decisions

View File

@@ -197,17 +197,12 @@ dependency.
## Open Questions ## Open Questions
- **OQ-SVC-03**: How does the secret service integrate with the existing - **OQ-SVC-03**: How does the secret service integrate with the existing
`EncryptedDataSchema` from `@alkdev/storage`? The Rust implementation replaces `EncryptedDataSchema` from `@alkdev/storage`? See [open-questions.md](open-questions.md).
PBKDF2 password-based encryption with derived AES-256-GCM keys. The
`EncryptedData` format is a superset — old format can be migrated by
re-encrypting with the new key.
- **OQ-SVC-04**: Should workers cache derived keys locally? Yes, with a TTL - **OQ-SVC-04**: Should workers cache derived keys locally? See [open-questions.md](open-questions.md).
(default: 1 hour). The head can revoke by invalidating the session.
- **OQ-SVC-05**: How does the smart contract (NFT-based ACL) interact with the - **OQ-SVC-05**: How does the NFT-based ACL smart contract interact with the
secret service? The Ethereum signing key (`m/44'/60'/0'/0/0`) is derived from secret service? See [open-questions.md](open-questions.md).
the same seed. The smart contract is a separate concern.
## Design Decisions ## Design Decisions

View File

@@ -0,0 +1,91 @@
Here is an article tailored specifically to untangle these concepts. It is structured not just as a conceptual guide, but as a **diagnostic tool**—perfect for feeding into an AI coding CLI to sniff out architectural smells and "spaghetti concepts" in a codebase.
***
# Deconstructing Event-Driven Architecture: Untangling "Spaghetti Concepts"
In modern software architecture, the term "Event" has fallen victim to *semantic diffusion*—a concept popularized by Martin Fowler where a term becomes so widely used that it loses its original, specific meaning. When developers use the same word to describe state persistence, data distribution, and asynchronous notifications, the result is "Spaghetti Concepts."
Just like spaghetti code, spaghetti concepts lead to tight coupling, brittle systems, and unpredictable side effects. To fix an Event-Driven Architecture (EDA), we must draw hard boundaries around what an "event" is actually doing in any given context.
This guide breaks down the distinct types of events, their proper use cases, and the structural anti-patterns (Conflation Points) that occur when they are mixed up.
---
## 1. Event Sourcing (State Persistence)
**The Concept:** Event Sourcing is a method of persisting state. Instead of saving the *current* state of an entity (e.g., `Quantity: 27`) in a database row, you save the *history of facts* that led to that state (e.g., `Received 30`, `Shipped 5`, `Adjusted +2`). The current state is derived by replaying these facts.
**The Golden Rule:** Event Sourcing is an **internal implementation detail** of a specific service or Aggregate. It is highly specific to the domain logic.
**How to Identify It:**
* Uses a specialized stream database (like EventStoreDB).
* Events are named in the past tense representing highly specific domain actions (`InventoryAdjusted`, `OrderPlaced`).
* The system reads a stream of these events to reconstruct an object in memory before applying new business rules.
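The inventory example above (`Received 30`, `Shipped 5`, `Adjusted +2`) can be sketched as a fold over the event history. The type and function names are illustrative assumptions, not taken from any particular event store.

```rust
/// Past-tense domain facts for a single inventory item.
#[derive(Debug, Clone, Copy)]
enum InventoryEvent {
    Received(u32),
    Shipped(u32),
    Adjusted(i32),
}

/// Derive the current quantity by replaying every recorded fact in order.
fn replay(events: &[InventoryEvent]) -> i64 {
    events.iter().fold(0i64, |qty, event| match event {
        InventoryEvent::Received(n) => qty + i64::from(*n),
        InventoryEvent::Shipped(n) => qty - i64::from(*n),
        InventoryEvent::Adjusted(delta) => qty + i64::from(*delta),
    })
}
```

Replaying `[Received(30), Shipped(5), Adjusted(2)]` yields 27, matching the `Quantity: 27` row a state-based design would have stored directly.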
### 🚨 Conflation Point: Leaking the Event Store (The Database Reach-In)
**The Smell:** Service B connects directly to Service A's event store to read its events and react to them.
**Why it's bad:** Because Event Sourcing events are internal state, exposing them externally completely shatters Service A's encapsulation. If Service A refactors how it calculates inventory, Service B breaks.
**The Fix:** Service A should project its internal Event Sourcing events into generalized **Integration Events** (see below) and publish those to a message broker (like RabbitMQ or Kafka) for Service B to consume.
---
## 2. Event-Carried State Transfer (Data Distribution)
**The Concept:** Also known as "Fat Events," this pattern is used to distribute data across services to avoid synchronous API calls (temporal coupling). If Service B needs to know about a Product's price to calculate a shopping cart total, Service A publishes an event containing the *entire* current state of that product. Service B listens to this event and builds a local, read-only cache (a projection).
**The Golden Rule:** These events exist to answer the question, *"What does the data look like now?"* without requiring a synchronous HTTP callback.
**How to Identify It:**
* Events often have generic CRUD-like names (`ProductUpdated`, `CustomerCreated`).
* Payloads are "fat"—they contain a lot of data (ID, Name, Price, Category, etc.).
* Often implemented using Change Data Capture (CDC) tools like Debezium reading from a primary database and publishing to Kafka.
### 🚨 Conflation Point: Event Sourcing vs. State Transfer
**The Smell:** Using a state transfer tool (like Debezium publishing `RowUpdated` events) as a makeshift Event Sourcing log to derive business logic.
**Why it's bad:** A database row update doesn't tell you *why* the data changed. Was a user's address updated because they moved, or because there was a typo? Business intent is lost.
**The Fix:** Keep CDC and state transfer events strictly for updating local read-caches in downstream services. Do not use them to drive complex business workflows that rely on "intent."
---
## 3. Notification Events (Behavioral Triggers)
**The Concept:** Also known as "Thin Events," these are lean messages broadcasted to notify the system that a business milestone has occurred. They usually contain minimal data—often just an Entity ID and an action.
**The Golden Rule:** They act as an asynchronous "tap on the shoulder" to tell downstream services to trigger their own workflows (Choreography).
**How to Identify It:**
* Payloads are "thin" (e.g., `{ "Event": "OrderShipped", "OrderId": "123" }`).
* Used heavily in integrations (e.g., triggering an email via AWS SES, or notifying a shipping warehouse).
### 🚨 Conflation Point: The Synchronous Callback Trap (Boomerang Coupling)
**The Smell:** Service A publishes a thin `OrderPlaced` event. Service B receives it, but to do its job, it must immediately make a synchronous HTTP REST call back to Service A to fetch the order details.
**Why it's bad:** If Service A goes down, Service B fails. You have successfully implemented Event-Driven Architecture, but kept the exact synchronous temporal coupling you were trying to eliminate. Furthermore, a flood of events can turn into a self-inflicted DDoS on your own service.
**The Fix:** If downstream services *always* need the data to process the event, upgrade the Notification Event to an Event-Carried State Transfer ("Fat Event") by including the required data in the payload.
---
## 4. Domain Events vs. Integration Events (The Boundary Rule)
*Own Insight / DDD Integration*
A massive source of spaghetti concepts is failing to differentiate between events meant for *inside* the house and events meant for *outside* the house.
* **Domain Events:** Fired and consumed *within the same service boundary*. They can contain rich, complex, internal domain models because the producer and consumer share the same codebase/ubiquitous language.
* **Integration Events:** Fired *across service boundaries*. They should be simple, generalized, and stripped of internal jargon or complex objects.
### 🚨 Conflation Point: The Leaky Domain Model
**The Smell:** A microservice publishes an event to a global Kafka topic, and the payload contains internal database IDs, complex nested objects, or serialized language-specific data types (like Java/C# specific enums).
**Why it's bad:** Downstream services are now strictly coupled to the internal data structure of the upstream service.
**The Fix:** Implement an Anti-Corruption Layer. The producing service should catch its own Domain Event, map the data to a simplified, standardized, versioned Integration Event, and publish *that* to the wider system.
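A minimal sketch of that mapping step, assuming hypothetical `OrderPlaced` (internal) and `OrderPlacedV1` (published) types: the internal row ID and the internal enum never leave the service.

```rust
/// Internal enum that downstream services should never see.
#[allow(dead_code)]
enum CustomerTier {
    Standard,
    Gold,
}

/// Domain event: rich, internal, free to change with the codebase.
#[allow(dead_code)]
struct OrderPlaced {
    internal_row_id: u64,
    public_order_id: String,
    customer_tier: CustomerTier,
    total_cents: u64,
}

/// Integration event: flat, versioned, stripped of internal details.
#[derive(Debug, PartialEq)]
struct OrderPlacedV1 {
    schema_version: u32,
    order_id: String,
    total_cents: u64,
}

/// Anti-Corruption Layer: translate the domain event before publishing.
fn to_integration_event(event: &OrderPlaced) -> OrderPlacedV1 {
    OrderPlacedV1 {
        schema_version: 1,
        order_id: event.public_order_id.clone(),
        total_cents: event.total_cents,
    }
}
```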
---
## Code Review CLI Prompt: "The Conflation Detector"
*(Note: Feed the following heuristics to your AI CLI alongside this article to review your codebase).*
**AI CLI Instructions:** Scan the provided codebase for Event-Driven Architecture anti-patterns. Flag code that violates the conceptual boundaries described in the article. Look specifically for:
1. **Shared Event Stores:** Are multiple distinct microservices connecting to the same EventStoreDB or reading the exact same raw Event Sourcing stream?
2. **Boomerang Callbacks:** Is an event consumer receiving a message from a broker (RabbitMQ/Kafka/Azure Service Bus), extracting an ID, and immediately making an HTTP request to the service that originated the event?
3. **Leaky Domain Models:** Are internal entity objects (e.g., classes mapped directly to ORMs like Entity Framework or Hibernate) being serialized directly into event payloads sent to external message brokers?
4. **Misused CDC:** Are Debezium/database-trigger events being used to trigger business logic workflows, rather than simply updating read-models/caches?
5. **Fat Notification Trap:** Are Notification events carrying massive payloads just to trigger an email, when a thin event would suffice? Or conversely, are thin events starving consumers of necessary data?

View File

@@ -0,0 +1,773 @@
# SSH Tunnel VPN Alternative — Feasibility Assessment
**Date**: 2026-06-01
**Status**: Feasibility assessment / architecture sketch
**Updated**: 2026-06-01 — Added iroh transport analysis (§11)
## 1. Problem Statement
Countries in the "developed west" (UK, CA, etc.) are increasingly banning or restricting VPNs at the protocol level. The valid use case of a VPN — a *virtual private network* for securing traffic on hostile networks, accessing private infrastructure, and tunneling between trusted endpoints — gets caught in the crossfire when VPNs are treated primarily as location-spoofing tools.
SSH-based tunnels cover the same functional ground without being a VPN protocol. Blocking SSH would break the internet in critical ways (infrastructure management, CI/CD, development workflows). The goal is to build a dead-simple, self-hostable Rust client/server that provides VPN-like functionality over SSH, with optional TLS wrapping for traffic obfuscation.
## 2. Reference Codebase Analysis
### 2.1 Dispatch (`/workspace/@alkdev/dispatch`)
Dispatch proves russh usage well within scope. Key takeaways:
- **Pure SSH client** — `client::Handler` is a zero-sized type, auto-accepts server keys. Minimal boilerplate.
- **Arc-wrapped Handle pattern** — `Arc<client::Handle<Client>>` enables sharing across concurrent tasks (port forwarding, SFTP, exec).
- **Port forwarding via `channel_open_direct_tcpip`** — Already implemented. Local TCP listener → `direct-tcpip` SSH channel → `tokio::io::copy_bidirectional`. This is the standard SSH `-L` pattern, implemented programmatically.
- **Channel-per-operation model** — Each operation opens its own SSH channel on a shared session. Multiplexing is handled by russh internally.
- **Channel.into_stream()** — Converts SSH channels to `AsyncRead + AsyncWrite` streams, enabling use with any tokio I/O combinator.
The dispatch codebase is clean and demonstrates that the core SSH mechanics are straightforward. The new project would need both client **and** server sides, but russh's server API mirrors the client API closely.
### 2.2 russh (`/workspace/russh`)
Critical capabilities confirmed:
| Feature | API | Status |
|---------|-----|--------|
| Local port forwarding (client → server → remote) | `Handle::channel_open_direct_tcpip()` | Available, no feature flag |
| Remote port forwarding (server listens, client gets channels) | `Handle::tcpip_forward()` / Handler callback `server_channel_open_forwarded_tcpip()` | Available, no feature flag |
| Unix socket forwarding | `Handle::channel_open_direct_streamlocal()` / `Handle::streamlocal_forward()` | Available, no feature flag |
| Server-side reverse forwarding | `server::Handler::tcpip_forward()` / `server::Handle::forward_tcpip()` | Available, no feature flag |
| Arbitrary stream transport | `client::connect_stream()` / `server::run_stream()` | **Both accept `AsyncRead+AsyncWrite+Unpin+Send`** |
| Channel as bidirectional stream | `Channel::into_stream()` / `split()` | Available |
**The `connect_stream()` and `run_stream()` APIs are the key enabler for TLS wrapping.** They accept any async byte stream, meaning we can layer TLS (via `tokio-rustls`) underneath russh without modifying russh itself. The SSH session runs over a TLS stream, which looks like HTTPS to DPI.
## 3. Architecture Sketch
### 3.1 Components
```
┌─────────────────────────────────┐ ┌─────────────────────────────────┐
│ CLIENT │ │ SERVER │
│ │ │ │
│ ┌──────────┐ ┌───────────┐ │ │ ┌───────────┐ ┌──────────┐ │
│ │ TUN │ │ SSH │ │ SSH │ │ SSH │ │ Proxy │ │
│ │ Interface│───▶│ Client │──┼─ over ──▶│ Server │───▶│ Handler │ │
│ │ (tun-rs)│◀───│ (russh) │ │ TLS │ (russh) │◀───│ │ │
│ └──────────┘ └─────┬─────┘ │ opt. │ └─────┬─────┘ └────┬─────┘ │
│ │ │ │ │ │ │
│ ┌─────▼─────┐ │ │ ┌─────▼─────┐ ┌────▼─────┐ │
│ │ TLS Layer │ │ │ │ TLS Layer │ │ Outbound │ │
│ │(tokio- │ │ │ │(tokio- │ │ Proxy │ │
│ │ rustls) │ │ │ │ rustls) │ │(SOCKS5/ │ │
│ └─────┬─────┘ │ │ └─────┬─────┘ │ HTTP) │ │
│ │ │ │ │ └────┬─────┘ │
│ ┌─────▼─────┐ │ │ ┌─────▼─────┐ │ │
│ │ TCP │ │ │ │ TCP │ ┌────▼─────┐ │
│ │ Connect │◀─┼────────▶│ │ Listener │ │ Direct │ │
│ └───────────┘ │ │ └───────────┘ │ Forward │ │
│ │ │ └────┬─────┘ │
└─────────────────────────────────┘ └─────────────────────────────────┘
│ │
Proxy Mode Direct Mode
(outbound via (outbound
SOCKS5/HTTP) direct TCP)
```
### 3.2 Data Flow — Client TUN Mode
1. **TUN interface** (created via `tun-rs`) captures IP packets from the OS routing table
2. **Client reads IP packets** from the TUN device, determines destination IP:port
3. **Client opens `direct-tcpip` SSH channel** to destination via `handle.channel_open_direct_tcpip(dest_ip, dest_port, ...)`
4. **Client writes packet payload** to the SSH channel, reads response
5. **Client writes response** back to TUN interface
This is essentially what tun2proxy does, except instead of SOCKS5 upstream, it's an SSH channel.
### 3.3 Data Flow — TLS Obfuscation Mode
When `--tls` or `--https` is specified:
1. **Client establishes TLS connection** to `server:443`, producing a `tokio_rustls::client::TlsStream`
2. **SSH session runs over the TLS stream** via `client::connect_stream(Arc::new(config), tls_stream, handler)`
3. **Server accepts TLS connection**, then runs `server::run_stream(server_config, tls_stream, handler)`
4. **To DPI, the traffic looks like HTTPS** — standard TLS handshake, then encrypted application data
5. Optional: Server can present a legitimate-looking certificate and serve a fake nginx 404 to non-SSH probes (similar to https_proxy's stealth approach)
### 3.4 Data Flow — Server-Side Proxy Mode
When `--proxy` is specified on the server:
1. Client requests `channel_open_direct_tcpip(target_host, target_port, ...)`
2. Server's `channel_open_direct_tcpip` handler checks ACLs
3. Instead of connecting directly, server routes through a local SOCKS5/HTTP proxy
4. This provides an additional hop for privacy — the SSH server's IP isn't exposed to the destination
### 3.5 CLI Interface Sketch
```bash
# Server — simplest mode (SSH only, port 22)
ghost serve --key /etc/ssh/ssh_host_ed25519_key
# Server — with TLS on port 443
ghost serve --key /etc/ssh/ssh_host_ed25519_key --tls --tls-cert /etc/ssl/cert.pem --tls-key /etc/ssl/key.pem
# Server — with TLS + outbound proxy
ghost serve --key /etc/ssh/ssh_host_ed25519_key --tls --tls-cert /etc/ssl/cert.pem --tls-key /etc/ssl/key.pem --proxy socks5://127.0.0.1:9050
# Client — TUN mode (routes all traffic through SSH tunnel)
ghost connect --server example.com:443 --tls --identity ~/.ssh/id_ed25519 --tun
# Client — Single port forward (like SSH -L)
ghost connect --server example.com:443 --tls --identity ~/.ssh/id_ed25519 --forward 5432:db.internal:5432
# Client — SOCKS5 proxy mode (local SOCKS5 that tunnels through SSH)
ghost connect --server example.com:443 --tls --identity ~/.ssh/id_ed25519 --socks5 1080
```
**Working name: `ghost`** (as in "ghost in the shell" — it's SSH, it's stealthy, it passes through walls). Or `shade`, `wraith`, `spectre`. Pick anything.
## 4. Key Technical Decisions & Unknowns Analysis
### 4.1 TUN Interface — SOLVED
**Library: `tun-rs` (v2, formerly `tun` crate)**
- Supports Linux, macOS, Windows (via wintun.dll), FreeBSD, OpenBSD, NetBSD, Android, iOS
- Async API with `tokio` feature: `DeviceBuilder::new().build_async()`
- Clean `recv()` / `send()` API — read IP packets, write IP packets
- Already used in production by tun2proxy and similar projects
- Supports hardware offload (TSO/GSO) on Linux for performance
- No `CAP_NET_ADMIN` needed on some platforms when using `--unshare` namespace approach (tun2proxy pattern)
**This is a solved problem.** The `tun-rs` crate is mature, cross-platform, and async-native with tokio. The implementation is straightforward:
```rust
let dev = DeviceBuilder::new()
    .ipv4("10.0.0.1", 24, None)
    .mtu(1400)
    .build_async()?;

let mut buf = vec![0u8; 65536];
loop {
    let len = dev.recv(&mut buf).await?;
    // Parse IP header, determine destination
    // Open SSH channel to destination
    // Write response back to TUN
}
```
**Key consideration**: On Linux requires `CAP_NET_ADMIN` or root. The tun2proxy approach of using network namespaces (`--unshare`) is worth adopting for unprivileged operation.
### 4.2 SSH over TLS — SOLVED (architecturally)
**Approach: Layer TLS beneath SSH using russh's `connect_stream` / `run_stream`**
This is the critical insight. russh already decouples transport from protocol:
- `client::connect_stream(config, stream, handler)` — accepts any `AsyncRead + AsyncWrite + Unpin + Send`
- `server::run_stream(config, stream, handler)` — same for server
This means:
```rust
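// Client side: TCP connect, TLS handshake, then SSH over the TLS stream.
// `tls_connector` is a `tokio_rustls::TlsConnector`; `server_domain` is a rustls `ServerName`.
let tcp_stream = TcpStream::connect((server_addr, server_port)).await?;
let tls_stream = tls_connector.connect(server_domain, tcp_stream).await?;
let handle = client::connect_stream(config, tls_stream, handler).await?;

// Server side: accept TCP, complete the TLS handshake, then run SSH over it.
// `tls_acceptor` is a `tokio_rustls::TlsAcceptor`.
let (tcp_stream, addr) = tcp_listener.accept().await?;
let tls_stream = tls_acceptor.accept(tcp_stream).await?;
server::run_stream(config, tls_stream, handler).await?;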
```
**No modification to russh is needed.** This is a clean layering.
**For HTTPS stealth**: The server can:
1. Accept connections on port 443
2. Present a valid TLS certificate (self-signed or Let's Encrypt via ACME)
3. Non-SSH clients making HTTP requests get a normal-looking 404 response
4. SSH clients speak SSH protocol directly after TLS handshake
5. DPI sees standard HTTPS traffic since the TLS handshake is normal
The https_proxy project demonstrates this pattern well — stealth proxy returning fake nginx 404s to probes.
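One way to tell SSH clients from HTTP probes after the TLS handshake is to peek at the first decrypted bytes. The sketch below is an assumption about how such a check could look, not the https_proxy project's code; it requires the server to wait for the client to speak first, and the `Probe` enum and `classify` function are hypothetical names.

```rust
/// What the first decrypted bytes of a connection look like.
#[derive(Debug, PartialEq)]
enum Probe {
    Ssh,
    Http,
    Unknown,
}

/// Classify a connection from its first bytes.
/// SSH clients open with an identification string starting with "SSH-2.0-";
/// HTTP probes open with a request method followed by a space.
fn classify(first_bytes: &[u8]) -> Probe {
    const HTTP_METHODS: [&[u8]; 5] = [b"GET ", b"POST ", b"HEAD ", b"PUT ", b"OPTIONS "];
    if first_bytes.starts_with(b"SSH-2.0-") {
        Probe::Ssh
    } else if HTTP_METHODS.iter().any(|method| first_bytes.starts_with(method)) {
        Probe::Http
    } else {
        Probe::Unknown
    }
}
```

A server could hand `Probe::Ssh` connections to `server::run_stream`, answer `Probe::Http` with the decoy 404, and close or stall anything else.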
### 4.3 IP Packet Handling — NEEDS DESIGN
When using TUN mode, we're receiving raw IP packets. We need to:
1. **Parse IP headers** to determine destination IP and port
2. **Track connection state** — map `(src_ip, src_port, dst_ip, dst_port)` to SSH channels
3. **TCP reassembly** — handle segmentation, retransmission, etc.
4. **ICMP handling** — respond to pings, handle unreachable destinations
5. **DNS interception** — handle DNS queries that arrive at the TUN interface
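Steps 1 and 2 can be sketched for the simplest case, an IPv4 packet carrying TCP. This is an illustrative parser only: it ignores IPv6, fragmentation, and checksums, and `TcpFlow` and `parse_tcp_flow` are hypothetical names.

```rust
use std::net::Ipv4Addr;

/// The 4-tuple used to map a TCP flow to an SSH channel.
#[derive(Debug, PartialEq, Eq, Hash, Clone, Copy)]
struct TcpFlow {
    src: Ipv4Addr,
    src_port: u16,
    dst: Ipv4Addr,
    dst_port: u16,
}

/// Extract the flow 4-tuple from a raw IPv4 packet, or `None` if the packet
/// is not IPv4/TCP or is too short to contain the TCP port fields.
fn parse_tcp_flow(packet: &[u8]) -> Option<TcpFlow> {
    let first = *packet.first()?;
    if first >> 4 != 4 {
        return None; // not IPv4
    }
    // IHL is the header length in 32-bit words; the minimum legal header is 20 bytes.
    let ihl = usize::from(first & 0x0f) * 4;
    if ihl < 20 || packet.len() < ihl + 4 {
        return None;
    }
    if packet[9] != 6 {
        return None; // protocol field: 6 = TCP
    }
    Some(TcpFlow {
        src: Ipv4Addr::new(packet[12], packet[13], packet[14], packet[15]),
        src_port: u16::from_be_bytes([packet[ihl], packet[ihl + 1]]),
        dst: Ipv4Addr::new(packet[16], packet[17], packet[18], packet[19]),
        dst_port: u16::from_be_bytes([packet[ihl + 2], packet[ihl + 3]]),
    })
}
```

Because `TcpFlow` derives `Hash` and `Eq`, it can serve directly as the key of a `HashMap` from flows to SSH channels.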
This is the most complex part. Options:
**Option A: Use a userspace TCP/IP stack (smoltcp)**
- Parse packets, but let a userspace stack handle TCP
- Heavier dependency, but proven approach (what tun2proxy does with its own stack)
- `smoltcp` is well-maintained, used in embedded and networking projects
**Option B: Raw packet forwarding with NAT**
- Simpler conceptually — just NAT the packets, forward them through the SSH channel
- Requires handling TCP state at the IP level (seq/ack manipulation, checksum recalculation)
- More error-prone
**Option C: SOCKS5 proxy mode only (no TUN)**
- Simplest to implement — just a local SOCKS5 server that forwards through SSH
- Browsers, curl, and most apps can use SOCKS5
- No root/CAP_NET_ADMIN needed
- But: doesn't capture all traffic (UDP, DNS leaks, etc.)
**Recommendation**: Start with Option C (SOCKS5 proxy mode) as the minimal viable product. Add TUN mode (Option A with smoltcp) as an advanced feature. This matches how tun2proxy structures their project and is the pragmatic path.
### 4.4 SSH Server Authentication — STRAIGHTFORWARD
The server implementation needs:
- **Public key authentication** — primary method, matching standard SSH practices
- **`authorized_keys` file support** — read `~/.ssh/authorized_keys` or a custom path
- **Optional password authentication** — for convenience, but not recommended for production
russh's `server::Handler` trait provides `auth_publickey` and `auth_password` callbacks. Implementation is trivial:
```rust
async fn auth_publickey(&mut self, user: &str, public_key: &PublicKey) -> Result<Auth, Self::Error> {
    // Accept only keys present in the configured authorized_keys set
    if self.authorized_keys.iter().any(|k| k == public_key) {
        Ok(Auth::Accept)
    } else {
        Ok(Auth::Reject { proceed_with_methods: None, partial_success: false })
    }
}
```
### 4.5 DNS Handling — DESIGN DECISION NEEDED
In TUN mode, DNS queries need to be routed through the tunnel. Options:
1. **Virtual DNS** (tun2proxy approach) — intercept DNS packets, map query names to fake IPs from a reserved range (198.18.0.0/15), resolve via the SSH tunnel
2. **DNS-over-TCP** — Force DNS through the SSH tunnel
3. **Direct DNS** — Don't handle DNS in the tunnel, rely on system resolver
4. **SOCKS5 mode** — SOCKS5 supports DOMAIN names natively (SOCKS5h), so DNS resolution happens server-side
**Recommendation**: SOCKS5 mode handles DNS naturally via SOCKS5h. For TUN mode, adopt the virtual DNS approach from tun2proxy (their `ip-stack` crate handles this).
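The virtual DNS idea can be sketched as a two-way table over the 198.18.0.0/15 range. This is a simplified illustration, not tun2proxy's implementation: `VirtualDns` is a hypothetical type, and it never expires or recycles entries.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Maps queried names to fake IPs from 198.18.0.0/15 and back again.
struct VirtualDns {
    next_offset: u32,
    by_name: HashMap<String, Ipv4Addr>,
    by_ip: HashMap<Ipv4Addr, String>,
}

impl VirtualDns {
    const BASE: u32 = u32::from_be_bytes([198, 18, 0, 0]);
    const SIZE: u32 = 1 << 17; // a /15 holds 2^17 addresses

    fn new() -> Self {
        // Start at offset 1 so the network address itself is never handed out.
        VirtualDns { next_offset: 1, by_name: HashMap::new(), by_ip: HashMap::new() }
    }

    /// Return the fake IP for `name`, allocating one on first use.
    /// Returns `None` once the range is exhausted.
    fn assign(&mut self, name: &str) -> Option<Ipv4Addr> {
        if let Some(ip) = self.by_name.get(name) {
            return Some(*ip);
        }
        if self.next_offset >= Self::SIZE {
            return None;
        }
        let ip = Ipv4Addr::from(Self::BASE + self.next_offset);
        self.next_offset += 1;
        self.by_name.insert(name.to_string(), ip);
        self.by_ip.insert(ip, name.to_string());
        Some(ip)
    }

    /// Reverse lookup used when a TCP connection to a fake IP arrives on the TUN device.
    fn name_for(&self, ip: Ipv4Addr) -> Option<&str> {
        self.by_ip.get(&ip).map(String::as_str)
    }
}
```

When a connection to a fake IP shows up, the client would look up the original name and pass it as the host in the `direct-tcpip` request, so resolution happens on the server side.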
### 4.6 Connection Multiplexing — ALREADY SOLVED
russh multiplexes channels over a single SSH connection. No need to manage multiple TCP connections per tunnel. One SSH connection, many channels. This is exactly what we want.
### 4.7 Keep-Alive and Reconnection — NEEDS DESIGN
- **SSH keepalive**: russh `Config` has `keepalive_interval` and `keepalive_max`
- **Auto-reconnect**: Client should detect disconnection (`is_closed()`) and reconnect with exponential backoff
- **TUN continuity**: When SSH reconnects, existing TCP connections through the tunnel will fail, but new ones will work. This is acceptable behavior (same as any VPN).
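The reconnect delay can be sketched as a capped exponential. `backoff_delay` is a hypothetical helper; a production client would likely add random jitter so that many clients do not reconnect in lockstep.

```rust
use std::time::Duration;

/// Delay before reconnect attempt number `attempt` (0-based):
/// `base * 2^attempt`, never exceeding `cap`.
fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // saturating_pow avoids overflow for large attempt counts.
    let factor = 2u32.saturating_pow(attempt);
    base.checked_mul(factor).map_or(cap, |delay| delay.min(cap))
}
```

With a 1 second base and a 60 second cap, the delays run 1 s, 2 s, 4 s, 8 s and so on until they reach the cap.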
### 4.8 Server-Side Proxy (Outbound) — STRAIGHTFORWARD
When `--proxy` is specified, the server's `channel_open_direct_tcpip` handler forwards through a local proxy:
```rust
async fn channel_open_direct_tcpip(
    &mut self,
    host: &str,
    port: u32,
    ...
) -> Result<Channel<Msg>, Self::Error> {
    // Exactly one of the following applies, depending on the `--proxy` setting.

    // Option 1: Connect directly
    let stream = TcpStream::connect((host, port as u16)).await?;

    // Option 2: Connect through SOCKS5 proxy
    // let stream = connect_socks5(proxy_addr, host, port).await?;

    // Option 3: Connect through HTTP CONNECT proxy
    // let stream = connect_http_proxy(proxy_addr, host, port).await?;

    // Then bidirectional copy between SSH channel and stream
    Ok(channel)
}
```
SOCKS5 client implementation is simple (5-byte handshake, variable-length connect). HTTP CONNECT is also straightforward. Both can be implemented in a few hundred lines.
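As a rough illustration of the client side, the two messages a no-auth SOCKS5 client sends can be built as follows (byte layout per RFC 1928). The function names are hypothetical, and reading the proxy's replies is left out.

```rust
/// Greeting: version 5, one auth method offered, method 0x00 (no authentication).
fn socks5_greeting() -> [u8; 3] {
    [0x05, 0x01, 0x00]
}

/// CONNECT request addressed by domain name, so the proxy resolves DNS itself.
/// Returns `None` if the host is empty or longer than the 255 bytes the
/// one-byte length field can express.
fn socks5_connect_request(host: &str, port: u16) -> Option<Vec<u8>> {
    if host.is_empty() || host.len() > 255 {
        return None;
    }
    let mut req = Vec::with_capacity(7 + host.len());
    // VER = 5, CMD = 1 (CONNECT), RSV = 0, ATYP = 3 (domain name), then the name length.
    req.extend_from_slice(&[0x05, 0x01, 0x00, 0x03, host.len() as u8]);
    req.extend_from_slice(host.as_bytes());
    // Port in network byte order.
    req.extend_from_slice(&port.to_be_bytes());
    Some(req)
}
```

Addressing by domain name (ATYP 3) rather than by IP keeps name resolution on the proxy side, which matches the SOCKS5h behaviour described in the DNS section.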
## 5. Dependency Assessment
| Dependency | Purpose | Maturity | Risk |
|------------|---------|----------|------|
| `russh` | SSH client & server | High (used in dispatch, well-maintained) | Low — already proven |
| `tun-rs` (v2) | TUN/TAP interface | High (cross-platform, prod-tested, bench'd at 70Gbps) | Low — well-maintained |
| `tokio-rustls` | TLS layer | High (standard Rust TLS) | Low — widely used |
| `rustls` | TLS implementation | High | Low — no ring dependency needed with aws-lc-rs |
| `smoltcp` | Userspace TCP/IP stack (TUN mode) | Medium-High | Medium — complex but well-proven |
| `clap` | CLI args | High | None |
| `tracing` | Structured logging | High | None |
| `anyhow/thiserror` | Error handling | High | None |
| `tokio` | Async runtime | High | None |
**No immature or risky dependencies.** Every crate is well-established with active maintenance.
## 6. Risk Assessment
### 6.1 Technical Risks
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| TUN mode complexity (TCP state, IP parsing) | Medium | Medium | Start with SOCKS5 mode; TUN is advanced feature |
| Cross-platform TUN differences | Medium | Medium | tun-rs handles most; `--unshare` for Linux privilege separation |
| TLS + SSH interaction edge cases | Low | Low | Both are well-tested; russh's `connect_stream` / `run_stream` abstracts transport |
| Performance under load | Low | Medium | russh multiplexes channels; tun-rs has benchmarked 35+ Gbps async |
| DPI detecting SSH banner over TLS | Medium | High | After TLS, the SSH banner ("SSH-2.0-...") is encrypted. But SNI reveals domain. Use `Config { anonymous: true }` to minimize fingerprint, or configure `client_id` to look like a web server. |
### 6.2 Protocol-Level Risks
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| SSH protocol fingerprinting (packet sizes, timing) | Medium | Medium | Pad messages, add random delays. russh doesn't do this natively — would need custom channel wrapping. |
| SNI leaks domain in TLS handshake | High | Low | Use an innocuous domain. Could also explore ECH (Encrypted Client Hello) in rustls if available. |
| Deep packet inspection identifying SSH patterns even over TLS | Low-Medium | Medium | The TLS layer prevents payload inspection. Only traffic analysis (sizes, timing) is possible. Padding and traffic shaping could help. |
| Countries blocking SSH traffic on port 22 | Already happening | N/A | That's the whole point — we run SSH over TLS on port 443 |
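The "pad messages" mitigation implies some framing so the receiver can strip the filler again. The following is only a sketch under assumptions: the 4-byte header layout and the names `encode_padded` / `decode_padded` are invented here, and the pad length is passed in by the caller (a real implementation would draw it at random).

```rust
/// Encode `payload` followed by `pad_len` filler bytes.
/// Hypothetical layout: u16 payload length | u16 pad length | payload | padding.
/// Returns `None` if the payload does not fit in a u16 length field.
pub fn encode_padded(payload: &[u8], pad_len: u16) -> Option<Vec<u8>> {
    if payload.len() > u16::MAX as usize {
        return None;
    }
    let payload_len = payload.len() as u16;
    let mut frame = Vec::with_capacity(4 + payload.len() + pad_len as usize);
    frame.extend_from_slice(&payload_len.to_be_bytes());
    frame.extend_from_slice(&pad_len.to_be_bytes());
    frame.extend_from_slice(payload);
    // Zero filler is fine here because the frame travels inside the SSH channel.
    frame.resize(frame.len() + pad_len as usize, 0);
    Some(frame)
}

/// Decode one frame, returning the payload and the number of bytes consumed.
/// Returns `None` if `buf` does not yet hold a complete frame.
pub fn decode_padded(buf: &[u8]) -> Option<(&[u8], usize)> {
    let header = buf.get(..4)?;
    let payload_len = u16::from_be_bytes([header[0], header[1]]) as usize;
    let pad_len = u16::from_be_bytes([header[2], header[3]]) as usize;
    let total = 4 + payload_len + pad_len;
    if buf.len() < total {
        return None;
    }
    Some((&buf[4..4 + payload_len], total))
}
```

Padding hides individual message sizes but not timing, so it addresses only part of the traffic-analysis risk listed above.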
### 6.3 Usability Risks
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Requires self-hosted server | By design | Medium | Document simple deployment. Provide Docker image. Consider one-command install script. |
| Root/CAP_NET_ADMIN needed for TUN on Linux | High | Medium | Provide `--unshare` mode. SOCKS5 mode needs no privileges. |
| Certificate management for TLS mode | Medium | Low | Support self-signed certs, ACME (Let's Encrypt), or manual cert paths. |
## 7. Implementation Plan
### Phase 1: MVP (2-3 days)
**SOCKS5 proxy mode only. No TUN. Client + server.**
1. **Server binary** (`ghost serve`)
- russh server implementation with public key auth
- `channel_open_direct_tcpip` handler: connect to target directly or via outbound proxy
- Optional TLS wrapping via `tokio-rustls` + `server::run_stream`
- Config: listen address, host key path, authorized keys, TLS options, proxy options
2. **Client binary** (`ghost connect`)
- russh client with public key auth
- Local SOCKS5 server that forwards connections through SSH `channel_open_direct_tcpip`
- Optional TLS wrapping via `tokio-rustls` + `client::connect_stream`
- Config: server address, identity key, TLS options, SOCKS5 listen address
3. **Testing**
- Integration test: client → server → HTTP target
- Test with: `curl --socks5-hostname 127.0.0.1:1080 https://example.com`
- Test TLS mode against DPI-like inspection
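The client's local SOCKS5 server has to turn each CONNECT request into a host and port before it can open a `channel_open_direct_tcpip` channel. A minimal, std-only sketch of that parsing step follows; it assumes the method negotiation has already happened and the whole request is in the buffer, and `Target` and `parse_connect` are illustrative names.

```rust
use std::net::{Ipv4Addr, Ipv6Addr};

/// Destination requested by a SOCKS5 client.
#[derive(Debug, PartialEq, Eq)]
pub enum Target {
    Ip4(Ipv4Addr, u16),
    Ip6(Ipv6Addr, u16),
    /// Hostname passed through unresolved, so the server does the DNS lookup.
    Domain(String, u16),
}

/// Parse a SOCKS5 CONNECT request: VER CMD RSV ATYP DST.ADDR DST.PORT.
pub fn parse_connect(buf: &[u8]) -> Option<Target> {
    // VER must be 5, CMD must be 1 (CONNECT), RSV must be 0.
    if buf.len() < 4 || buf[0] != 0x05 || buf[1] != 0x01 || buf[2] != 0x00 {
        return None;
    }
    let rest = &buf[4..];
    let port = |b: &[u8]| u16::from_be_bytes([b[0], b[1]]);
    match buf[3] {
        // ATYP 1: four address bytes plus two port bytes.
        0x01 => {
            let a = rest.get(..6)?;
            Some(Target::Ip4(Ipv4Addr::new(a[0], a[1], a[2], a[3]), port(&a[4..])))
        }
        // ATYP 3: one length byte, the name, then two port bytes.
        0x03 => {
            let len = *rest.first()? as usize;
            let a = rest.get(1..1 + len + 2)?;
            let host = std::str::from_utf8(&a[..len]).ok()?.to_owned();
            Some(Target::Domain(host, port(&a[len..])))
        }
        // ATYP 4: sixteen address bytes plus two port bytes.
        0x04 => {
            let a = rest.get(..18)?;
            let mut octets = [0u8; 16];
            octets.copy_from_slice(&a[..16]);
            Some(Target::Ip6(Ipv6Addr::from(octets), port(&a[16..])))
        }
        _ => None,
    }
}
```

With `curl --socks5-hostname`, requests arrive as the `Domain` variant, which is the case that keeps DNS resolution on the server side.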
### Phase 2: Port Forwarding (1 day)
4. **Client: explicit port forwards** (`--forward local:remote:port`)
- Direct reimplementation of SSH `-L` and `-R`
- Uses `channel_open_direct_tcpip` for local forwards
- Uses `tcpip_forward` / handler callback for remote forwards
5. **Client: SOCKS5 with DNS** (SOCKS5h)
- Domain names resolved server-side, not client-side
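The `--forward local:remote:port` argument needs a small parser. The sketch below assumes the three fields mean local port, remote host, and remote port (mirroring `ssh -L 8080:host:80`); that reading, and the `ForwardSpec` type, are assumptions rather than a settled CLI design. Bracketed IPv6 hosts are not handled.

```rust
/// One explicit port forward, as given on the command line.
#[derive(Debug, PartialEq, Eq)]
pub struct ForwardSpec {
    pub local_port: u16,
    pub remote_host: String,
    pub remote_port: u16,
}

/// Parse `LOCAL_PORT:REMOTE_HOST:REMOTE_PORT`.
pub fn parse_forward(spec: &str) -> Result<ForwardSpec, String> {
    let parts: Vec<&str> = spec.split(':').collect();
    match parts.as_slice() {
        [local, host, remote] if !host.is_empty() => {
            let local_port = local
                .parse::<u16>()
                .map_err(|e| format!("bad local port {:?}: {}", local, e))?;
            let remote_port = remote
                .parse::<u16>()
                .map_err(|e| format!("bad remote port {:?}: {}", remote, e))?;
            Ok(ForwardSpec {
                local_port,
                remote_host: (*host).to_owned(),
                remote_port,
            })
        }
        _ => Err(format!("expected LOCAL_PORT:REMOTE_HOST:REMOTE_PORT, got {:?}", spec)),
    }
}
```

Returning a descriptive error string keeps the sketch dependency-free; with `clap` this function could serve as a value parser.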
### Phase 3: TUN Mode (2-3 days)
6. **Client: TUN interface mode** (`--tun`)
- Create TUN device via `tun-rs`
- IP packet routing through SSH channels
- Either: raw packet forwarding (simpler, but fragile) or smoltcp integration (robust, but more code)
- Recommend: use tun2proxy's `ip-stack` crate or similar for TCP reconstruction
- Virtual DNS for TUN mode
7. **Privilege separation**
- `--unshare` mode for Linux (create network namespace, unshare)
- Document CAP_NET_ADMIN requirement
### Phase 4: Hardening & Polish (1-2 days)
8. **Obfuscation improvements**
- SSH banner customization (`client_id` config)
- Random padding in channel data
- Traffic shaping / constant-rate padding (optional, advanced)
9. **Server stealth**
- Non-SSH connection detection: serve fake nginx 404 on TLS port
- Dual-protocol listener: HTTPS for browsers, SSH for ghost clients
10. **Auto-reconnect**
- Exponential backoff reconnect on SSH session drop
- TUN interface survives reconnect (new connections work, in-flight connections fail gracefully)
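For the auto-reconnect item, the delay schedule can be computed as a pure function. This is a sketch only: `backoff_delay` and its parameters are hypothetical, and jitter is left out to keep the arithmetic visible.

```rust
use std::time::Duration;

/// Delay before reconnect attempt number `attempt` (0-based):
/// `base * 2^attempt`, never more than `cap`.
pub fn backoff_delay(attempt: u32, base: Duration, cap: Duration) -> Duration {
    // 2^attempt, saturating so very large attempt counts cannot overflow.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    // If the multiplication itself overflows, fall back to the cap.
    base.checked_mul(factor).map_or(cap, |d| d.min(cap))
}
```

With a 1 s base and a 60 s cap the delays run 1 s, 2 s, 4 s, 8 s, and so on until they hit the cap. Adding a random jitter to each delay would keep many clients from reconnecting in lockstep after a server restart.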
### Phase 5: Distribution (1 day)
11. **Build & packaging**
- Static musl binary for Linux
- Docker image
- systemd unit file
- One-line install script
## 8. Estimated Timeline
| Phase | Duration | Cumulative |
|-------|----------|------------|
| Phase 1: SOCKS5 MVP | 2-3 days | 2-3 days |
| Phase 2: Port Forwarding | 1 day | 3-4 days |
| Phase 3: TUN Mode | 2-3 days | 5-7 days |
| Phase 4: Hardening & Polish | 1-2 days | 6-9 days |
| Phase 5: Distribution | 1 day | 7-10 days |
With LLM-assisted development, the MVP (Phase 1) could realistically be done in 1-2 focused sessions, and the full feature set in under a week.
## 9. Open Questions
1. **Project name** — `ghost`, `wraith`, `shade`, `spectre`, something else? Needs to be catchy, not conflict with existing Rust crates, and suggest stealth/mobility.
2. **TUN vs smoltcp** — Should TUN mode integrate smoltcp for a userspace TCP stack, or try the simpler "just forward packets and let the OS handle TCP" approach? Smoltcp is more work but more robust. tun2proxy's approach (which uses their own `ip-stack`) suggests userspace TCP is the way to go for reliability.
3. **TLS certificate story** — Should the server support ACME/Let's Encrypt auto-provisioning (like https_proxy does), or is manual cert management sufficient? Auto-provisioning is more user-friendly but adds significant complexity and a dependency on the ACME protocol.
4. **Mobile support** — Should we target iOS/Android eventually? tun-rs supports both via platform APIs, but mobile is a much bigger scope. Probably Phase 6+.
5. **Multi-user server** — Should the server support multiple simultaneous clients? russh's server model handles this naturally (each connection gets its own Handler instance), but access control (per-user ACLs, bandwidth limits) would add complexity.
6. **Crates structure** — Single binary with subcommands (`ghost serve`, `ghost connect`), or separate binaries? Single crate with `#[tokio::main]` dispatch seems cleanest for MVP.
## 10. Conclusion
**This is feasible and straightforward.** The core mechanics — SSH tunnel via russh, TLS wrapping via tokio-rustls, TUN interface via tun-rs — are all solved problems with mature Rust libraries. The dispatch codebase proves russh is production-ready for this kind of work. The `connect_stream` / `run_stream` API in russh makes TLS wrapping a clean layering, not a hack.
The biggest design decision is TUN mode approach (raw packets vs. userspace TCP), and the recommendation is to start with SOCKS5 mode and add TUN later. This gives a working tool in 2-3 days that covers the primary use case (private tunneling that doesn't look like VPN traffic).
The project is well-scoped, the risk profile is low, and the existing tooling (russh, tun-rs, tokio-rustls) handles the hard parts. This is a "few days of focused work" estimate, not a "few weeks."
## 11. iroh Transport — Feasibility Addendum
### 11.1 The Insight
russh's `connect_stream()` and `server::run_stream()` accept **any** `AsyncRead + AsyncWrite + Unpin + Send` stream. The iroh project provides exactly such a stream — a QUIC bidirectional stream (`open_bi()` / `accept_bi()`) where both `SendStream` and `RecvStream` implement `tokio::io::AsyncWrite` and `tokio::io::AsyncRead` respectively.
This means **iroh can serve as a transport layer beneath SSH**, the same way TLS can. The architecture becomes:
```
┌──────────────────────────────────────────────────┐
│ APPLICATION │
│ (SOCKS5 / TUN / port-forward) │
├──────────────────────────────────────────────────┤
│ SSH (russh) │
│ channel_open_direct_tcpip/etc. │
├──────────────────────────────────────────────────┤
│ Transport Layer (SWAPPABLE) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────┐ │
│ │ TCP │ │ TLS │ │ iroh │ │
│ │(direct) │ │(obfusc) │ │ (P2P QUIC) │ │
│ └──────────┘ └──────────┘ └──────────────┘ │
└──────────────────────────────────────────────────┘
```
### 11.2 Why iroh is Compelling
iroh solves the **biggest deployment problem** with SSH tunnels: the server needs a public IP and open port.
With iroh as transport:
1. **No public IP needed** — Server and client both connect outbound to iroh's relay servers. Hole-punching attempts direct UDP in the background.
2. **No open firewall ports** — The server only needs outbound HTTPS to the relay. No inbound 22 or 443 required.
3. **NAT traversal for free** — iroh's relay + hole-punching means peers behind CGNAT or strict firewalls can still connect.
4. **Ed25519-based addressing** — Peers are identified by public key (EndpointId), no DNS or IP addresses needed.
5. **Built-in address discovery** — pkarr DNS records let you find a peer knowing only their public key.
6. **Still SSH underneath** — All the channel multiplexing, port forwarding, SOCKS5 logic still works. iroh is just the wire.
The use cases multiply:
- **Home server behind NAT**: No reverse proxy, no dynamic DNS, no port forwarding. Just run the server, share the EndpointId.
- **Temporary infrastructure**: Spin up a server anywhere (even behind corporate NAT), connect by public key.
- **Internal services**: Expose Postgres/Redis etc. over an SSH connection that traverses any NAT, no VPN required.
- **Censorship circumvention**: SSH over iroh QUIC to a relay that uses standard HTTPS. While traffic is relayed, a deep packet inspector sees HTTPS traffic to a relay server, not SSH; once hole-punching succeeds, it sees direct QUIC over UDP instead.
### 11.3 How It Works — The Code
The integration is trivially clean because both primitives implement the right traits:
**Client side:**
```rust
// Create iroh endpoint
let endpoint = Endpoint::builder(presets::N0)
    .alpns(vec![b"ghost-ssh/1".to_vec()])
    .bind()
    .await?;
// Connect to peer (no IP needed, just the public key)
// peer_id_bytes: [u8; 32], the server's Ed25519 public key
let peer_id = EndpointId::from_bytes(&peer_id_bytes)?;
let conn = endpoint.connect(EndpointAddr::from(peer_id), b"ghost-ssh/1").await?;
// Open a bidirectional QUIC stream
let (send_stream, recv_stream) = conn.open_bi().await?;
// Combine into a single AsyncRead+AsyncWrite
let iroh_stream = tokio::io::join(recv_stream, send_stream);
// OR use a custom wrapper that implements AsyncRead+AsyncWrite
// Run SSH client over the iroh stream
let handle = client::connect_stream(
    Arc::new(client_config),
    iroh_stream,
    client_handler,
).await?;
```
**Server side:**
```rust
// Create iroh endpoint
let endpoint = Endpoint::builder(presets::N0)
.alpns(vec![b"ghost-ssh/1".to_vec()])
.bind()
.await?;
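// Accept incoming connections. Each peer gets its own task so that a slow
// handshake or a peer that never opens a stream cannot stall the accept loop.
let server_config = Arc::new(server_config);
while let Some(incoming) = endpoint.accept().await {
    let config = server_config.clone();
    let handler = server_handler.clone(); // assumes the handler is Clone
    tokio::spawn(async move {
        let conn = incoming.await?;
        // For each connection, accept a bidirectional stream
        let (send_stream, recv_stream) = conn.accept_bi().await?;
        let iroh_stream = tokio::io::join(recv_stream, send_stream);
        // Run SSH server over the iroh stream, then wait for the session to end
        let session = server::run_stream(config, iroh_stream, handler).await?;
        session.await?;
        anyhow::Ok(())
    });
}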
```
**Or using iroh's Router + ProtocolHandler pattern:**
```rust
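#[derive(Debug)] // ProtocolHandler requires Debug
struct GhostSshProtocol;
impl ProtocolHandler for GhostSshProtocol {
    async fn accept(&self, connection: Connection) -> Result<(), AcceptError> {
        // iroh already handled connection acceptance
        // We can accept bi streams on the connection directly
        // Or: each SSH session could be a new bi stream on the same connection
        let (send, recv) = connection.accept_bi().await
            .map_err(AcceptError::from_err)?;
        let stream = tokio::io::join(recv, send);
        // server_config: Arc<server::Config>, assumed reachable here
        // (for example as a field of the handler)
        let session = server::run_stream(server_config.clone(), stream, GhostHandler).await
            .map_err(AcceptError::from_err)?;
        // Wait for the SSH session to finish before returning
        session.await.map_err(AcceptError::from_err)
    }
}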
let endpoint = Endpoint::builder(presets::N0).bind().await?;
let router = Router::builder(endpoint)
.accept(b"ghost-ssh/1", GhostSshProtocol)
.spawn();
```
### 11.4 Design Decision: One Stream per Session vs. One Connection with Multiple Streams
There are two ways to layer SSH over iroh:
**Option A: One QUIC bi-stream per SSH session**
- Each SSH session opens a new `open_bi()` stream under a single iroh `Connection`
- The iroh Connection itself persists (one QUIC connection per peer pair)
- Simpler: `open_bi()` gives you a stream, you feed it to `connect_stream()`
- Pro: Connection setup cost amortized. If SSH disconnects, `open_bi()` again is cheap.
- Con: Need to combine `RecvStream` + `SendStream` into a single `AsyncRead+AsyncWrite`
**Option B: One iroh Connection per SSH session (new QUIC connection each time)**
- Each SSH session = one `endpoint.connect()` + the whole connection
- Wasteful: QUIC handshake + iroh relay discovery each time
- Not recommended
**Recommendation: Option A.** One iroh `Connection` per peer pair, one `open_bi()` stream per SSH session. The connection is long-lived; SSH sessions can be re-established cheaply on the same QUIC connection.
### 11.5 Combining `RecvStream + SendStream` into `AsyncRead + AsyncWrite`
QUIC splits streams into separate send and receive halves. russh needs a single duplex stream. Two approaches:
**Approach 1: `tokio::io::join()` (simplest)**
```rust
use tokio::io::{self, AsyncRead, AsyncWrite};
fn join_iroh_stream(
    recv: iroh::endpoint::RecvStream,
    send: iroh::endpoint::SendStream,
) -> impl AsyncRead + AsyncWrite + Unpin + Send {
    io::join(recv, send)
}
```
`tokio::io::join` returns a `Join<A, B>` that implements both `AsyncRead` (from the first) and `AsyncWrite` (from the second). Since `RecvStream: AsyncRead` and `SendStream: AsyncWrite`, this works directly.
**Approach 2: Custom wrapper (more control)**
```rust
struct IrohStream {
    recv: iroh::endpoint::RecvStream,
    send: iroh::endpoint::SendStream,
}
impl AsyncRead for IrohStream { /* delegate to recv */ }
impl AsyncWrite for IrohStream { /* delegate to send */ }
```
**Recommendation: Start with `tokio::io::join`.** It's one line and has the right trait implementations. Only switch to a custom wrapper if profiling shows overhead (unlikely).
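To make the delegation idea behind Approach 2 concrete without pulling in tokio or iroh, here is a synchronous analogue built on `std::io`. It is only an illustration of the pattern: `Duplex` is a hypothetical type, and the real wrapper would implement `AsyncRead` / `AsyncWrite` with their `poll_*` methods instead.

```rust
use std::io::{self, Read, Write};

/// Joins a read half and a write half into one duplex value.
pub struct Duplex<R, W> {
    recv: R,
    send: W,
}

impl<R, W> Duplex<R, W> {
    pub fn new(recv: R, send: W) -> Self {
        Duplex { recv, send }
    }

    /// Split back into the two halves.
    pub fn into_parts(self) -> (R, W) {
        (self.recv, self.send)
    }
}

// Reads are delegated to the receive half only.
impl<R: Read, W> Read for Duplex<R, W> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.recv.read(buf)
    }
}

// Writes and flushes are delegated to the send half only.
impl<R, W: Write> Write for Duplex<R, W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.send.write(buf)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.send.flush()
    }
}
```

The two halves never touch each other's state, which is why the join carries no meaningful overhead.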
### 11.6 Relay Considerations
iroh provides two relay options:
1. **Default n0 relay servers** (`https://use1-1.relay.n0.iroh.network.`) — free, operated by n0. Good for getting started and testing.
2. **Self-hosted relay** (`iroh-relay` crate) — The relay server is part of the iroh project. Can be self-hosted for complete independence.
For this project:
- **Development/quick start**: Use n0 relays (they're free and reliable)
- **Production/privacy**: Self-host the relay server. It's a single binary (`iroh-relay`) that can run on any VPS. The relay sees only encrypted QUIC packets — it cannot read SSH traffic.
- **Paranoid**: Disable relay entirely. Both peers must have direct network connectivity. No third-party dependency.
The `RelayMode` enum handles this:
```rust
// Default n0 relays
let endpoint = Endpoint::builder(presets::N0).bind().await?;
// Self-hosted relay (relay map construction shown schematically)
let relay_map = RelayMap::from([(relay_url, Some(direct_addr))]);
let endpoint = Endpoint::builder(presets::N0)
    .relay_mode(RelayMode::Custom(relay_map))
    .bind()
    .await?;
// No relay (direct only)
let endpoint = Endpoint::builder(presets::N0)
    .relay_mode(RelayMode::Disabled)
    .bind()
    .await?;
### 11.7 Updated Architecture with iroh Transport
```
┌───────────────────────────────────────────────────────────┐
│ CLIENT │
│ │
│ ┌──────────┐ ┌───────────┐ ┌────────────────────┐ │
│ │ TUN / │ │ SSH │ │ Transport │ │
│ │ SOCKS5 / │───▶│ Client │───▶│ (selectable) │ │
│ │ Port- │ │ (russh) │ │ │ │
│ │ Forward │ │ │ │ ┌────────────────┐ │ │
│ └──────────┘ └───────────┘ │ │ TCP direct │ │ │
│ │ │ TLS (rustls) │ │ │
│ │ │ iroh (QUIC) │ │ │
│ │ └────────────────┘ │ │
│ └────────────────────┘ │
└───────────────────────────────────────────────────────────┘
┌───────────────────────────────────────────────────────────┐
│ SERVER │
│ │
│ ┌──────────┐ ┌───────────┐ ┌────────────────────┐ │
│ │ Outbound │ │ SSH │ │ Transport │ │
│ │ Proxy / │◀───│ Server │◀───│ (selectable) │ │
│ │ Direct │ │ (russh) │ │ │ │
│ │ Forward │ │ │ │ ┌────────────────┐ │ │
│ └──────────┘ └───────────┘ │ │ TCP listener │ │ │
│ │ │ TLS (rustls) │ │ │
│ │ │ iroh (QUIC) │ │ │
│ │ └────────────────┘ │ │
│ └────────────────────┘ │
└───────────────────────────────────────────────────────────┘
┌──────────────┐
│ iroh Relay │ (optional, for NAT)
│ (self-host │
│ or n0) │
└──────────────┘
Transport modes:
--transport tcp Direct TCP (default, simplest)
--transport tls TCP + TLS (obfuscation)
--transport iroh iroh QUIC (NAT traversal, no public IP)
--transport iroh+tls iroh QUIC + TLS (NAT traversal + obfuscation)
```
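The four `--transport` values listed above map naturally onto an enum. The sketch below is one possible shape, not a committed design: `TransportMode` and `needs_public_address` are hypothetical names, and the helper simply encodes the "no public IP" property that the mode list attributes to the iroh variants.

```rust
use std::str::FromStr;

/// Transport selected with `--transport`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransportMode {
    Tcp,
    Tls,
    Iroh,
    IrohTls,
}

impl FromStr for TransportMode {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "tcp" => Ok(TransportMode::Tcp),
            "tls" => Ok(TransportMode::Tls),
            "iroh" => Ok(TransportMode::Iroh),
            "iroh+tls" => Ok(TransportMode::IrohTls),
            other => Err(format!("unknown transport: {}", other)),
        }
    }
}

impl TransportMode {
    /// TCP and TLS need a reachable server address; the iroh modes
    /// address the peer by public key instead.
    pub fn needs_public_address(self) -> bool {
        matches!(self, TransportMode::Tcp | TransportMode::Tls)
    }
}
```

Because the type implements `FromStr`, an argument parser can convert the flag value with `"iroh+tls".parse::<TransportMode>()`.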
### 11.8 iroh Transport — Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| iroh API instability (it's v0.x) | Medium | Medium | Pin version; iroh's core stream API is stable (it's just QUIC) |
| Relay dependency for initial connectivity | Low | Low | Self-host relay; or direct-only mode for LAN |
| QUIC stream vs TCP semantics differences | Low | Medium | QUIC streams are reliable ordered byte streams, same semantics as TCP. russh won't know the difference. |
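| Performance overhead of QUIC + SSH | Low | Low | QUIC is fast, and its loss recovery and connection migration may help on lossy or roaming links. Because each SSH session runs inside a single QUIC stream, its channels still share one ordered byte stream, so QUIC's freedom from cross-stream head-of-line blocking does not apply within a session. |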
| iroh crate size / compile time | Low | Low | iroh pulls in quinn + rustls + lots of networking. But we already need rustls for TLS mode. The incremental cost is the QUIC stack. |
**Key observation**: QUIC streams have identical reliability and ordering guarantees to TCP. russh's `connect_stream()` / `run_stream()` will work correctly over iroh QUIC streams with no modifications.
### 11.9 Updated CLI Sketch with iroh
```bash
# Server — iroh mode (no public IP needed!)
ghost serve --key ~/.ssh/id_ed25519 --transport iroh
# Prints endpoint ID: e.g., "abc123..."
# Clients connect using this ID
# Server — iroh mode with self-hosted relay
ghost serve --key ~/.ssh/id_ed25519 --transport iroh \
--iroh-relay https://my-relay.example.com
# Client — connect via iroh (no IP needed!)
ghost connect --peer abc123def456... --transport iroh --socks5 1080
# Client — connect via iroh with TUN
ghost connect --peer abc123def456... --transport iroh --tun
# Client — traditional TCP mode (still works)
ghost connect --server 1.2.3.4:443 --transport tls --socks5 1080
```
### 11.10 Implementation Impact
Adding iroh as a transport option is **incremental** — it doesn't change the SSH layer at all:
1. **Transport trait**: Define a `Transport` trait that produces a boxed duplex stream. A trait object can name only one non-auto trait, so `AsyncRead + AsyncWrite` are first combined behind a helper trait (`DuplexStream`, a name chosen here):
```rust
trait DuplexStream: AsyncRead + AsyncWrite + Unpin + Send {}
impl<T: AsyncRead + AsyncWrite + Unpin + Send> DuplexStream for T {}
trait Transport {
    async fn connect(&self) -> Result<Box<dyn DuplexStream>>;
}
```
2. **Three implementations**:
- `TcpTransport` — plain TCP
- `TlsTransport` — TCP + tokio-rustls
- `IrohTransport` — iroh endpoint + `open_bi()` + `tokio::io::join(recv, send)`
3. **Server side**: Same trait, different direction:
```rust
trait TransportAcceptor {
    async fn accept(&self) -> Result<Box<dyn DuplexStream>>;
}
```
4. **The SSH layer never changes.** russh's `connect_stream()` / `run_stream()` takes the transport stream, and everything else stays the same.
### 11.11 Dependency Impact
| Dependency | Added? | Size concern |
|------------|--------|-------------|
| `iroh` (includes iroh-base) | Yes, feature-gated | Yes — pulls in QUIC stack, DNS, relay client |
| `n0-error` | Yes (small) | No |
| `tokio` | Already present | No |
| `rustls` | Already present (for TLS mode) | No |
**Recommendation**: Make iroh a feature flag (`--features iroh`) so the base install stays lean. Users who want P2P capability opt in:
```toml
[features]
default = ["tls"]
tls = ["tokio-rustls", "rustls-pemfile"]
iroh = ["dep:iroh"]
tun = ["dep:tun-rs", "dep:smoltcp"]
```
### 11.12 The Compelling Narrative
With iroh as a transport option, this tool becomes something genuinely new:
- **Not just a VPN alternative** — it's a VPN alternative that doesn't need port forwarding, public IPs, or DNS records.
- **Not just SSH tunneling** — it's SSH tunneling that works between any two machines on the internet, regardless of NAT configuration.
- **Not just for censorship circumvention** — it's how you securely expose internal services (Postgres, Redis, admin panels) from machines behind corporate firewalls or home networks.
The "ghetto VPN" becomes a **zero-config mesh VPN**. Spin up `ghost serve` on any machine, share the public key, connect from anywhere. The relay server is optional (self-host or n0's free tier). And underneath it's just SSH, doing what SSH does best.
This isn't theoretical: the API compatibility is exact. iroh's `RecvStream + SendStream` implement `AsyncRead + AsyncWrite`, and russh's `connect_stream` / `run_stream` accept `AsyncRead + AsyncWrite`. One call to `tokio::io::join(recv, send)` and you have a transport stream that russh can use.


@@ -0,0 +1,56 @@
# Certbot — dev1
## Overview
Let's Encrypt SSL certificates managed by certbot. Used by nginx for HTTPS.
## Installed
certbot (snap package on Ubuntu 24.04)
## Certificates
| Domain | Expiry | Path |
|--------|--------|------|
| git.alk.dev | 2026-06-18 | /etc/letsencrypt/live/git.alk.dev/ |
## File Locations
```
/etc/letsencrypt/live/git.alk.dev/
├── fullchain.pem # Server cert + chain
├── privkey.pem # Private key
├── cert.pem # Server cert only
├── chain.pem # Chain only
└── README
```
Renewal config: `/etc/letsencrypt/renewal/git.alk.dev.conf`
## Renewal
Certbot auto-renews via systemd timer. Certificates renew when fewer than 30 days remain.
```bash
# Check certificates and expiry
sudo certbot certificates
# Dry run renewal
sudo certbot renew --dry-run
# Force renewal (if needed)
sudo certbot renew --force-renewal
# Reload nginx after renewal
sudo systemctl reload nginx
```
## Initial Certificate
If adding a new domain, obtain the cert with the standalone plugin. It runs its own temporary web server, so nginx doesn't need to be running, but nothing else may be bound to port 80 while it runs:
```bash
sudo certbot certonly --standalone -d <domain> --agree-tos -m <email>
```
Port 80 must be open for the ACME challenge. The api.alk.dev UFW rule allows HTTP for this purpose.


@@ -0,0 +1,106 @@
# Fail2ban — dev1
## Status
Active. 7 jails. Uses the `nftables` ban action by default, with the `systemd` journal backend.
## Active Jails
| Jail | Port | Filter | Max Retry | Find Time | Ban Time | Log Source |
|------|------|--------|-----------|-----------|----------|------------|
| sshd | ssh | sshd | default (5) | default (10m) | default (10m) | systemd journal |
| gitea | ssh | gitea | 5 | 10m | 1h | journald (CONTAINER_NAME=gitea) |
| nginx-badbots | http,https | nginx-badbots | 5 | 10m | 1h | /var/log/nginx/access.log |
| nginx-botsearch | http,https | nginx-botsearch | default | default | default | /var/log/nginx/access.log |
| nginx-limit-req | http,https | nginx-limit-req | default | default | default | /var/log/nginx/error.log |
| nginx-401 | http,https | nginx-401 | 5 | 10m | 1h | /var/log/nginx/access.log |
| nginx-403 | http,https | nginx-403 | 10 | 10m | 30m | /var/log/nginx/access.log |
## Configuration
Default settings in `/etc/fail2ban/jail.d/defaults-debian.conf`:
```ini
[DEFAULT]
banaction = nftables
banaction_allports = nftables[type=allports]
backend = systemd
```
Jail configs in `/etc/fail2ban/jail.d/`:
- `gitea.conf` — Gitea jail with Docker journald log driver
- `nginx.conf` — nginx-related jails
## Gitea Jail Details
Gitea runs in Docker with the `journald` log driver. The fail2ban filter uses `journalmatch` to read only Gitea container logs:
```ini
[gitea]
enabled = true
port = ssh
filter = gitea
backend = systemd
journalmatch = CONTAINER_NAME=gitea
maxretry = 5
findtime = 10m
bantime = 1h
action = iptables-allports[chain="DOCKER-USER"]
```
The `DOCKER-USER` chain ensures bans affect Docker traffic.
## Custom Filters
Default install includes `gitea.conf`, `nginx-401.conf`, `nginx-403.conf` in `/etc/fail2ban/filter.d/`. Custom filter:
### nginx-badbots (`/etc/fail2ban/filter.d/nginx-badbots.conf`)
Catches malicious requests that the other nginx jails miss: `.env`/`.git` probes, PROPFIND/CONNECT abuse, common exploit paths (`/actuator`, `/cgi-bin`, `/ecp`, `/SDK`), and binary/garbage requests. Matches 400/404/405/413 status codes for known-bad path patterns only — legitimate 404s (e.g. wrong Gitea repo name) are not matched.
## Lesson Learned: Default Filters Miss Most Scanner Traffic
The default fail2ban nginx filters (`nginx-botsearch`, `nginx-401`, `nginx-403`, `nginx-limit-req`) only catch a narrow subset of malicious requests:
- **nginx-botsearch** only matches `<webmail|phpmyadmin|wordpress|cgi-bin|mysqladmin>` paths returning **404**. Misses `.env`, `.git/config`, `/actuator`, `/SDK`, `/ecp`, crypto mining RPC, PROPFIND/CONNECT abuse, and binary garbage — all of which return 400/405 instead of 404.
- **nginx-401/403** only trigger on those specific status codes. Most scanners get 400 or 405.
- **nginx-limit-req** only triggers when the rate limiter in nginx actually rejects a request.
**Result**: A site with heavy scanner traffic can show zero bans from all four default jails. The `nginx-badbots` custom filter closes this gap by matching known-bad path patterns on the 400/404/405/413 responses that the default filters mostly ignore.
### Verifying Jail Coverage
When setting up fail2ban on a new host:
1. Install jails and filters first
2. Let traffic flow for a few hours
3. Run `sudo fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/<filter>.conf` to verify each filter matches expected lines
4. Check `sudo fail2ban-client status <jail>` for each jail to confirm it shows `Total failed` above 0. If any jail stays at 0 for hours on a public-facing host, the filter likely has a gap
5. Inspect logs manually: `awk '$9>=400' /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c | sort -rn` shows which status codes scanners are hitting
### Adding the nginx-badbots Filter to a New Host
1. Copy `/etc/fail2ban/filter.d/nginx-badbots.conf` to the new host
2. Append the jail config to `/etc/fail2ban/jail.d/nginx.conf`:
```ini
[nginx-badbots]
enabled = true
port = http,https
filter = nginx-badbots
logpath = /var/log/nginx/access.log
maxretry = 5
findtime = 10m
bantime = 1h
```
3. `sudo fail2ban-client reload`
## Commands
```bash
sudo fail2ban-client status
sudo fail2ban-client status gitea
sudo fail2ban-client set gitea unbanip <IP>
sudo journalctl -u fail2ban -f
```