docs: complete Phase 0 architecture — spec updates, review fixes, and link portability
Update four existing specs (overview, server, napi-and-pubsub, call-protocol) to reflect Phase 0 decisions: three-layer model, IdentityProvider, ForwardingPolicy, OperationEnv, static/dynamic config split. Review all 9 Phase 0a ADRs (026-034) for consistency. Fix 4 critical issues from architecture review: missing OQ-SVC-05 in open-questions.md, deprecated hub terminology, undefined AuthService and noq terms. Replace inline OQ text with cross-references per format rules. Add ConfigServiceImpl definition to configuration.md. Port absolute workspace paths to project-relative links by copying referenced docs (feasibility, certbot, fail2ban, event_source_types) into docs/research/.
@@ -7,25 +7,26 @@ last_updated: 2026-06-07
|
||||
|
||||
## Current State
|
||||
|
||||
Architecture specification in active development. Phase 0 foundation ADRs
|
||||
completed (026–034). New spec documents created for identity, services,
|
||||
interface, configuration, storage, flowgraph, and secret service. Existing
|
||||
specs updated for the three-layer model, crate decomposition, and unified
|
||||
identity. See [open-questions.md](open-questions.md) for remaining open
|
||||
questions.
|
||||
Architecture specification in active development. Phase 0 foundation complete:
|
||||
ADRs 001–034 accepted, new spec documents created for all components, existing
|
||||
specs updated for the three-layer model, crate decomposition, unified identity,
|
||||
OperationEnv, and forwarding policy. Remaining open questions: OQ-15 (QUIC
|
||||
coexistence), OQ-19 (WebTransport TLS), OQ-20 (worker registration), OQ-IF-01
|
||||
(Interface session/EventEnvelope), OQ-IF-02 (ForwardingPolicy placement). See
|
||||
[open-questions.md](open-questions.md).
|
||||
|
||||
## Architecture Documents
|
||||
|
||||
| Document | Status | Description |
|
||||
|----------|--------|-------------|
|
||||
| [overview.md](overview.md) | reviewed | Package purpose, exports, dependencies |
|
||||
| [overview.md](overview.md) | reviewed | Package purpose, crate structure, three-layer model, exports, dependencies |
|
||||
| [transport.md](transport.md) | reviewed | Transport abstraction: TCP, TLS, iroh |
|
||||
| [auth.md](auth.md) | draft | Unified auth: SSH + token, IdentityProvider trait |
|
||||
| [call-protocol.md](call-protocol.md) | draft | Bidirectional call/event protocol, operation registry |
|
||||
| [call-protocol.md](call-protocol.md) | draft | Bidirectional call/event protocol, OperationEnv, three dispatch paths |
|
||||
| [client.md](client.md) | reviewed | Client connection, SOCKS5, port forwarding |
|
||||
| [server.md](server.md) | reviewed | Server acceptance, channel handling, proxy |
|
||||
| [server.md](server.md) | reviewed | Server acceptance, IdentityProvider, ForwardingPolicy, channel handling |
|
||||
| [tun-shim.md](tun-shim.md) | deprecated | TUN interface wrapper — **deferred**, use tun2proxy |
|
||||
| [napi-and-pubsub.md](napi-and-pubsub.md) | reviewed | NAPI wrapper and pubsub event target adapter |
|
||||
| [napi-and-pubsub.md](napi-and-pubsub.md) | reviewed | NAPI wrapper, reload API, pubsub event target adapter |
|
||||
| [identity.md](identity.md) | draft | Identity type, IdentityProvider trait, auth flows |
|
||||
| [services.md](services.md) | draft | irpc service layer, OperationEnv, three dispatch paths |
|
||||
| [interface.md](interface.md) | draft | Layer 2: Interface trait, SshInterface, RawFramingInterface |
|
||||
@@ -44,6 +45,9 @@ questions.
|
||||
| [storage.md](../research/storage.md) | draft | Metagraph, identity, ACL, secrets, honker |
|
||||
| [flow.md](../research/flow.md) | draft | FlowGraph, operation graph, call graph, petgraph mapping |
|
||||
| [integration-plan.md](../research/integration-plan.md) | draft | Phased integration plan for services, pubsub, and operations |
|
||||
| [feasibility/](../research/feasibility/) | — | SSH tunnel feasibility assessment and related analyses |
|
||||
| [event-sourcing/](../research/event-sourcing/) | — | Event sourcing patterns and event-driven architecture reference |
|
||||
| [ops/](../research/ops/) | — | Production ops reference: certbot, fail2ban |
|
||||
|
||||
## ADR Table
|
||||
|
||||
@@ -81,6 +85,9 @@ questions.
|
||||
| [033](decisions/033-operationenv-irpc-call-protocol.md) | OperationEnv as universal composition mechanism | Accepted |
|
||||
| [034](decisions/034-head-worker-terminology.md) | Head/worker terminology replacing hub/spoke | Accepted |
|
||||
|
||||
> ADR numbers 020–022 were allocated to proposals that were withdrawn before
|
||||
> acceptance and are not listed.
|
||||
|
||||
## Open Questions
|
||||
|
||||
See [open-questions.md](open-questions.md) for all open and resolved questions.
|
||||
|
||||
@@ -13,6 +13,11 @@ subscriptions, and unidirectional events — all using the same wire format. The
|
||||
protocol is defined as a spec + handler + registry; downstream consumers (NAPI,
|
||||
Python, head/worker) register their own operations without modifying core.
|
||||
|
||||
OperationEnv extends the call protocol with a universal composition mechanism
|
||||
that unifies local dispatch, irpc service dispatch, and remote dispatch. A
|
||||
handler receives `context.env.invoke(namespace, op, input)` and doesn't know
|
||||
whether the operation runs locally, in-cluster, or on a remote node.
|
||||
|
||||
## Why
|
||||
|
||||
The current control channel (ADR-018) is unidirectional (client → server) and
|
||||
@@ -21,6 +26,10 @@ The call protocol generalizes it to support bidirectional calls (ADR-024) and
|
||||
downstream service registration (ADR-025), enabling the head/worker model where
|
||||
workers expose operations the head invokes.
|
||||
|
||||
Without OperationEnv, handlers calling other operations would need to know
|
||||
whether the target is local, in-cluster, or on a remote node. OperationEnv
|
||||
abstracts this away — one handler-facing API, three dispatch backends (ADR-033).
|
||||
|
||||
## Architecture
|
||||
|
||||
### Operation Paths
|
||||
@@ -316,6 +325,101 @@ that carries `EventEnvelope` frames:
|
||||
The framing is always: 4-byte BE length prefix + JSON. The envelope shape is
|
||||
the same regardless of transport.
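
The framing rule above (4-byte big-endian length prefix followed by JSON) can be sketched with the standard library alone. This is an illustrative sketch, not the alknet-core implementation: the function names `write_frame` and `read_frame` and the `max_len` guard are assumptions, and the payload is treated as opaque bytes rather than parsed JSON.

```rust
use std::io::{self, Read, Write};

/// Write one frame: a 4-byte big-endian length prefix, then the JSON bytes.
/// (Hypothetical helper name; the spec only fixes the wire layout.)
pub fn write_frame<W: Write>(out: &mut W, json: &[u8]) -> io::Result<()> {
    if json.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
    }
    out.write_all(&(json.len() as u32).to_be_bytes())?;
    out.write_all(json)
}

/// Read one frame and return its payload bytes.
/// `max_len` is an assumed safety limit, not something the spec defines.
pub fn read_frame<R: Read>(input: &mut R, max_len: usize) -> io::Result<Vec<u8>> {
    let mut prefix = [0u8; 4];
    input.read_exact(&mut prefix)?;
    let len = u32::from_be_bytes(prefix) as usize;
    if len > max_len {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "frame exceeds max_len"));
    }
    let mut body = vec![0u8; len];
    input.read_exact(&mut body)?;
    Ok(body)
}
```

Because both helpers are generic over `Read` and `Write`, the same code would apply to any byte stream, which matches the statement that the envelope shape does not depend on the transport.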
|
||||
|
||||
### OperationEnv — Universal Composition Mechanism
|
||||
|
||||
OperationEnv provides the handler-facing API for composing operations. A handler
|
||||
receives `context.env.invoke(namespace, operation, input)` and gets back a
|
||||
`ResponseEnvelope` — regardless of which dispatch path the operation takes
|
||||
(ADR-033).
|
||||
|
||||
Three dispatch paths, one API:
|
||||
|
||||
| Path | Mechanism | Serialization | Scope |
|------|-----------|---------------|-------|
| **Local** | Direct function call through registry | None (in-process) | Same process |
| **Service** | irpc protocol enum dispatch | postcard (binary) | Same cluster |
| **Remote** | Call protocol `EventEnvelope` | JSON | Cross-node |
|
||||
|
||||
All three produce the same `ResponseEnvelope`. Service assembly determines
|
||||
which path each operation uses:
|
||||
|
||||
```rust
// Minimal deployment (Phase 1: single node, all local)
let env = OperationEnv::local(local_registry);

// Production deployment (Phase 2+: mix of local and remote)
let env = OperationEnv::new()
    .local("auth", auth_registry)
    .local("config", config_registry)
    .service("secrets", secret_irpc_client)
    .remote("worker-1", call_protocol_conn);
```
|
||||
|
||||
**Phase boundary**: Phase 1 ships with local dispatch only (direct function
|
||||
calls through the operation registry). The irpc service dispatch and remote
|
||||
dispatch paths are contracted here but not built yet. irpc service protocols
|
||||
(`AuthProtocol`, `SecretProtocol`, etc.) are defined in the specs but the
|
||||
implementations are Phase 2+ work.
|
||||
|
||||
**irpc is one dispatch backend for OperationEnv, not a replacement for the
|
||||
call protocol or for OperationEnv.** A call protocol handler can call an irpc
|
||||
service internally (e.g., `/head/auth/verify` calls
|
||||
`AuthProtocol::VerifyPubkey`) — the layers compose. irpc is behind a feature
|
||||
flag in alknet-core. See [services.md](services.md) for full OperationEnv and
|
||||
irpc service details.
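
One way to picture "one API, several dispatch backends" is a namespace-to-backend map behind a single `invoke` call. The sketch below is an assumption-laden illustration, not the alknet-core design: the `Dispatch` trait, `LocalRegistry`, `with_backend`, the `req-N` request IDs, and the error codes are all hypothetical, `Value` is reduced to a `String`, and only the local path is implemented (matching the Phase 1 boundary).

```rust
use std::cell::Cell;
use std::collections::HashMap;

// Stand-in for a JSON value; the spec uses `Value`.
type Value = String;

#[derive(Debug, Clone, PartialEq)]
pub struct CallError { pub code: String, pub message: String, pub retryable: bool }

#[derive(Debug, Clone, PartialEq)]
pub struct ResponseEnvelope { pub request_id: String, pub result: Result<Value, CallError> }

/// Hypothetical backend trait: one implementation per dispatch path.
pub trait Dispatch {
    fn dispatch(&self, request_id: &str, operation: &str, input: Value) -> ResponseEnvelope;
}

/// Local path: a direct function call through an in-process registry.
pub struct LocalRegistry {
    handlers: HashMap<String, Box<dyn Fn(Value) -> Result<Value, CallError>>>,
}

impl LocalRegistry {
    pub fn new() -> Self { LocalRegistry { handlers: HashMap::new() } }

    pub fn register<F>(mut self, operation: &str, handler: F) -> Self
    where F: Fn(Value) -> Result<Value, CallError> + 'static {
        self.handlers.insert(operation.to_string(), Box::new(handler));
        self
    }
}

fn error(code: &str, message: String) -> CallError {
    CallError { code: code.to_string(), message, retryable: false }
}

impl Dispatch for LocalRegistry {
    fn dispatch(&self, request_id: &str, operation: &str, input: Value) -> ResponseEnvelope {
        let result = match self.handlers.get(operation) {
            Some(handler) => handler(input),
            None => Err(error("unknown_operation", format!("no operation {}", operation))),
        };
        ResponseEnvelope { request_id: request_id.to_string(), result }
    }
}

/// Handler-facing API: callers name a namespace and never see the backend type.
pub struct OperationEnv { backends: HashMap<String, Box<dyn Dispatch>>, next_id: Cell<u64> }

impl OperationEnv {
    pub fn new() -> Self { OperationEnv { backends: HashMap::new(), next_id: Cell::new(0) } }

    pub fn with_backend(mut self, namespace: &str, backend: Box<dyn Dispatch>) -> Self {
        self.backends.insert(namespace.to_string(), backend);
        self
    }

    pub fn invoke(&self, namespace: &str, operation: &str, input: Value) -> ResponseEnvelope {
        let id = self.next_id.get() + 1;
        self.next_id.set(id);
        let request_id = format!("req-{}", id);
        match self.backends.get(namespace) {
            Some(backend) => backend.dispatch(&request_id, operation, input),
            None => ResponseEnvelope {
                request_id,
                result: Err(error("unknown_namespace", format!("no namespace {}", namespace))),
            },
        }
    }
}

/// Usage: register one local operation and invoke it through the env.
pub fn demo() -> ResponseEnvelope {
    let auth = LocalRegistry::new().register("verify", |input| Ok(format!("verified:{}", input)));
    let env = OperationEnv::new().with_backend("auth", Box::new(auth));
    env.invoke("auth", "verify", "alice".to_string())
}
```

Under this reading, an irpc or remote backend would be another `Dispatch` implementation registered under its namespace, and handler code calling `invoke` would not change.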
|
||||
|
||||
### OperationContext
|
||||
|
||||
Every handler receives an `OperationContext`:
|
||||
|
||||
```rust
pub struct OperationContext {
    pub request_id: String,
    pub parent_request_id: Option<String>,
    pub identity: Option<Identity>,
    pub metadata: HashMap<String, Value>,
    pub env: OperationEnv,
    pub trusted: bool, // set by buildEnv(), not by callers
}
```
|
||||
|
||||
- **`identity`**: The authenticated identity making the call. Populated by
|
||||
`IdentityProvider` from the interface layer ([identity.md](identity.md)).
|
||||
- **`env`**: The operation environment — namespaced access to other operations.
|
||||
- **`trusted`**: When a handler calls another operation through `env`, the
  nested call is marked `trusted` and skips ACL checks. This avoids checking
  the same request twice: if the caller is allowed to invoke
  `/head/agent/chat`, and that handler internally calls `/head/auth/verify`,
  the nested call is not checked against the ACL again.
|
||||
|
||||
Handler signature:
|
||||
|
||||
```rust
|
||||
fn handle(input: Value, context: OperationContext) -> ResponseEnvelope;
|
||||
```
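
The `trusted` behavior described above can be sketched as a small access check. Everything here is a simplified assumption for illustration: the `Identity` fields, the `check_access` and `nested_context` helpers, and the scope string are hypothetical, and the context is cut down to the two fields that matter for the check.

```rust
use std::collections::HashSet;

#[derive(Debug, Clone)]
pub struct Identity { pub id: String, pub scopes: HashSet<String> }

/// Reduced context: only the fields the ACL decision needs.
pub struct OperationContext { pub identity: Option<Identity>, pub trusted: bool }

#[derive(Debug, PartialEq)]
pub enum AclDecision { Allowed, Denied }

/// Trusted (nested) calls skip the ACL; external calls need every required scope.
pub fn check_access(context: &OperationContext, required_scopes: &[&str]) -> AclDecision {
    if context.trusted {
        return AclDecision::Allowed;
    }
    match &context.identity {
        Some(identity)
            if required_scopes.iter().all(|scope| identity.scopes.contains(*scope)) =>
        {
            AclDecision::Allowed
        }
        _ => AclDecision::Denied,
    }
}

/// Hypothetical counterpart of `buildEnv()`: the env, not the caller, sets `trusted`.
pub fn nested_context(parent: &OperationContext) -> OperationContext {
    OperationContext { identity: parent.identity.clone(), trusted: true }
}
```

The point of the design, as the spec describes it, is that the ACL decision is made once at the entry operation; the flag is only ever set when the env builds the nested context.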
|
||||
|
||||
### ResponseEnvelope
|
||||
|
||||
The universal return type from all three dispatch paths:
|
||||
|
||||
```rust
pub struct ResponseEnvelope {
    pub request_id: String,
    pub result: Result<Value, CallError>,
}

pub struct CallError {
    pub code: String,
    pub message: String,
    pub retryable: bool,
}
```
|
||||
|
||||
Local dispatch produces `ResponseEnvelope` with no serialization. irpc service
|
||||
dispatch produces postcard-encoded results that are decoded into
|
||||
`ResponseEnvelope`. Remote dispatch receives `call.responded` EventEnvelope
|
||||
frames and maps them to `ResponseEnvelope`. The handler always gets the same
|
||||
type back.
|
||||
|
||||
### Relationship to @alkdev/pubsub and @alkdev/operations
|
||||
|
||||
The call protocol in core is a Rust reimplementation of the same protocol
|
||||
@@ -335,11 +439,11 @@ through core, out over SSH channel, into a JavaScript pubsub adapter, and
|
||||
be dispatched through `@alkdev/operations`'s call handler** — with zero
|
||||
translation at the wire level.
|
||||
|
||||
### Agent Service Pattern (Future)
|
||||
### Agent Service Pattern (Downstream Application Concern)
|
||||
|
||||
An agent service — coordinating between LLM providers and tool calls — is a
|
||||
primary use case for the call protocol. It would be just another set of
|
||||
registered operations with no special treatment:
|
||||
primary downstream use case for the call protocol. It would be just another set
|
||||
of registered operations with no special treatment:
|
||||
|
||||
- `/head/agent/chat` — send a message, get a completion. Routes to the
|
||||
appropriate LLM provider based on available workers and configuration.
|
||||
@@ -348,12 +452,10 @@ registered operations with no special treatment:
|
||||
durable storage).
|
||||
- `/head/sessions/history` — retrieve a specific session's message history.
|
||||
|
||||
The agent service would use the same call protocol to invoke tools on workers
|
||||
(e.g., `/dev1/fs/readFile` for file access, `/dev1/bash/exec` for shell
|
||||
commands). This is a **downstream application concern**, not a core
|
||||
requirement. The call protocol enables it by providing the universal composition
|
||||
mechanism (OperationEnv, ADR-033), but the agent service itself is built on
|
||||
top, not into the core.
|
||||
The agent service uses OperationEnv to invoke tools on workers. **This is a
|
||||
downstream application concern, not a core requirement.** The call protocol
|
||||
enables it by providing the universal composition mechanism (ADR-033), but the
|
||||
agent service itself is built on top, not into the core.
|
||||
|
||||
## Constraints
|
||||
|
||||
@@ -370,6 +472,16 @@ top, not into the core.
|
||||
boundary. ACL is enforced at the `AccessControl` level, not by path prefix
|
||||
alone. A worker that exposes `/dev1/bash/exec` can restrict access via
|
||||
`required_scopes` — not every authenticated identity should have shell access.
|
||||
- **OperationEnv composition model matches the `@alkdev/operations` behavioral
|
||||
contract**: namespace + operation name → invoke with input, return output.
|
||||
The Rust implementation may differ in structure but must preserve this
|
||||
contract (ADR-033).
|
||||
- **irpc is explicitly positioned as one dispatch backend for OperationEnv**
|
||||
(ADR-033, ADR-028). It is not a replacement for the call protocol or for
|
||||
OperationEnv.
|
||||
- **Phase 1 is local dispatch only.** irpc service dispatch and remote dispatch
|
||||
are contracted in this spec but not built yet. The `OperationEnv::local()`
|
||||
path is the Phase 1 implementation.
|
||||
|
||||
## Open Questions
|
||||
|
||||
@@ -378,9 +490,13 @@ top, not into the core.
|
||||
disconnect, or heartbeat-based discovery? See
|
||||
[open-questions.md](open-questions.md).
|
||||
|
||||
- **OQ-22**: Should the call protocol support streaming inputs (client streaming
|
||||
in gRPC terms), or is client→server always a single request payload with
|
||||
streaming only server→client? See [open-questions.md](open-questions.md).
|
||||
- **OQ-22**: ~~Should the call protocol support streaming inputs (client streaming
|
||||
in gRPC terms)?~~ Resolved — deferred. Current model covers all identified use
|
||||
cases. See [open-questions.md](open-questions.md).
|
||||
|
||||
- **OQ-IF-01**: How does the `Interface` session type relate to the call
|
||||
protocol's `EventEnvelope` stream? This needs design during Phase 1.8
|
||||
implementation. See [open-questions.md](open-questions.md).
|
||||
|
||||
## Design Decisions
|
||||
|
||||
@@ -389,6 +505,8 @@ top, not into the core.
|
||||
| [018](decisions/018-control-channel-for-pubsub.md) | Control channel for pubsub | Reserved destination for event bus |
|
||||
| [024](decisions/024-bidirectional-call-protocol.md) | Bidirectional call protocol | Generalizes ADR-018, both sides can call |
|
||||
| [025](decisions/025-handler-spec-separation.md) | Handler/spec separation | Downstream registers operations without modifying core |
|
||||
| [028](decisions/028-auth-irpc-service.md) | Auth as irpc service | irpc is one dispatch backend for OperationEnv |
|
||||
| [033](decisions/033-operationenv-irpc-call-protocol.md) | OperationEnv | Universal composition with three dispatch paths |
|
||||
|
||||
## References
|
||||
|
||||
@@ -396,7 +514,10 @@ top, not into the core.
|
||||
- [napi-and-pubsub.md](napi-and-pubsub.md) — NAPI wrapper and pubsub adapter
|
||||
- [server.md](server.md) — Channel handling and control channel routing
|
||||
- [transport.md](transport.md) — Transport abstraction
|
||||
- [configuration.md](../research/configuration.md) — ForwardingPolicy, service metadata
|
||||
- [identity.md](identity.md) — Identity struct, IdentityProvider trait
|
||||
- [interface.md](interface.md) — Interface layer, EventEnvelope stream from interfaces
|
||||
- [configuration.md](configuration.md) — ForwardingPolicy, service metadata
|
||||
- [services.md](services.md) — OperationEnv, OperationContext, irpc service layer
|
||||
- `@alkdev/pubsub` — TypeScript event target adapters and `EventEnvelope`
|
||||
- `@alkdev/operations` — TypeScript call protocol, `OperationSpec`, registry
|
||||
- `@alkdev/storage` — `peer_credentials` table, ACL graph, `Identity`
|
||||
|
||||
@@ -69,6 +69,39 @@ impl ConfigReloadHandle {
|
||||
|
||||
Obtained from `Server::run()`. Passed to NAPI or CLI for explicit reload.
|
||||
|
||||
### ConfigServiceImpl
|
||||
|
||||
The Phase 1 implementation of config service logic, backed by
|
||||
`ArcSwap<DynamicConfig>`. Where `ConfigIdentityProvider` wraps the auth section
|
||||
of `DynamicConfig`, `ConfigServiceImpl` wraps the forwarding and rate-limit
|
||||
sections. Both are ArcSwap-backed and share the same `DynamicConfig` instance.
|
||||
|
||||
```rust
pub struct ConfigServiceImpl {
    dynamic: Arc<ArcSwap<DynamicConfig>>,
}

impl ConfigServiceImpl {
    pub fn forwarding_policy(&self) -> Arc<ForwardingPolicy> {
        self.dynamic.load().forwarding.clone()
    }

    pub fn rate_limits(&self) -> Arc<RateLimitConfig> {
        self.dynamic.load().rate_limits.clone()
    }

    pub fn reload(&self, new_config: DynamicConfig) {
        self.dynamic.store(Arc::new(new_config));
    }
}
```
|
||||
|
||||
Phase 1 deploys `ConfigServiceImpl` directly — no irpc service boundary. The
|
||||
`ConfigProtocol` irpc service (behind feature flag) wraps `ConfigServiceImpl`
|
||||
for production deployments that use the service layer. This mirrors the
|
||||
`ConfigIdentityProvider` / `AuthProtocol` pattern from [identity.md](identity.md)
|
||||
and ADR-028.
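
The snapshot semantics of a reload can be illustrated with the standard library only. In this sketch `RwLock<Arc<DynamicConfig>>` is a stand-in for `ArcSwap<DynamicConfig>` (it takes a lock where ArcSwap is lock-free), and the fields `default_allow` and `max_connections_per_ip` are hypothetical placeholders for the real config contents.

```rust
use std::sync::{Arc, RwLock};

// Placeholder config sections; the field names are assumptions.
pub struct ForwardingPolicy { pub default_allow: bool }
pub struct RateLimitConfig { pub max_connections_per_ip: u32 }

pub struct DynamicConfig {
    pub forwarding: Arc<ForwardingPolicy>,
    pub rate_limits: Arc<RateLimitConfig>,
}

pub struct ConfigServiceImpl {
    // Stand-in for Arc<ArcSwap<DynamicConfig>>.
    dynamic: Arc<RwLock<Arc<DynamicConfig>>>,
}

impl ConfigServiceImpl {
    pub fn new(initial: DynamicConfig) -> Self {
        ConfigServiceImpl { dynamic: Arc::new(RwLock::new(Arc::new(initial))) }
    }

    /// Returns a cheap clone of the current section; callers keep it as a snapshot.
    pub fn forwarding_policy(&self) -> Arc<ForwardingPolicy> {
        self.dynamic.read().unwrap().forwarding.clone()
    }

    pub fn rate_limits(&self) -> Arc<RateLimitConfig> {
        self.dynamic.read().unwrap().rate_limits.clone()
    }

    /// Swaps in a whole new config; snapshots handed out earlier are unaffected.
    pub fn reload(&self, new_config: DynamicConfig) {
        *self.dynamic.write().unwrap() = Arc::new(new_config);
    }
}
```

A connection that obtained its policy before a reload keeps evaluating against that snapshot, while new lookups see the replacement; this is the property the ArcSwap-backed version is expected to provide as well.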
|
||||
|
||||
### ConfigService irpc Service
|
||||
|
||||
```rust
|
||||
@@ -155,7 +188,7 @@ iroh_relay = "https://relay.alk.dev"
|
||||
| Interface | Static config | Dynamic config | Reload mechanism |
|
||||
|-----------|--------------|----------------|------------------|
|
||||
| CLI | Flags + optional `--config` file | Loaded at startup from `--authorized-keys` | None (restart to change) |
|
||||
| Core Rust | `StaticConfig` struct | `AuthService` (irpc) or `ArcSwap<DynamicConfig>` (minimal) | `ConfigService::reload()` or `ConfigReloadHandle::reload()` |
|
||||
| Core Rust | `StaticConfig` struct | `AuthProtocol` (irpc) or `ConfigIdentityProvider` (ArcSwap) | `ConfigProtocol::ReloadDynamicConfig` or `ConfigReloadHandle::reload()` |
|
||||
| NAPI | `serve()` options | Same | `server.reloadAuth()`, `server.reloadForwarding()` |
|
||||
|
||||
## Constraints
|
||||
|
||||
@@ -23,4 +23,4 @@ This makes adding a new transport (e.g., WebSocket, QUIC directly) a matter of i
|
||||
|
||||
## References
|
||||
- [transport.md](../transport.md)
|
||||
- [Feasibility assessment §3](../../../../conversations/research/ssh-tunnel-vpn-alternative-feasibility.md)
|
||||
- [Feasibility assessment §3](../../research/feasibility/ssh-tunnel-vpn-alternative-feasibility.md)
|
||||
@@ -28,4 +28,4 @@ Option 3 was rejected because it would require modifying russh to understand iro
|
||||
|
||||
## References
|
||||
- [transport.md](../transport.md)
|
||||
- [Feasibility assessment §11](../../../../conversations/research/ssh-tunnel-vpn-alternative-feasibility.md)
|
||||
- [Feasibility assessment §11](../../research/feasibility/ssh-tunnel-vpn-alternative-feasibility.md)
|
||||
@@ -25,4 +25,4 @@ This is directly enabled by russh's `connect_stream()` and `run_stream()` APIs,
|
||||
|
||||
## References
|
||||
- [transport.md](../transport.md)
|
||||
- [Feasibility assessment §3.4](../../../../conversations/research/ssh-tunnel-vpn-alternative-feasibility.md)
|
||||
- [Feasibility assessment §3.4](../../research/feasibility/ssh-tunnel-vpn-alternative-feasibility.md)
|
||||
@@ -4,7 +4,7 @@
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
TLS transport mode requires certificates. Manual certificate management is error-prone — users need to obtain, install, and renew certificates. Our production setup uses certbot with Let's Encrypt (documented in `/workspace/system/dev1/certbot.md`), which automates this via the ACME protocol.
|
||||
TLS transport mode requires certificates. Manual certificate management is error-prone — users need to obtain, install, and renew certificates. Our production setup uses certbot with Let's Encrypt (documented in [certbot.md](../../research/ops/certbot.md)), which automates this via the ACME protocol.
|
||||
|
||||
There are two ACME flows:
|
||||
1. **Domain-based**: Standard flow with DNS-01 or HTTP-01 challenge. Certificate is tied to a domain name, auto-renews via certbot/systemd timer. Requires port 80 or DNS access for challenges.
|
||||
@@ -35,4 +35,4 @@ The implementation should use the `rustls-acme` crate (or similar pure-Rust ACME
|
||||
- [server.md](../server.md)
|
||||
- [OQ-01](../open-questions.md) — resolved by this ADR
|
||||
- [OQ-07](../open-questions.md) — resolved by this ADR
|
||||
- Production certbot setup: `/workspace/system/dev1/certbot.md`
|
||||
- Production certbot setup: [certbot.md](../../research/ops/certbot.md)
|
||||
@@ -4,7 +4,7 @@
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
The server needs to handle abuse on public-facing deployments. Our production infrastructure uses fail2ban on Linux (documented in `/workspace/system/dev1/fail2ban.md`) with nftables and systemd journal backend. fail2ban needs structured, parseable logs to identify abusive IP addresses.
|
||||
The server needs to handle abuse on public-facing deployments. Our production infrastructure uses fail2ban on Linux (documented in [fail2ban.md](../../research/ops/fail2ban.md)) with nftables and systemd journal backend. fail2ban needs structured, parseable logs to identify abusive IP addresses.
|
||||
|
||||
However, fail2ban is Linux-specific. On other platforms (macOS, Windows, BSD), users need a different approach to reject abusive connections. The server should provide enough logging for fail2ban on Linux and enough built-in protection for other platforms.
|
||||
|
||||
@@ -36,4 +36,4 @@ This ensures that even without fail2ban, the server rejects obviously abusive co
|
||||
## References
|
||||
- [server.md](../server.md)
|
||||
- [OQ-08](../open-questions.md) — resolved by this ADR
|
||||
- Production fail2ban setup: `/workspace/system/dev1/fail2ban.md`
|
||||
- Production fail2ban setup: [fail2ban.md](../../research/ops/fail2ban.md)
|
||||
@@ -64,17 +64,30 @@ format, but not as a crate dependency.
|
||||
### Dependency Graph
|
||||
|
||||
```
|
||||
alknet-secret
|
||||
/ \
|
||||
/ \
|
||||
alknet-core ←──── ←── alknet-storage
|
||||
↑ \ /
|
||||
│ alknet-flowgraph
|
||||
│
|
||||
alknet-napi
|
||||
alknet (CLI binary — assembles everything)
|
||||
alknet-secret alknet-storage alknet-flowgraph
|
||||
(standalone) (standalone) (standalone)
|
||||
│ │ │
|
||||
│ (feature flags │ (trait impl │ (type compat
|
||||
│ in CLI binary) │ via CLI wire) │ via JSON)
|
||||
▼ ▼ ▼
|
||||
┌─────────────────────┐
|
||||
│ alknet-core │
|
||||
│ (transport, SSH, │
|
||||
│ call protocol, │
|
||||
│ Identity, Config) │
|
||||
└─────────┬───────────┘
|
||||
│
|
||||
┌────────────┼────────────┐
|
||||
▼ ▼ ▼
|
||||
alknet-napi alknet (CLI binary — assembles everything)
|
||||
```
|
||||
|
||||
All four library crates (core, secret, storage, flowgraph) are independent of
|
||||
each other. Dependencies flow **upward** only. The CLI binary sits at the top
|
||||
and wires concrete implementations together. alknet-storage implements
|
||||
alknet-core's `IdentityProvider` trait without a crate dependency — the CLI
|
||||
binary provides the bridge.
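
One plausible reading of "the CLI binary provides the bridge" is an adapter type that lives in the binary and implements the core trait by delegating to storage. The sketch below uses modules as stand-ins for the separate crates; the `lookup` method, the `PeerCredentials` type, and the `StorageIdentityProvider` wrapper are hypothetical names, not the actual crate APIs.

```rust
// Stand-in for the alknet-core crate: owns the trait and the Identity type.
mod alknet_core {
    #[derive(Debug, Clone, PartialEq)]
    pub struct Identity { pub id: String }

    pub trait IdentityProvider {
        fn lookup(&self, fingerprint: &str) -> Option<Identity>;
    }
}

// Stand-in for the alknet-storage crate: it never names alknet_core.
mod alknet_storage {
    use std::collections::HashMap;

    pub struct PeerCredentials { by_fingerprint: HashMap<String, String> }

    impl PeerCredentials {
        pub fn new() -> Self { PeerCredentials { by_fingerprint: HashMap::new() } }

        pub fn insert(&mut self, fingerprint: &str, peer_id: &str) {
            self.by_fingerprint.insert(fingerprint.to_string(), peer_id.to_string());
        }

        pub fn peer_id_for(&self, fingerprint: &str) -> Option<&str> {
            self.by_fingerprint.get(fingerprint).map(|s| s.as_str())
        }
    }
}

// Stand-in for the CLI binary: the only place that sees both crates.
mod cli {
    use super::alknet_core::{Identity, IdentityProvider};
    use super::alknet_storage::PeerCredentials;

    /// Newtype adapter; across real crates the orphan rule requires a local type here.
    pub struct StorageIdentityProvider(pub PeerCredentials);

    impl IdentityProvider for StorageIdentityProvider {
        fn lookup(&self, fingerprint: &str) -> Option<Identity> {
            self.0.peer_id_for(fingerprint).map(|id| Identity { id: id.to_string() })
        }
    }
}

use self::alknet_core::IdentityProvider;

/// Usage: wire storage into the core trait at assembly time.
pub fn demo() -> Option<String> {
    let mut store = alknet_storage::PeerCredentials::new();
    store.insert("SHA256:abc", "peer-alice");
    let provider = cli::StorageIdentityProvider(store);
    provider.lookup("SHA256:abc").map(|identity| identity.id)
}
```

With this shape, neither library depends on the other, and swapping the storage backend only touches the adapter in the binary.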
|
||||
|
||||
### Narrow Interface Points
|
||||
|
||||
Three types serve as the narrow interface points between crates:
|
||||
@@ -147,4 +160,5 @@ alknet-storage does NOT depend on alknet-secret as a crate. Instead:
|
||||
- [research/services.md](../../research/services.md) — Service protocols
|
||||
- [research/storage.md](../../research/storage.md) — alknet-storage contents
|
||||
- [research/flow.md](../../research/flow.md) — alknet-flowgraph contents
|
||||
- [ADR-028](028-auth-irpc-service.md) — Auth as irpc service (service protocol enabled by decomposition)
|
||||
- [ADR-029](029-identity-core-type.md) — Identity as core type (narrow interface point)
|
||||
@@ -93,4 +93,4 @@ propagate beyond the service boundary without projection.
|
||||
- [research/services.md](../../research/services.md) — Event boundary discipline section
|
||||
- [research/storage.md](../../research/storage.md) — Honker integration, event boundaries
|
||||
- [research/integration-plan.md](../../research/integration-plan.md) — ADR 032 entry
|
||||
- [event_source_types.md](/workspace/research/event_sourcing/event_source_types.md) — Event-driven architecture patterns
|
||||
- [event_source_types.md](../../research/event-sourcing/event_source_types.md) — Event-driven architecture patterns
|
||||
@@ -125,6 +125,8 @@ operations universally composable across all interfaces.
|
||||
|
||||
- [research/services.md](../../research/services.md) — OperationContext, OperationEnv
|
||||
- [research/integration-plan.md](../../research/integration-plan.md) — Phase 1.5, OperationEnv wiring
|
||||
- [ADR-026](026-transport-interface-separation.md) — Three-layer model (OperationEnv is Layer 3)
|
||||
- [ADR-028](028-auth-irpc-service.md) — Auth as irpc service (one dispatch backend)
|
||||
- [ADR-032](032-event-boundary-discipline.md) — Event boundary discipline
|
||||
- [ADR-024](024-bidirectional-call-protocol.md) — Bidirectional call protocol
|
||||
- [ADR-025](025-handler-spec-separation.md) — Handler/spec separation
|
||||
@@ -1,6 +1,6 @@
|
||||
---
|
||||
status: reviewed
|
||||
last_updated: 2026-06-02
|
||||
last_updated: 2026-06-07
|
||||
---
|
||||
|
||||
# NAPI Wrapper & PubSub Event Target
|
||||
@@ -71,11 +71,36 @@ function serve(options: AlknetServeOptions): Promise<AlknetServer>;
|
||||
interface AlknetServer {
|
||||
close(): Promise<void>;
|
||||
onConnection(callback: (stream: Duplex, info: ConnectionInfo) => void): void;
|
||||
// Dynamic config reload (ADR-030)
|
||||
reloadAuth(auth: { authorizedKeys?: Buffer, certAuthority?: Buffer }): void;
|
||||
reloadForwarding(policy: ForwardingPolicyConfig): void;
|
||||
reloadAll(config: DynamicConfig): void;
|
||||
}
|
||||
|
||||
interface ForwardingPolicyConfig {
|
||||
default: 'allow' | 'deny';
|
||||
rules: ForwardingRuleConfig[];
|
||||
}
|
||||
|
||||
interface ForwardingRuleConfig {
|
||||
target: string; // "localhost:*", "10.0.0.0/8:80", "alknet-*"
|
||||
action: 'allow' | 'deny';
|
||||
principals?: string[]; // default ["*"]
|
||||
}
|
||||
```
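
How a policy of this shape might be evaluated can be sketched in Rust. The matching semantics below are assumptions, since the spec excerpt does not define them: rules are checked in order with the first match winning, a pattern is either an exact string or a prefix ending in `*`, and CIDR targets such as `10.0.0.0/8:80` are not handled at all.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action { Allow, Deny }

pub struct ForwardingRule {
    pub target: String,
    pub action: Action,
    pub principals: Vec<String>, // the TypeScript config defaults this to ["*"]
}

pub struct ForwardingPolicy {
    pub default_action: Action,
    pub rules: Vec<ForwardingRule>,
}

/// Assumed matching: exact string, or a prefix when the pattern ends in `*`.
fn pattern_matches(pattern: &str, value: &str) -> bool {
    match pattern.strip_suffix('*') {
        Some(prefix) => value.starts_with(prefix),
        None => pattern == value,
    }
}

impl ForwardingPolicy {
    /// Assumed semantics: first matching rule wins, otherwise the default applies.
    pub fn decide(&self, principal: &str, target: &str) -> Action {
        for rule in &self.rules {
            let principal_ok = rule.principals.iter().any(|p| pattern_matches(p, principal));
            if principal_ok && pattern_matches(&rule.target, target) {
                return rule.action;
            }
        }
        self.default_action
    }
}

/// Usage: deny by default, allow any principal to reach localhost ports.
pub fn demo_policy() -> ForwardingPolicy {
    ForwardingPolicy {
        default_action: Action::Deny,
        rules: vec![ForwardingRule {
            target: "localhost:*".to_string(),
            action: Action::Allow,
            principals: vec!["*".to_string()],
        }],
    }
}
```

A deny-by-default policy with explicit allow rules keeps the failure mode conservative: a target that no rule mentions is rejected rather than forwarded.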
|
||||
|
||||
The NAPI layer is **transport-agnostic** — it doesn't know about pubsub's `EventEnvelope`. The pubsub adapter wraps the `Duplex` stream to implement `TypedEventTarget`. This separation ensures the NAPI wrapper is reusable for any stream-based protocol, not tied specifically to pubsub.
|
||||
|
||||
### NAPI Call Protocol Integration
|
||||
|
||||
NAPI consumers can register operation handlers to participate in the call protocol. The `Duplex` stream from `connect()` or `serve()` carries `EventEnvelope` frames (4-byte BE length prefix + JSON). A TypeScript consumer can implement a call protocol handler that reads these frames and dispatches to registered operations — the same wire protocol used by `@alkdev/operations`.
|
||||
|
||||
See [call-protocol.md](call-protocol.md) for the call protocol spec and [services.md](services.md) for OperationEnv and dispatch paths.
|
||||
|
||||
### NAPI irpc Service Creation
|
||||
|
||||
Behind the `irpc` feature flag, NAPI consumers can create irpc service instances for in-cluster communication. This is a Phase 2+ capability — Phase 1 uses `ConfigIdentityProvider` and direct `ConfigReloadHandle` calls. See [services.md](services.md) for the irpc service layer and ADR-027 for crate decomposition.
|
||||
|
||||
### NAPI `connect()` vs CLI `alknet connect`
|
||||
|
||||
The NAPI `connect()` function and the CLI `alknet connect` command are fundamentally different operations despite sharing the same name:
|
||||
@@ -155,3 +180,10 @@ None — all resolved.
|
||||
| [015](decisions/015-napi-rs-for-ffi-bridge.md) | napi-rs for FFI | Standard Node.js native addon tooling |
|
||||
| [016](decisions/016-napi-expose-connect-and-serve.md) | Both connect() and serve() | NAPI exposes client and server sides from the start |
|
||||
| [018](decisions/018-control-channel-for-pubsub.md) | Control channel for pubsub | Reserved `alknet-control` destination for event bus |
|
||||
| [030](decisions/030-static-dynamic-config-split.md) | Static/dynamic config split | NAPI reload methods for auth, forwarding, and all dynamic config |
|
||||
|
||||
## References
|
||||
|
||||
- [configuration.md](configuration.md) — DynamicConfig, ForwardingPolicy, reload mechanism
|
||||
- [services.md](services.md) — OperationEnv, irpc service layer
|
||||
- [call-protocol.md](call-protocol.md) — Call protocol wire format and operation registry
|
||||
@@ -105,7 +105,7 @@ last_updated: 2026-06-07
|
||||
- **Origin**: [research/configuration.md](../research/configuration.md)
|
||||
- **Status**: resolved
|
||||
- **Priority**: low
|
||||
- **Resolution**: No file watching. CLI loads once at startup; NAPI/hub reload explicitly. File watching is a potential attack vector and unnecessary complexity for a security tool.
|
||||
- **Resolution**: No file watching. CLI loads once at startup; NAPI/head reload explicitly. File watching is a potential attack vector and unnecessary complexity for a security tool.
|
||||
- **Cross-references**: configuration.md
|
||||
|
||||
### OQ-14: ArcSwap vs RwLock for dynamic config
|
||||
@@ -221,11 +221,18 @@ last_updated: 2026-06-07
|
||||
|
||||
### OQ-SVC-04: Should workers cache derived keys locally?
|
||||
- **Origin**: [secret-service.md](secret-service.md)
|
||||
- **Status**: open
|
||||
- **Priority**: low
|
||||
- **Status**: resolved
- **Priority**: low
|
||||
- **Resolution**: Yes, with a TTL (default: 1 hour). The head can revoke by invalidating the session.
|
||||
- **Cross-references**: [secret-service.md](secret-service.md)
|
||||
|
||||
### OQ-SVC-05: How does the NFT-based ACL smart contract interact with the secret service?
|
||||
- **Origin**: [storage.md](storage.md)
|
||||
- **Status**: open
|
||||
- **Priority**: low
|
||||
- **Resolution**: The Ethereum signing key (`m/44'/60'/0'/0/0`) is derived from the same seed as the secret service. The smart contract is a separate concern — it reads on-chain ACL state, it doesn't call the secret service.
|
||||
- **Cross-references**: [storage.md](storage.md), [secret-service.md](secret-service.md)
|
||||
|
||||
## Interface
|
||||
|
||||
### OQ-IF-01: How does the Interface session type relate to the call protocol's EventEnvelope stream?
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
---
|
||||
status: reviewed
|
||||
last_updated: 2026-06-02
|
||||
last_updated: 2026-06-07
|
||||
---
|
||||
|
||||
# Alknet Overview
|
||||
@@ -16,6 +16,64 @@ Alknet is a self-hostable SSH-based tunnel tool that provides VPN-like functiona
|
||||
|
||||
The core insight: SSH tunnels work because SSH is fundamental infrastructure. Blocking it breaks the internet. Alknet makes SSH tunneling accessible through a simple CLI with pluggable transports.
|
||||
|
||||
## Crate Structure
|
||||
|
||||
Alknet is decomposed into six crates with a strict acyclic dependency graph (ADR-027):
|
||||
|
||||
| Crate | Purpose | Exists Now? |
|-------|---------|-------------|
| **alknet-core** | Transport, SSH, call protocol, config, auth types, `OperationSpec`, `Interface` trait | Yes |
| **alknet-napi** | Node.js native addon via napi-rs | Yes |
| **alknet-secret** | BIP39, SLIP-0010 HD key derivation, AES-256-GCM, `SecretProtocol` irpc service | Phase 2+ |
| **alknet-storage** | SQLite-backed metagraph, identity tables, ACL graph, honker, `StorageProtocol` | Phase 2+ |
| **alknet-flowgraph** | `FlowGraph<N,E>` over petgraph, operation graph, call graph | Phase 2+ |
| **alknet** (CLI) | Binary that assembles everything with feature flags | Yes |
|
||||
|
||||
The four library crates (core, secret, storage, flowgraph) are independent of each other. Dependencies flow upward only: the CLI binary sits at the top and wires concrete implementations together. alknet-storage implements alknet-core's `IdentityProvider` trait without a crate dependency — the CLI binary provides the bridge.
|
||||
|
||||
irpc is behind a feature flag in alknet-core. Nodes that only do SSH tunneling don't need the service layer overhead.
|
||||
|
||||
## Three-Layer Model
|
||||
|
||||
Alknet uses a three-layer model (ADR-026):
|
||||
|
||||
| Layer | Responsibility | Examples |
|-------|---------------|----------|
| **Layer 1: Transport** | Produces byte streams (`AsyncRead + AsyncWrite + Unpin + Send`) | TCP, TLS, iroh, DNS (future), WebTransport (future) |
| **Layer 2: Interface** | Consumes a transport stream and produces call protocol sessions | SSH (handshake + auth + channel multiplexing), raw framing (length-prefix + JSON) |
| **Layer 3: Protocol** | Carries semantics — operation registry, service calls, events | Call protocol, OperationEnv, operation dispatch |
|
||||
|
||||
SSH is an interface, not a transport. The three-layer model enables DNS control channels (DNS transport + raw framing), local service mesh (TCP + raw framing), and browser direct call protocol (WebTransport + raw framing) without wrapping SSH inside those transports.
|
||||
|
||||
A connection is always a (Transport, Interface) pair. The protocol layer is agnostic to both.
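
To make the Layer 1 / Layer 2 split concrete, here is a minimal sketch of a raw-framing interface sitting on top of an arbitrary byte stream. Several things here are assumptions, not taken from the spec: the `ByteStream` and `RawFraming` names, the 4-byte big-endian length prefix, the 1 MiB frame cap, and the use of blocking `std::io` traits in place of the `AsyncRead + AsyncWrite + Unpin + Send` bound so the example needs no external crates.

```rust
use std::io::{self, Read, Write};

/// Layer 1 stand-in: any blocking byte stream. The spec's real bound is
/// `AsyncRead + AsyncWrite + Unpin + Send`; std::io is used only to keep
/// this sketch dependency-free.
pub trait ByteStream: Read + Write {}
impl<T: Read + Write> ByteStream for T {}

/// Assumed upper bound on a single frame (not from the spec).
const MAX_FRAME: usize = 1 << 20;

/// Hypothetical Layer 2 interface: length-prefixed frames whose payload
/// would be JSON in practice. It never inspects which transport it is on.
pub struct RawFraming<S: ByteStream> {
    stream: S,
}

impl<S: ByteStream> RawFraming<S> {
    pub fn new(stream: S) -> Self {
        RawFraming { stream }
    }

    /// Write one frame: 4-byte big-endian length, then the payload bytes.
    pub fn send(&mut self, payload: &[u8]) -> io::Result<()> {
        if payload.len() > MAX_FRAME {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "frame too large"));
        }
        let len = payload.len() as u32;
        self.stream.write_all(&len.to_be_bytes())?;
        self.stream.write_all(payload)
    }

    /// Read one frame, rejecting lengths above the assumed cap.
    pub fn recv(&mut self) -> io::Result<Vec<u8>> {
        let mut len_buf = [0u8; 4];
        self.stream.read_exact(&mut len_buf)?;
        let len = u32::from_be_bytes(len_buf) as usize;
        if len > MAX_FRAME {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "frame too large"));
        }
        let mut payload = vec![0u8; len];
        self.stream.read_exact(&mut payload)?;
        Ok(payload)
    }

    /// Hand the underlying transport stream back to the caller.
    pub fn into_inner(self) -> S {
        self.stream
    }
}
```

Because `RawFraming` is generic over the stream, the same framing code could in principle run over TCP, TLS, or an in-memory buffer, which is the property the (Transport, Interface) pairing is meant to provide.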
|
||||
|
||||
## Service Layer
|
||||
|
||||
The irpc service layer decomposes alknet's core responsibilities into independently testable, deployable, and replaceable components (ADR-033, [services.md](services.md)):
|
||||
|
||||
- **Auth** (`AuthProtocol`) — verify identities, check credentials
|
||||
- **Secret** (`SecretProtocol`) — derive keys, encrypt/decrypt
|
||||
- **Config** (`ConfigProtocol`) — dynamic config reload
|
||||
- **Storage** (`StorageProtocol`) — graph CRUD, metagraph operations
|
||||
|
||||
**OperationEnv** is the universal composition mechanism. A handler calls `context.env.invoke("secrets", "derive", input)` without knowing whether the dispatch is local (direct function call), in-cluster (irpc service), or cross-node (call protocol `EventEnvelope`). Three dispatch paths, one handler-facing API.
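
The "three dispatch paths, one handler-facing API" idea can be sketched roughly as follows. This is an illustration under stated assumptions rather than the spec's definition: the `Route` enum, the `register` method, the string-typed input and output, and the error type are all hypothetical, and only the local path is actually executed. The irpc and remote paths are stubbed out because their wire behavior is defined elsewhere.

```rust
use std::collections::HashMap;

/// Where a registered operation is served from. Only `Local` runs in this
/// sketch; the other two variants stand in for the irpc and call protocol
/// paths, which are not modeled here.
pub enum Route {
    Local(Box<dyn Fn(&str) -> String>),
    IrpcService { service: String },
    Remote { node: String },
}

/// Hypothetical handler-facing environment: one `invoke` call, three routes.
pub struct OperationEnv {
    routes: HashMap<(String, String), Route>,
}

impl OperationEnv {
    pub fn new() -> Self {
        OperationEnv { routes: HashMap::new() }
    }

    /// Bind a (namespace, operation) pair to a dispatch route.
    pub fn register(&mut self, namespace: &str, op: &str, route: Route) {
        self.routes
            .insert((namespace.to_string(), op.to_string()), route);
    }

    /// The caller never learns which route served the call.
    pub fn invoke(&self, namespace: &str, op: &str, input: &str) -> Result<String, String> {
        let key = (namespace.to_string(), op.to_string());
        match self.routes.get(&key) {
            Some(Route::Local(handler)) => Ok(handler(input)),
            Some(Route::IrpcService { service }) => {
                Err(format!("irpc dispatch to {} is not modeled", service))
            }
            Some(Route::Remote { node }) => {
                Err(format!("remote dispatch to {} is not modeled", node))
            }
            None => Err(format!("unknown operation {}/{}", namespace, op)),
        }
    }
}
```

A handler written against `invoke` would not need to change if `secrets/derive` later moved from a `Local` route to an irpc service; only the registration would.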
|
||||
|
||||
**Phase boundary**: Phase 1 ships `ConfigIdentityProvider` (ArcSwap-backed) and `ConfigServiceImpl` (ArcSwap-backed) as the only auth and config implementations. The irpc service protocols (`AuthProtocol`, `SecretProtocol`, etc.) and the production deployment topology (multi-node with `StorageIdentityProvider`) are contracted in the specs but will be implemented in Phase 2+. Application services (DockerService, NodeService, agent services) are downstream concerns that build on top of the call protocol and OperationEnv.
|
||||
|
||||
## Identity
|
||||
|
||||
`Identity` struct and `IdentityProvider` trait are core types in alknet-core (ADR-029, [identity.md](identity.md)):
|
||||
|
||||
```rust
use std::collections::HashMap;

pub struct Identity {
    pub id: String,                              // Fingerprint (config auth) or account UUID (database auth)
    pub scopes: Vec<String>,                     // Authorization scope strings
    pub resources: HashMap<String, Vec<String>>, // Resource-level authorization
}
```
|
||||
|
||||
`IdentityProvider` decouples alknet-core from identity storage. Phase 1 ships `ConfigIdentityProvider` (reads from `ArcSwap<DynamicConfig.auth>`). `StorageIdentityProvider` (Phase 2+, backed by SQLite) replaces it for production deployments. Both produce the same `Identity` result.
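
As a rough sketch of how a config-backed provider might satisfy the trait: the method name `resolve_from_fingerprint` appears in the server spec, but its signature, the `Option<Identity>` return type, and the fields and constructor of `ConfigIdentityProvider` below are assumptions. A plain `Vec` stands in for the `ArcSwap`-backed config, and the token path is omitted.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
    pub resources: HashMap<String, Vec<String>>,
}

/// Assumed shape of the trait; only the fingerprint path is sketched.
pub trait IdentityProvider {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity>;
}

/// Hypothetical config-backed provider: every authorized key receives the
/// same default scope set, and no database is involved.
pub struct ConfigIdentityProvider {
    authorized_fingerprints: Vec<String>,
    default_scopes: Vec<String>,
}

impl ConfigIdentityProvider {
    pub fn new(authorized_fingerprints: Vec<String>, default_scopes: Vec<String>) -> Self {
        ConfigIdentityProvider {
            authorized_fingerprints,
            default_scopes,
        }
    }
}

impl IdentityProvider for ConfigIdentityProvider {
    fn resolve_from_fingerprint(&self, fingerprint: &str) -> Option<Identity> {
        // Plain equality is used for brevity; it is not a statement about
        // how the real implementation compares key material.
        let known = self
            .authorized_fingerprints
            .iter()
            .any(|f| f.as_str() == fingerprint);
        if known {
            Some(Identity {
                id: fingerprint.to_string(),
                scopes: self.default_scopes.clone(),
                resources: HashMap::new(),
            })
        } else {
            None
        }
    }
}
```

A storage-backed provider would implement the same trait and return the same `Identity` type, so callers holding a `&dyn IdentityProvider` would not need to distinguish between the two.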
|
||||
|
||||
## Exports
|
||||
|
||||
### Binary: `alknet`
|
||||
@@ -35,24 +93,40 @@ The `alknet-core` crate exports the pluggable components for embedding or progra
|
||||
- `TcpTransport` — direct TCP connection
|
||||
- `TlsTransport` — TCP + tokio-rustls TLS
|
||||
- `IrohTransport` — iroh QUIC P2P connection
|
||||
- `Interface` trait — consumes transport stream, produces call protocol session
|
||||
- `Socks5Server` — local SOCKS5 proxy that forwards through SSH channels
|
||||
- `PortForwarder` — manages local/remote port forwards
|
||||
- `ServerHandler` — russh server handler with configurable auth and channel policies
|
||||
- `ConnectOptions` / `ServeOptions` — programmatic configuration structs (no file parsing)
|
||||
- `Identity` / `IdentityProvider` — core identity types (ADR-029)
|
||||
- `OperationSpec` — operation registration for call protocol (ADR-025)
|
||||
- `ConnectOptions` / `ServeOptions` — programmatic configuration structs
|
||||
- `StaticConfig` / `DynamicConfig` — static/immutable vs. hot-reloadable config (ADR-030)
|
||||
- `ConfigReloadHandle` — programmatic reload of dynamic config
|
||||
|
||||
## Dependencies
|
||||
|
||||
| Dependency | Purpose | Feature-gated |
|------------|---------|---------------|
| `russh` | SSH client & server | No (core) |
| `tokio` | Async runtime | No (core) |
| `tokio-rustls` | TLS wrapping | Yes (`tls`) |
| `rustls` | TLS implementation | Yes (`tls`) |
| `rustls-acme` | ACME/Let's Encrypt auto-cert | Yes (`acme`) |
| `iroh` | P2P QUIC transport | Yes (`iroh`) |
| `clap` | CLI argument parsing | No (core) |
| `tracing` | Structured logging | No (core) |
| `anyhow` / `thiserror` | Error handling | No (core) |

| Dependency | Purpose | Crate | Feature-gated |
|------------|---------|-------|---------------|
| `russh` | SSH client & server | core | No (core) |
| `tokio` | Async runtime | core | No (core) |
| `tokio-rustls` | TLS wrapping | core | Yes (`tls`) |
| `rustls` | TLS implementation | core | Yes (`tls`) |
| `rustls-acme` | ACME/Let's Encrypt auto-cert | core | Yes (`acme`) |
| `iroh` | P2P QUIC transport | core | Yes (`iroh`) |
| `irpc` | Streaming RPC service layer | core | Yes (`irpc`) |
| `arc-swap` | Lock-free dynamic config | core | No (core) |
| `serde` | Serialization | core | No (core) |
| `clap` | CLI argument parsing | CLI | No (CLI) |
| `toml` | TOML config file | CLI | No (CLI) |
| `tracing` | Structured logging | core | No (core) |
| `anyhow` / `thiserror` | Error handling | core | No (core) |
| `bip39` | Mnemonic generation | secret | No (secret) |
| `ed25519-bip32` | HD key derivation | secret | No (secret) |
| `aes-gcm` | AES-256-GCM encryption | secret | No (secret) |
| `rusqlite` | SQLite (via honker) | storage | No (storage) |
| `honker` | Event-sourced storage | storage | No (storage) |
| `petgraph` | Graph data structure | storage, flowgraph | No |
| `jsonschema` | JSON Schema validation | storage, flowgraph | No |
|
||||
|
||||
> Note: `tun-rs` is no longer a dependency. TUN support is deferred in favor of the external `tun2proxy` tool (ADR-014).
|
||||
|
||||
@@ -60,19 +134,29 @@ The `alknet-core` crate exports the pluggable components for embedding or progra
|
||||
|
||||
1. **SSH runs over transport, not alongside** — The transport layer produces a single `AsyncRead+AsyncWrite+Unpin+Send` stream. SSH runs over that stream via `russh::client::connect_stream()` / `russh::server::run_stream()`. The SSH layer never knows what transport it's on. (ADR-001, ADR-004)
|
||||
|
||||
2. **SOCKS5 is the primary client interface** — Port forwarding is built on top of SOCKS5-like channel management. For VPN-like "route all traffic" behavior, users run `tun2proxy` alongside alknet's SOCKS5 proxy. TUN is not in the project scope. (ADR-005, ADR-014)
|
||||
2. **Three-layer model: Transport, Interface, Protocol** — SSH is an interface (Layer 2), not a transport (Layer 1). A connection is always a (Transport, Interface) pair. The call protocol (Layer 3) is agnostic to both. This enables DNS control channels, raw framing, and WebTransport direct call protocol without wrapping SSH inside those transports. (ADR-026)
|
||||
|
||||
3. **No logging of tunnel destinations** — The server logs auth attempts and connections (for fail2ban) but does not log `channel_open_direct_tcpip` destinations, DNS lookups, or bytes transferred. (ADR-006, ADR-013)
|
||||
3. **SOCKS5 is the primary client interface** — Port forwarding is built on top of SOCKS5-like channel management. For VPN-like "route all traffic" behavior, users run `tun2proxy` alongside alknet's SOCKS5 proxy. TUN is not in the project scope. (ADR-005, ADR-014)
|
||||
|
||||
4. **Programmatic-first API** — Configuration via CLI flags, library API structs (`ConnectOptions`, `ServeOptions`), and environment variables. No `~/.ssh/config` parsing, no custom config files. (ADR-011)
|
||||
4. **No logging of tunnel destinations** — The server logs auth attempts and connections (for fail2ban) but does not log `channel_open_direct_tcpip` destinations, DNS lookups, or bytes transferred. (ADR-006, ADR-013)
|
||||
|
||||
5. **Feature flags control transport inclusion** — `tls`, `iroh`, `acme` are feature-gated so the base install is lean. Users opt in to heavier dependencies.
|
||||
5. **Programmatic-first API** — Configuration via CLI flags, library API structs (`ConnectOptions`, `ServeOptions`), and environment variables. No `~/.ssh/config` parsing. Optional `--config` TOML file for reproducible deployments. (ADR-011, ADR-030)
|
||||
|
||||
6. **Authentication is key-based** — Ed25519 public key (default) and OpenSSH certificate authority. No password authentication over SSH. (ADR-012)
|
||||
6. **Feature flags control transport inclusion** — `tls`, `iroh`, `acme`, `irpc` are feature-gated so the base install is lean. Users opt in to heavier dependencies.
|
||||
|
||||
7. **NAPI exposes both connect() and serve()** — The napi-rs wrapper provides client and server functionality, using napi-rs as the FFI bridge. The NAPI layer is transport-agnostic and not tied to pubsub. (ADR-015, ADR-016)
|
||||
7. **Authentication is key-based and unified** — Ed25519 public key (default) and OpenSSH certificate authority. Same key material for SSH and token auth. Identity resolves through `IdentityProvider` trait, decoupling core from identity storage. (ADR-012, ADR-023, ADR-029)
|
||||
|
||||
8. **Error handling follows a consistent layered pattern** — Transport and auth errors cause reconnection (client, with exponential backoff) or connection rejection (server). Channel-level errors (target unreachable, proxy failure) close the individual channel without killing the session. Library API errors propagate via `anyhow::Result` / `thiserror` types. CLI reports errors to stderr with appropriate exit codes. NAPI errors are marshalled as JavaScript exceptions.
|
||||
8. **NAPI exposes both connect() and serve()** — The napi-rs wrapper provides client and server functionality, using napi-rs as the FFI bridge. The NAPI layer is transport-agnostic and not tied to pubsub. (ADR-015, ADR-016)
|
||||
|
||||
9. **Static/dynamic config split** — Transport-level settings (listen address, TLS certs) are immutable after startup. Auth, forwarding policy, and rate limits are hot-reloadable via `ArcSwap<DynamicConfig>`. (ADR-030)
|
||||
|
||||
10. **Forwarding policy enforced before proxy spawn** — Each `channel_open_direct_tcpip` is checked against `ForwardingPolicy` before a TCP connection is made. Default-allow preserves current behavior. (ADR-031)
|
||||
|
||||
11. **OperationEnv as universal composition mechanism** — Handlers call `context.env.invoke(namespace, op, input)` regardless of dispatch path (local, irpc service, remote call protocol). (ADR-033)
|
||||
|
||||
12. **Event boundary discipline** — Domain events (Honker streams) stay within the owning service. irpc calls are synchronous and in-cluster. Call protocol `EventEnvelope` is the only thing that crosses node boundaries. (ADR-032)
|
||||
|
||||
13. **Error handling follows a consistent layered pattern** — Transport and auth errors cause reconnection (client, with exponential backoff) or connection rejection (server). Channel-level errors (target unreachable, proxy failure) close the individual channel without killing the session. Library API errors propagate via `anyhow::Result` / `thiserror` types. CLI reports errors to stderr with appropriate exit codes. NAPI errors are marshalled as JavaScript exceptions.
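
The static/dynamic config split named in the principles above can be sketched as follows. This is a simplified illustration under several assumptions: the field names are hypothetical, the `load` and `reload` methods on `ConfigReloadHandle` are guesses at a reasonable shape, and `RwLock<Arc<_>>` from the standard library stands in for `ArcSwap<DynamicConfig>` so the example has no external dependencies.

```rust
use std::sync::{Arc, RwLock};

/// Immutable after startup (hypothetical field).
pub struct StaticConfig {
    pub listen_addr: String,
}

/// Hot-reloadable settings (hypothetical fields).
#[derive(Debug, PartialEq)]
pub struct DynamicConfig {
    pub authorized_fingerprints: Vec<String>,
    pub max_auth_attempts: u32,
}

/// Stand-in for an ArcSwap-backed handle: readers take a cheap snapshot,
/// and a reload swaps the whole config at once.
#[derive(Clone)]
pub struct ConfigReloadHandle {
    current: Arc<RwLock<Arc<DynamicConfig>>>,
}

impl ConfigReloadHandle {
    pub fn new(initial: DynamicConfig) -> Self {
        ConfigReloadHandle {
            current: Arc::new(RwLock::new(Arc::new(initial))),
        }
    }

    /// Snapshot the current config. A snapshot held by an in-flight
    /// connection is unaffected by later reloads.
    pub fn load(&self) -> Arc<DynamicConfig> {
        Arc::clone(&*self.current.read().unwrap())
    }

    /// Replace the dynamic config; new readers see the new value.
    pub fn reload(&self, next: DynamicConfig) {
        *self.current.write().unwrap() = Arc::new(next);
    }
}
```

The snapshot-then-swap pattern means a reload never mutates a config that some connection is already reading, which is the behavior a lock-free swap would also give, without the lock.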
|
||||
|
||||
## Design Decisions
|
||||
|
||||
@@ -88,7 +172,7 @@ The `alknet-core` crate exports the pluggable components for embedding or progra
|
||||
| [008](decisions/008-acme-lets-encrypt.md) | ACME/Let's Encrypt | Auto-provision TLS certs, domain and IP paths |
| [009](decisions/009-default-iroh-relay.md) | Default iroh relay | n0 relay by default, `--iroh-relay` override |
| [010](decisions/010-transport-chaining-cli.md) | Transport chaining | `--proxy` works with all transports natively |
| [011](decisions/011-no-ssh-config-programmatic-api.md) | Programmatic-first | No file-based config; options are structs, env vars, CLI flags |
| [011](decisions/011-no-ssh-config-programmatic-api.md) | Programmatic-first | No SSH config files; options are structs, env vars, CLI flags (amended by ADR-030 for optional TOML) |
| [012](decisions/012-auth-ed25519-and-cert-authority.md) | Key + cert-authority | Ed25519 keys + OpenSSH CA; no password auth |
| [013](decisions/013-fail2ban-friendly-logging.md) | Fail2ban-friendly | Structured auth logs + built-in rate limiting |
| [014](decisions/014-defer-tun-recommend-socks5-proxy.md) | Defer TUN | Use tun2proxy for VPN-like behavior; no alknet-tun binary |
|
||||
@@ -97,17 +181,46 @@ The `alknet-core` crate exports the pluggable components for embedding or progra
|
||||
| [017](decisions/017-stealth-mode-protocol-multiplexing.md) | Stealth mode | Protocol multiplexing on port 443 |
| [018](decisions/018-control-channel-for-pubsub.md) | Control channel | Reserved `alknet-control` destination for pubsub |
| [019](decisions/019-proxy-dual-semantics.md) | Proxy dual semantics | `--proxy` routes transport on client, data on server |
| [023](decisions/023-unified-auth-shared-key-material.md) | Unified auth | Same key material for SSH and token auth |
| [024](decisions/024-bidirectional-call-protocol.md) | Bidirectional call protocol | Both sides can initiate calls |
| [025](decisions/025-handler-spec-separation.md) | Handler/spec separation | Downstream registers operations without modifying core |
| [026](decisions/026-transport-interface-separation.md) | Three-layer model | SSH is Layer 2, not Layer 1 |
| [027](decisions/027-crate-decomposition.md) | Crate decomposition | Six crates, acyclic deps, feature-gated irpc |
| [028](decisions/028-auth-irpc-service.md) | Auth as irpc service | IdentityProvider is the contract, irpc is one backend |
| [029](decisions/029-identity-core-type.md) | Identity as core type | `Identity` and `IdentityProvider` in alknet-core |
| [030](decisions/030-static-dynamic-config-split.md) | Static/dynamic config | ArcSwap for hot-reloadable auth and forwarding |
| [031](decisions/031-forwarding-policy.md) | Forwarding policy | Per-identity, per-destination, per-transport rules |
| [032](decisions/032-event-boundary-discipline.md) | Event boundary | Domain events never cross service boundaries |
| [033](decisions/033-operationenv-irpc-call-protocol.md) | OperationEnv | Universal composition, three dispatch paths |
| [034](decisions/034-head-worker-terminology.md) | Head/worker | Replaces hub/spoke terminology |
|
||||
|
||||
## Open Questions
|
||||
|
||||
All open questions have been resolved. See [open-questions.md](open-questions.md) for resolution details.
|
||||
See [open-questions.md](open-questions.md) for all open and resolved questions.
|
||||
Key open questions: OQ-15 (QUIC coexistence), OQ-19 (WebTransport TLS),
|
||||
OQ-20 (worker registration), OQ-IF-01 (Interface session / EventEnvelope
|
||||
relationship).
|
||||
|
||||
## References
|
||||
|
||||
- [Feasibility Assessment](../../../conversations/research/ssh-tunnel-vpn-alternative-feasibility.md)
|
||||
- [transport.md](transport.md) — Transport abstraction (Layer 1)
|
||||
- [interface.md](interface.md) — Interface layer (Layer 2)
|
||||
- [call-protocol.md](call-protocol.md) — Call protocol (Layer 3)
|
||||
- [auth.md](auth.md) — Unified authentication
|
||||
- [identity.md](identity.md) — Identity and IdentityProvider
|
||||
- [configuration.md](configuration.md) — StaticConfig, DynamicConfig, ForwardingPolicy
|
||||
- [services.md](services.md) — irpc service layer, OperationEnv
|
||||
- [server.md](server.md) — Server acceptance, channel handling
|
||||
- [client.md](client.md) — Client connection, SOCKS5, port forwarding
|
||||
- [napi-and-pubsub.md](napi-and-pubsub.md) — NAPI wrapper and pubsub adapter
|
||||
- [storage.md](storage.md) — alknet-storage: metagraph, identity, ACL
|
||||
- [flowgraph.md](flowgraph.md) — alknet-flowgraph: call graph, operation graph
|
||||
- [secret-service.md](secret-service.md) — alknet-secret: BIP39, SLIP-0010, AES-GCM
|
||||
- [Feasibility Assessment](../research/feasibility/ssh-tunnel-vpn-alternative-feasibility.md)
|
||||
- [russh API](/workspace/russh) — SSH client/server library
|
||||
- [Dispatch](/workspace/@alkdev/dispatch) — Reference implementation of russh port forwarding
|
||||
- [iroh](/workspace/iroh) — P2P QUIC connections
|
||||
- [tun2proxy](https://github.com/tun2proxy/tun2proxy) — Recommended external TUN-to-SOCKS5 tool
|
||||
- [Production certbot setup](/workspace/system/dev1/certbot.md) — Let's Encrypt on our infrastructure
|
||||
- [Production fail2ban setup](/workspace/system/dev1/fail2ban.md) — fail2ban with nftables on our infrastructure
|
||||
- [irpc](/workspace/irpc) — iroh streaming RPC
|
||||
- [Production certbot setup](../research/ops/certbot.md) — Let's Encrypt on our infrastructure
|
||||
- [Production fail2ban setup](../research/ops/fail2ban.md) — fail2ban with nftables on our infrastructure
|
||||
@@ -166,20 +166,16 @@ never leaves the secret service node.
|
||||
|
||||
## Open Questions
|
||||
|
||||
- **OQ-SVC-01**: Should the secret service support multiple seed phrases (one per
|
||||
tenant)? See [open-questions.md](open-questions.md).
|
||||
- **OQ-SVC-01**: Should the secret service support multiple seed phrases (one
|
||||
per tenant)? See [open-questions.md](open-questions.md).
|
||||
|
||||
- **OQ-SVC-02**: Should service protocols use postcard (binary) or JSON for
|
||||
remote calls? Postcard for irpc (Rust-to-Rust), JSON for call protocol
|
||||
(cross-language). See [open-questions.md](open-questions.md).
|
||||
remote calls? See [open-questions.md](open-questions.md).
|
||||
|
||||
- **OQ-SVC-03**: How does the secret service integrate with the existing
|
||||
`EncryptedDataSchema` from `@alkdev/storage`? The Rust implementation replaces
|
||||
PBKDF2 password-based encryption with derived AES-256-GCM keys. The
|
||||
`EncryptedData` format is a superset.
|
||||
`EncryptedDataSchema` from `@alkdev/storage`? See [open-questions.md](open-questions.md).
|
||||
|
||||
- **OQ-SVC-04**: Should workers cache derived keys locally? Yes, with a TTL
|
||||
(default: 1 hour). The head can revoke by invalidating the session.
|
||||
- **OQ-SVC-04**: Should workers cache derived keys locally? See [open-questions.md](open-questions.md).
|
||||
|
||||
## Design Decisions
|
||||
|
||||
|
||||
@@ -1,6 +1,6 @@
|
||||
---
|
||||
status: reviewed
|
||||
last_updated: 2026-06-02
|
||||
last_updated: 2026-06-07
|
||||
---
|
||||
|
||||
# Server
|
||||
@@ -51,21 +51,30 @@ The server is the tunnel endpoint. It receives SSH channels requesting TCP conne
|
||||
|
||||
### Authentication
|
||||
|
||||
The server supports Ed25519 public key authentication (default) and OpenSSH certificate authority authentication (ADR-012):
|
||||
The server authenticates connections through the `IdentityProvider` trait (ADR-029, [identity.md](identity.md)). `IdentityProvider` decouples the server from any specific identity storage — the server resolves an identity, it doesn't manage keys.
|
||||
|
||||
**Ed25519 public key** (default):
|
||||
1. Load authorized keys from a specified path or in-memory data
|
||||
2. `auth_publickey()` checks the presented key against the authorized set
|
||||
3. Uses constant-time comparison to prevent timing attacks
|
||||
**Phase 1 implementation**: `ConfigIdentityProvider` (in alknet-core) reads from `ArcSwap<DynamicConfig.auth>` (ADR-030). Every authorized key gets a default scope set. No database required. This is the default for CLI and single-node deployments.
|
||||
|
||||
**OpenSSH certificate authority** (ADR-012):
|
||||
1. Load a trusted CA public key (`--cert-authority <path>`)
|
||||
2. `auth_publickey()` validates the presented certificate: checks CA signature, expiry, and principal restrictions
|
||||
3. Supports certificate options: `permit-port-forwarding`, `no-pty`, `source-address`
|
||||
**Future implementation**: `StorageIdentityProvider` (in alknet-storage, not yet built) backed by SQLite `peer_credentials` and `api_keys` tables plus the ACL graph. The server doesn't need to know which implementation is active — it goes through the trait.
|
||||
|
||||
This enables multi-user deployments where adding one CA line to `authorized_keys` is simpler than managing individual keys for every user.
|
||||
The server supports two auth presentation paths (ADR-023, [auth.md](auth.md)):
|
||||
|
||||
**No password authentication over SSH.** Keys and certificates are sufficient and more secure. If a local SOCKS5 proxy needs its own auth layer, that's a separate concern.
|
||||
**SSH public key auth** (SSH transports):
|
||||
1. `auth_publickey()` callback receives the presented key
|
||||
2. Delegates to `IdentityProvider::resolve_from_fingerprint()` with the key fingerprint
|
||||
3. Returns `Accept` (with `Identity` attached) or `Reject`
|
||||
|
||||
**Ed25519 + OpenSSH certificate authority** (ADR-012):
|
||||
1. If no direct key match, validate the presented certificate against trusted cert-authorities
|
||||
2. Check CA signature, expiry, and principal restrictions
|
||||
3. Certificate options: `permit-port-forwarding`, `no-pty`, `source-address`
|
||||
|
||||
**Token auth** (non-SSH transports, WebTransport):
|
||||
1. Extract token from URL path or `Authorization` header
|
||||
2. Delegate to `IdentityProvider::resolve_from_token()`
|
||||
3. Same verification: same authorized keys set, same `Identity` result (ADR-023)
|
||||
|
||||
**No password authentication over SSH channels.** Keys and certificates are sufficient and more secure. If a local SOCKS5 proxy needs its own auth layer, that's a separate concern.
|
||||
|
||||
### Key Material Format
|
||||
|
||||
@@ -87,7 +96,9 @@ When a client opens a `channel_open_direct_tcpip(host, port, originator_addr, or
|
||||
|
||||
**Reserved destination** — If `host` starts with `alknet-` (e.g., `alknet-control`), the server routes the channel internally instead of connecting to a TCP target. The primary reserved destination is `alknet-control:0`, which bridges the channel to the local pubsub event bus (ADR-018).
|
||||
|
||||
**Regular destination** — For all other targets:
|
||||
**Forwarding policy check** — Before the proxy task is spawned for any non-reserved destination, the server evaluates `ForwardingPolicy` against the authenticated `Identity` (ADR-031, [configuration.md](configuration.md)). The policy check uses `Identity.id` and `Identity.scopes` from the identity resolved during auth. If the policy denies the destination, the channel open is rejected — no TCP connection is attempted. The default policy (`ForwardingPolicy::allow_all()`) preserves current behavior.
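
A minimal sketch of such a policy check is shown below. Only `ForwardingPolicy::allow_all()`, `Identity.id`, and `Identity.scopes` come from the spec; the `Rule` and `Decision` types, the first-match-wins ordering, and the exact-host matching are assumptions made for illustration. Reserved `alknet-` destinations are assumed to be routed before this check and are not modeled.

```rust
pub struct Identity {
    pub id: String,
    pub scopes: Vec<String>,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Decision {
    Allow,
    Deny,
}

/// Hypothetical rule shape: `None` in a field means "matches anything".
pub struct Rule {
    pub required_scope: Option<String>,
    pub host: Option<String>,
    pub port: Option<u16>,
    pub decision: Decision,
}

pub struct ForwardingPolicy {
    rules: Vec<Rule>,
    default: Decision,
}

impl ForwardingPolicy {
    /// Default policy: no rules, everything allowed.
    pub fn allow_all() -> Self {
        ForwardingPolicy { rules: Vec::new(), default: Decision::Allow }
    }

    pub fn new(rules: Vec<Rule>, default: Decision) -> Self {
        ForwardingPolicy { rules, default }
    }

    /// First matching rule wins (an assumption); otherwise the default applies.
    /// Intended to run before any outbound TCP connection is attempted.
    pub fn evaluate(&self, identity: &Identity, host: &str, port: u16) -> Decision {
        for rule in &self.rules {
            let scope_ok = match &rule.required_scope {
                Some(scope) => identity.scopes.iter().any(|s| s == scope),
                None => true,
            };
            let host_ok = match &rule.host {
                Some(h) => h.as_str() == host,
                None => true,
            };
            let port_ok = match rule.port {
                Some(p) => p == port,
                None => true,
            };
            if scope_ok && host_ok && port_ok {
                return rule.decision;
            }
        }
        self.default
    }
}
```

With this shape, a server would call `evaluate` with the session's resolved `Identity` and reject the channel open on `Decision::Deny`, so a denied destination never results in a socket being opened.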
|
||||
|
||||
**Regular destination** — For targets that pass the forwarding policy check:
|
||||
|
||||
1. **Connection** — connect to `host:port`, either directly or via the configured outbound proxy
|
||||
2. **Outbound connection** — connect to the target, either directly or via the configured outbound proxy
|
||||
@@ -122,17 +133,23 @@ This makes the server appear as an ordinary web server to port scanners and DPI
|
||||
The server handler implements `russh::server::Handler` with two primary responsibilities:
|
||||
|
||||
**Authentication (`auth_publickey`)**:
|
||||
- Check the presented key against the configured `authorized_keys` set (constant-time comparison)
|
||||
- If no direct match, check whether the key is a certificate signed by a trusted cert-authority
|
||||
- Validate certificate signature, expiry, and principal restrictions (e.g., `permit-port-forwarding`, `no-pty`, `source-address`)
|
||||
- Delegate to `IdentityProvider::resolve_from_fingerprint()` with the presented key fingerprint
|
||||
- If identity resolved, return `Accept` with the `Identity` attached to the session
|
||||
- If no identity, check certificate authority: validate CA signature, expiry, principals
|
||||
- Return `Accept` or `Reject`
|
||||
|
||||
**Channel handling (`channel_open_direct_tcpip`)**:
|
||||
- If the destination host starts with `alknet-`, route internally (control channel, ADR-018)
|
||||
- Otherwise, connect to `host:port` (directly or via the configured outbound proxy)
|
||||
- Otherwise, evaluate `ForwardingPolicy` against the session's `Identity` (ADR-031)
|
||||
- If denied, reject the channel open
|
||||
- If allowed, connect to `host:port` (directly or via the configured outbound proxy)
|
||||
- Spawn a bidirectional proxy task between the SSH channel and the outbound TCP stream
|
||||
- Return the channel for data flow
|
||||
|
||||
### Interface Abstraction
|
||||
|
||||
SSH is one interface at Layer 2 in the three-layer model (ADR-026, [interface.md](interface.md)). The current `ServerHandler` will be refactored into `SshInterface` — it manages SSH session concerns (handshake, auth delegation, channel multiplexing). Forwarding policy, operation routing, and call protocol handling are Layer 3 concerns that live outside the interface. This refactoring is the most invasive code change in Phase 1 (integration-plan, Phase 1.8).
|
||||
|
||||
### Logging and Rate Limiting
|
||||
|
||||
**Logging** (for fail2ban integration on Linux):
|
||||
@@ -159,6 +176,25 @@ These provide abuse protection on platforms without fail2ban (macOS, Windows, BS
|
||||
|
||||
### CLI Interface
|
||||
|
||||
Configuration sources (in priority order): CLI flags, environment variables, optional `--config` TOML file (ADR-030). The TOML config file is a convenience input for reproducible deployments; it does not replace `ServeOptions` (ADR-011).
|
||||
|
||||
Multi-transport listeners use `[[listeners]]` in the TOML config (ADR-030):
|
||||
|
||||
```toml
[[listeners]]
transport = "tls"
listen = "0.0.0.0:443"

[listeners.tls]
cert = "/etc/alknet/tls/cert.pem"
key = "/etc/alknet/tls/key.pem"

[[listeners]]
transport = "iroh"
```
|
||||
|
||||
Currently, the server binds to a single transport at a time. Multi-transport via `[[listeners]]` is coming per ADR-030.
|
||||
|
||||
```bash
|
||||
# Basic server (SSH on port 22)
|
||||
alknet serve --key ~/.ssh/ssh_host_ed25519_key
|
||||
@@ -230,7 +266,9 @@ No listening port is needed. The server connects outbound to the iroh relay (def
|
||||
- The server does not log tunnel destinations (ADR-006). Auth events and connection events are logged for fail2ban integration (ADR-013).
|
||||
- Destination strings beginning with `alknet-` are reserved for internal use (ADR-018). The server must not attempt TCP connections to `alknet-*` destinations — these are intercepted for control channel routing.
|
||||
- One `ServerHandler` instance per connection. Handler state is not shared between connections (unless explicitly configured via `Arc` shared state for things like connection limits).
|
||||
- The server binds to a single transport at a time. Running multiple transports (e.g., TCP + iroh) simultaneously requires separate processes or a future multiplexing feature.
|
||||
- The server currently binds to a single transport at a time. Multi-transport via `[[listeners]]` is coming per ADR-030.
|
||||
- Forwarding policy is evaluated before every channel proxy spawn. Denied channels are rejected immediately (ADR-031).
|
||||
- Auth resolves through `IdentityProvider` (ADR-029). Phase 1 uses `ConfigIdentityProvider` backed by `ArcSwap<DynamicConfig>` (ADR-030). `StorageIdentityProvider` (Phase 2+) replaces it for production deployments with SQLite.
|
||||
- ACME support requires the `acme` feature flag. Without it, only manual TLS certs are supported.
|
||||
- No password authentication over SSH channels. Key-based and cert-authority only (ADR-012).
|
||||
- Stealth mode (`--stealth`) requires TLS transport. It has no effect on TCP or iroh transports (ADR-017).
|
||||
@@ -273,3 +311,15 @@ None — all resolved.
|
||||
| [017](decisions/017-stealth-mode-protocol-multiplexing.md) | Stealth mode | Protocol multiplexing on port 443 |
| [018](decisions/018-control-channel-for-pubsub.md) | Control channel | Reserved `alknet-control` destination for pubsub |
| [019](decisions/019-proxy-dual-semantics.md) | Proxy dual semantics | `--proxy` routes transport on client, data on server |
| [026](decisions/026-transport-interface-separation.md) | Three-layer model | SSH is Layer 2 interface, ServerHandler → SshInterface |
| [028](decisions/028-auth-irpc-service.md) | Auth as irpc service | IdentityProvider is the contract; irpc service is one backend |
| [029](decisions/029-identity-core-type.md) | Identity as core type | IdentityProvider trait in alknet-core |
| [030](decisions/030-static-dynamic-config-split.md) | Static/dynamic config split | ArcSwap for dynamic config, ConfigReloadHandle |
| [031](decisions/031-forwarding-policy.md) | Forwarding policy | Evaluated before channel proxy spawn |
|
||||
|
||||
## References
|
||||
|
||||
- [configuration.md](configuration.md) — DynamicConfig, ForwardingPolicy, ConfigReloadHandle
|
||||
- [identity.md](identity.md) — IdentityProvider trait, Identity struct
|
||||
- [auth.md](auth.md) — Unified auth, AuthPolicy, token auth
|
||||
- [interface.md](interface.md) — Interface trait, SshInterface, three-layer model
|
||||
@@ -20,8 +20,8 @@ last_updated: 2026-06-07
|
||||
The irpc service layer decomposes alknet's core responsibilities into
|
||||
independently testable, deployable, and replaceable components. Auth, Secret,
|
||||
Config, and Storage are irpc protocol enums that work both as in-process async
|
||||
boundaries (tokio channels) and cross-process/cross-network (QUIC streams via
|
||||
noq). OperationEnv is the universal composition mechanism that unifies local
|
||||
boundaries (tokio channels) and cross-process/cross-network (irpc over iroh
|
||||
QUIC streams). OperationEnv is the universal composition mechanism that unifies local
|
||||
dispatch, irpc service dispatch, and remote call protocol dispatch.
|
||||
|
||||
## Why
|
||||
@@ -209,13 +209,10 @@ layer to be built — they are Phase 2+ concerns.
|
||||
## Open Questions
|
||||
|
||||
- **OQ-SVC-01**: Should the secret service support multiple seed phrases (one
|
||||
per tenant)? Defer for now — one seed per node. Multi-seed can be added
|
||||
later by indexing the `Unlock` call with a tenant ID.
|
||||
per tenant)? See [open-questions.md](open-questions.md).
|
||||
|
||||
- **OQ-SVC-02**: Should service protocols use postcard (binary) or JSON for
|
||||
remote calls? Postcard for irpc (Rust-to-Rust, efficient). JSON for call
|
||||
protocol (cross-language, universal). The irpc remote path naturally uses
|
||||
postcard.
|
||||
remote calls? See [open-questions.md](open-questions.md).
|
||||
|
||||
## Design Decisions
|
||||
|
||||
|
||||
@@ -197,17 +197,12 @@ dependency.
|
||||
## Open Questions
|
||||
|
||||
- **OQ-SVC-03**: How does the secret service integrate with the existing
|
||||
`EncryptedDataSchema` from `@alkdev/storage`? The Rust implementation replaces
|
||||
PBKDF2 password-based encryption with derived AES-256-GCM keys. The
|
||||
`EncryptedData` format is a superset — old format can be migrated by
|
||||
re-encrypting with the new key.
|
||||
`EncryptedDataSchema` from `@alkdev/storage`? See [open-questions.md](open-questions.md).
|
||||
|
||||
- **OQ-SVC-04**: Should workers cache derived keys locally? Yes, with a TTL
|
||||
(default: 1 hour). The head can revoke by invalidating the session.
|
||||
- **OQ-SVC-04**: Should workers cache derived keys locally? See [open-questions.md](open-questions.md).
|
||||
|
||||
- **OQ-SVC-05**: How does the smart contract (NFT-based ACL) interact with the
|
||||
secret service? The Ethereum signing key (`m/44'/60'/0'/0/0`) is derived from
|
||||
the same seed. The smart contract is a separate concern.
|
||||
- **OQ-SVC-05**: How does the NFT-based ACL smart contract interact with the
|
||||
secret service? See [open-questions.md](open-questions.md).
|
||||
|
||||
## Design Decisions
|
||||
|
||||
|
||||
91
docs/research/event-sourcing/event_source_types.md
Normal file
@@ -0,0 +1,91 @@
|
||||
|
||||
|
||||
|
||||
Here is an article tailored specifically to untangle these concepts. It is structured not just as a conceptual guide, but as a **diagnostic tool**—perfect for feeding into an AI coding CLI to sniff out architectural smells and "spaghetti concepts" in a codebase.
|
||||
|
||||
***
|
||||
|
||||
# Deconstructing Event-Driven Architecture: Untangling "Spaghetti Concepts"
|
||||
|
||||
In modern software architecture, the term "Event" has fallen victim to *semantic diffusion*—a concept popularized by Martin Fowler where a term becomes so widely used that it loses its original, specific meaning. When developers use the same word to describe state persistence, data distribution, and asynchronous notifications, the result is "Spaghetti Concepts."
|
||||
|
||||
Just like spaghetti code, spaghetti concepts lead to tight coupling, brittle systems, and unpredictable side effects. To fix an Event-Driven Architecture (EDA), we must draw hard boundaries around what an "event" is actually doing in any given context.
|
||||
|
||||
This guide breaks down the distinct types of events, their proper use cases, and the structural anti-patterns (Conflation Points) that occur when they are mixed up.
|
||||
|
||||
---
|
||||
|
||||
## 1. Event Sourcing (State Persistence)
|
||||
**The Concept:** Event Sourcing is a method of persisting state. Instead of saving the *current* state of an entity (e.g., `Quantity: 27`) in a database row, you save the *history of facts* that led to that state (e.g., `Received 30`, `Shipped 5`, `Adjusted +2`). The current state is derived by replaying these facts.
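As a minimal sketch of "state is derived by replaying facts", the inventory example above can be folded into a quantity in Rust. The `InventoryEvent` enum and the `replay` function are illustrative names chosen for this sketch, not part of any event-store library:

```rust
/// Facts recorded for one inventory item (hypothetical names for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum InventoryEvent {
    Received(u32),
    Shipped(u32),
    Adjusted(i32),
}

/// Derive the current quantity by folding over the full event history.
/// No "current quantity" is ever stored; it is recomputed from the facts.
pub fn replay(events: &[InventoryEvent]) -> i64 {
    events.iter().fold(0i64, |qty, event| match *event {
        InventoryEvent::Received(n) => qty + i64::from(n),
        InventoryEvent::Shipped(n) => qty - i64::from(n),
        InventoryEvent::Adjusted(delta) => qty + i64::from(delta),
    })
}
```

Replaying `Received(30)`, `Shipped(5)`, `Adjusted(2)` yields 27, matching the `Quantity: 27` row a state-based model would have stored.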
|
||||
|
||||
**The Golden Rule:** Event Sourcing is an **internal implementation detail** of a specific service or Aggregate. It is highly specific to the domain logic.
|
||||
|
||||
**How to Identify It:**
|
||||
* Uses a specialized stream database (like EventStoreDB).
|
||||
* Events are named in the past tense representing highly specific domain actions (`InventoryAdjusted`, `OrderPlaced`).
|
||||
* The system reads a stream of these events to reconstruct an object in memory before applying new business rules.
|
||||
|
||||
### 🚨 Conflation Point: Leaking the Event Store (The Database Reach-In)
|
||||
**The Smell:** Service B connects directly to Service A’s event store to read its events and react to them.
|
||||
**Why it’s bad:** Because Event Sourcing events are internal state, exposing them externally completely shatters Service A's encapsulation. If Service A refactors how it calculates inventory, Service B breaks.
|
||||
**The Fix:** Service A should project its internal Event Sourcing events into generalized **Integration Events** (see below) and publish those to a message broker (like RabbitMQ or Kafka) for Service B to consume.
|
||||
|
||||
---
|
||||
|
||||
## 2. Event-Carried State Transfer (Data Distribution)
|
||||
**The Concept:** Also known as "Fat Events," this pattern is used to distribute data across services to avoid synchronous API calls (temporal coupling). If Service B needs to know about a Product's price to calculate a shopping cart total, Service A publishes an event containing the *entire* current state of that product. Service B listens to this event and builds a local, read-only cache (a projection).
|
||||
|
||||
**The Golden Rule:** These events exist to answer the question, *"What does the data look like now?"* without requiring a synchronous HTTP callback.
|
||||
|
||||
**How to Identify It:**
|
||||
* Events often have generic CRUD-like names (`ProductUpdated`, `CustomerCreated`).
|
||||
* Payloads are "fat"—they contain a lot of data (ID, Name, Price, Category, etc.).
|
||||
* Often implemented using Change Data Capture (CDC) tools like Debezium reading from a primary database and publishing to Kafka.
|
||||
|
||||
### 🚨 Conflation Point: Event Sourcing vs. State Transfer
|
||||
**The Smell:** Using a state transfer tool (like Debezium publishing `RowUpdated` events) as a makeshift Event Sourcing log to derive business logic.
|
||||
**Why it’s bad:** A database row update doesn't tell you *why* the data changed. Was a user's address updated because they moved, or because there was a typo? Business intent is lost.
|
||||
**The Fix:** Keep CDC and state transfer events strictly for updating local read-caches in downstream services. Do not use them to drive complex business workflows that rely on "intent."
|
||||
|
||||
---
|
||||
|
||||
## 3. Notification Events (Behavioral Triggers)
|
||||
**The Concept:** Also known as "Thin Events," these are lean messages broadcasted to notify the system that a business milestone has occurred. They usually contain minimal data—often just an Entity ID and an action.
|
||||
|
||||
**The Golden Rule:** They act as an asynchronous "tap on the shoulder" to tell downstream services to trigger their own workflows (Choreography).
|
||||
|
||||
**How to Identify It:**
|
||||
* Payloads are "thin" (e.g., `{ "Event": "OrderShipped", "OrderId": "123" }`).
|
||||
* Used heavily in integrations (e.g., triggering an email via AWS SES, or notifying a shipping warehouse).
|
||||
|
||||
### 🚨 Conflation Point: The Synchronous Callback Trap (Boomerang Coupling)
|
||||
**The Smell:** Service A publishes a thin `OrderPlaced` event. Service B receives it, but to do its job, it must immediately make a synchronous HTTP REST call back to Service A to fetch the order details.
|
||||
**Why it’s bad:** If Service A goes down, Service B fails. You have successfully implemented Event-Driven Architecture, but kept the exact synchronous temporal coupling you were trying to eliminate. Furthermore, a flood of events becomes a flood of callbacks, which can act as a self-inflicted denial-of-service on Service A.
|
||||
**The Fix:** If downstream services *always* need the data to process the event, upgrade the Notification Event to an Event-Carried State Transfer ("Fat Event") by including the required data in the payload.
|
||||
|
||||
---
|
||||
|
||||
## 4. Domain Events vs. Integration Events (The Boundary Rule)
|
||||
*Own Insight / DDD Integration*
|
||||
|
||||
A massive source of spaghetti concepts is failing to differentiate between events meant for *inside* the house and events meant for *outside* the house.
|
||||
|
||||
* **Domain Events:** Fired and consumed *within the same service boundary*. They can contain rich, complex, internal domain models because the producer and consumer share the same codebase/ubiquitous language.
|
||||
* **Integration Events:** Fired *across service boundaries*. They should be simple, generalized, and stripped of internal jargon or complex objects.
|
||||
|
||||
### 🚨 Conflation Point: The Leaky Domain Model
|
||||
**The Smell:** A microservice publishes an event to a global Kafka topic, and the payload contains internal database IDs, complex nested objects, or serialized language-specific data types (like Java/C# specific enums).
|
||||
**Why it’s bad:** Downstream services are now strictly coupled to the internal data structure of the upstream service.
|
||||
**The Fix:** Implement an Anti-Corruption Layer. The producing service should catch its own Domain Event, map the data to a simplified, standardized, versioned Integration Event, and publish *that* to the wider system.
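One possible shape for that mapping step, sketched in Rust. All type and field names here (`OrderShippedDomain`, `OrderShippedV1`, `schema_version`, and so on) are hypothetical and exist only to illustrate the boundary; a real system would pick its own schema:

```rust
/// Internal domain event: free to carry internal IDs and rich types,
/// because producer and consumer live in the same service.
#[derive(Debug, Clone)]
pub struct OrderShippedDomain {
    pub internal_row_id: u64,     // database primary key, must not leak
    pub public_order_id: String,  // identifier that other services may see
    pub warehouse_bin: (u8, u16), // internal warehouse layout detail
    pub shipped_at_unix: u64,
}

/// Integration event: flat, versioned, and stripped of internal detail.
#[derive(Debug, Clone, PartialEq)]
pub struct OrderShippedV1 {
    pub schema_version: u32,
    pub order_id: String,
    pub shipped_at_unix: u64,
}

/// The anti-corruption mapping: the only place that knows both shapes.
impl From<&OrderShippedDomain> for OrderShippedV1 {
    fn from(event: &OrderShippedDomain) -> Self {
        OrderShippedV1 {
            schema_version: 1,
            order_id: event.public_order_id.clone(),
            shipped_at_unix: event.shipped_at_unix,
        }
    }
}
```

Because the conversion is the single point of contact between the two models, the producing service can rename `warehouse_bin` or change its primary keys without downstream consumers noticing.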
|
||||
|
||||
---
|
||||
|
||||
## Code Review CLI Prompt: "The Conflation Detector"
|
||||
*(Note: Feed the following heuristics to your AI CLI alongside this article to review your codebase).*
|
||||
|
||||
**AI CLI Instructions:** Scan the provided codebase for Event-Driven Architecture anti-patterns. Flag code that violates the conceptual boundaries described in the article. Look specifically for:
|
||||
1. **Shared Event Stores:** Are multiple distinct microservices connecting to the same EventStoreDB or reading the exact same raw Event Sourcing stream?
|
||||
2. **Boomerang Callbacks:** Is an event consumer receiving a message from a broker (RabbitMQ/Kafka/Azure Service Bus), extracting an ID, and immediately making an HTTP request to the service that originated the event?
|
||||
3. **Leaky Domain Models:** Are internal entity objects (e.g., classes mapped directly to ORMs like Entity Framework or Hibernate) being serialized directly into event payloads sent to external message brokers?
|
||||
4. **Misused CDC:** Are Debezium/database-trigger events being used to trigger business logic workflows, rather than simply updating read-models/caches?
|
||||
5. **Fat Notification Trap:** Are Notification events carrying massive payloads just to trigger an email, when a thin event would suffice? Or conversely, are thin events starving consumers of necessary data?
|
||||
@@ -0,0 +1,773 @@
|
||||
# SSH Tunnel VPN Alternative — Feasibility Assessment
|
||||
|
||||
**Date**: 2026-06-01
|
||||
**Status**: Feasibility assessment / architecture sketch
|
||||
**Updated**: 2026-06-01 — Added iroh transport analysis (§11)
|
||||
|
||||
## 1. Problem Statement
|
||||
|
||||
Countries in the "developed west" (UK, CA, etc.) are increasingly banning or restricting VPNs at the protocol level. The valid use case of a VPN — a *virtual private network* for securing traffic on hostile networks, accessing private infrastructure, and tunneling between trusted endpoints — gets caught in the crossfire when VPNs are treated primarily as location-spoofing tools.
|
||||
|
||||
SSH-based tunnels cover the same functional ground without being a VPN protocol. Blocking SSH would break the internet in critical ways (infrastructure management, CI/CD, development workflows). The goal is to build a dead-simple, self-hostable Rust client/server that provides VPN-like functionality over SSH, with optional TLS wrapping for traffic obfuscation.
|
||||
|
||||
## 2. Reference Codebase Analysis
|
||||
|
||||
### 2.1 Dispatch (`/workspace/@alkdev/dispatch`)
|
||||
|
||||
Dispatch proves russh usage well within scope. Key takeaways:
|
||||
|
||||
- **Pure SSH client** — `client::Handler` is a zero-sized type, auto-accepts server keys. Minimal boilerplate.
|
||||
- **Arc-wrapped Handle pattern** — `Arc<client::Handle<Client>>` enables sharing across concurrent tasks (port forwarding, SFTP, exec).
|
||||
- **Port forwarding via `channel_open_direct_tcpip`** — Already implemented. Local TCP listener → `direct-tcpip` SSH channel → `tokio::io::copy_bidirectional`. This is the standard SSH `-L` pattern, implemented programmatically.
|
||||
- **Channel-per-operation model** — Each operation opens its own SSH channel on a shared session. Multiplexing is handled by russh internally.
|
||||
- **Channel.into_stream()** — Converts SSH channels to `AsyncRead + AsyncWrite` streams, enabling use with any tokio I/O combinator.
|
||||
|
||||
The dispatch codebase is clean and demonstrates that the core SSH mechanics are straightforward. The new project would need both client **and** server sides, but russh's server API mirrors the client API closely.
|
||||
|
||||
### 2.2 russh (`/workspace/russh`)
|
||||
|
||||
Critical capabilities confirmed:
|
||||
|
||||
| Feature | API | Status |
|---------|-----|--------|
| Local port forwarding (client → server → remote) | `Handle::channel_open_direct_tcpip()` | Available, no feature flag |
| Remote port forwarding (server listens, client gets channels) | `Handle::tcpip_forward()` / Handler callback `server_channel_open_forwarded_tcpip()` | Available, no feature flag |
| Unix socket forwarding | `Handle::channel_open_direct_streamlocal()` / `Handle::streamlocal_forward()` | Available, no feature flag |
| Server-side reverse forwarding | `server::Handler::tcpip_forward()` / `server::Handle::forward_tcpip()` | Available, no feature flag |
| Arbitrary stream transport | `client::connect_stream()` / `server::run_stream()` | **Both accept `AsyncRead+AsyncWrite+Unpin+Send`** |
| Channel as bidirectional stream | `Channel::into_stream()` / `split()` | Available |
|
||||
|
||||
**The `connect_stream()` and `run_stream()` APIs are the key enabler for TLS wrapping.** They accept any async byte stream, meaning we can layer TLS (via `tokio-rustls`) underneath russh without modifying russh itself. The SSH session runs over a TLS stream, which looks like HTTPS to DPI.
|
||||
|
||||
## 3. Architecture Sketch
|
||||
|
||||
### 3.1 Components
|
||||
|
||||
```
┌─────────────────────────────────┐         ┌─────────────────────────────────┐
│             CLIENT              │         │             SERVER              │
│                                 │         │                                 │
│  ┌──────────┐    ┌───────────┐  │         │  ┌───────────┐    ┌──────────┐  │
│  │ TUN      │    │ SSH       │  │   SSH   │  │ SSH       │    │ Proxy    │  │
│  │ Interface│───▶│ Client    │──┼─ over ─▶│  │ Server    │───▶│ Handler  │  │
│  │ (tun-rs) │◀───│ (russh)   │  │   TLS   │  │ (russh)   │◀───│          │  │
│  └──────────┘    └─────┬─────┘  │  opt.   │  └─────┬─────┘    └────┬─────┘  │
│                        │        │         │        │               │        │
│                  ┌─────▼─────┐  │         │  ┌─────▼─────┐    ┌────▼─────┐  │
│                  │ TLS Layer │  │         │  │ TLS Layer │    │ Outbound │  │
│                  │ (tokio-   │  │         │  │ (tokio-   │    │ Proxy    │  │
│                  │  rustls)  │  │         │  │  rustls)  │    │ (SOCKS5/ │  │
│                  └─────┬─────┘  │         │  └─────┬─────┘    │  HTTP)   │  │
│                        │        │         │        │          └────┬─────┘  │
│                  ┌─────▼─────┐  │         │  ┌─────▼─────┐         │        │
│                  │ TCP       │  │         │  │ TCP       │    ┌────▼─────┐  │
│                  │ Connect   │◀─┼────────▶│  │ Listener  │    │ Direct   │  │
│                  └───────────┘  │         │  └───────────┘    │ Forward  │  │
│                                 │         │                   └────┬─────┘  │
└─────────────────────────────────┘         └─────────────────────────────────┘
                                                    │                │
                                               Proxy Mode       Direct Mode
                                               (outbound via    (outbound
                                               SOCKS5/HTTP)     direct TCP)
```
|
||||
|
||||
### 3.2 Data Flow — Client TUN Mode
|
||||
|
||||
1. **TUN interface** (created via `tun-rs`) captures IP packets from the OS routing table
|
||||
2. **Client reads IP packets** from the TUN device, determines destination IP:port
|
||||
3. **Client opens `direct-tcpip` SSH channel** to destination via `handle.channel_open_direct_tcpip(dest_ip, dest_port, ...)`
|
||||
4. **Client writes packet payload** to the SSH channel, reads response
|
||||
5. **Client writes response** back to TUN interface
|
||||
|
||||
This is essentially what tun2proxy does, except instead of SOCKS5 upstream, it's an SSH channel.
|
||||
|
||||
### 3.3 Data Flow — TLS Obfuscation Mode
|
||||
|
||||
When `--tls` or `--https` is specified:
|
||||
|
||||
1. **Client establishes TLS connection** to `server:443` using `tokio-rustls::TlsStream`
|
||||
2. **SSH session runs over the TLS stream** via `client::connect_stream(Arc::new(config), tls_stream, handler)`
|
||||
3. **Server accepts TLS connection**, then runs `server::run_stream(server_config, tls_stream, handler)`
|
||||
4. **To DPI, the traffic looks like HTTPS** — standard TLS handshake, then encrypted application data
|
||||
5. Optional: Server can present a legitimate-looking certificate and serve a fake nginx 404 to non-SSH probes (similar to https_proxy's stealth approach)
|
||||
|
||||
### 3.4 Data Flow — Server-Side Proxy Mode
|
||||
|
||||
When `--proxy` is specified on the server:
|
||||
|
||||
1. Client requests `channel_open_direct_tcpip(target_host, target_port, ...)`
|
||||
2. Server's `channel_open_direct_tcpip` handler checks ACLs
|
||||
3. Instead of connecting directly, server routes through a local SOCKS5/HTTP proxy
|
||||
4. This provides an additional hop for privacy — the SSH server's IP isn't exposed to the destination
|
||||
|
||||
### 3.5 CLI Interface Sketch
|
||||
|
||||
```bash
# Server — simplest mode (SSH only, port 22)
ghost serve --key /etc/ssh/ssh_host_ed25519_key

# Server — with TLS on port 443
ghost serve --key /etc/ssh/ssh_host_ed25519_key --tls --tls-cert /etc/ssl/cert.pem --tls-key /etc/ssl/key.pem

# Server — with TLS + outbound proxy
ghost serve --key /etc/ssh/ssh_host_ed25519_key --tls --tls-cert /etc/ssl/cert.pem --tls-key /etc/ssl/key.pem --proxy socks5://127.0.0.1:9050

# Client — TUN mode (routes all traffic through SSH tunnel)
ghost connect --server example.com:443 --tls --identity ~/.ssh/id_ed25519 --tun

# Client — Single port forward (like SSH -L)
ghost connect --server example.com:443 --tls --identity ~/.ssh/id_ed25519 --forward 5432:db.internal:5432

# Client — SOCKS5 proxy mode (local SOCKS5 that tunnels through SSH)
ghost connect --server example.com:443 --tls --identity ~/.ssh/id_ed25519 --socks5 1080
```
|
||||
|
||||
**Working name: `ghost`** (as in "ghost in the shell" — it's SSH, it's stealthy, it passes through walls). Or `shade`, `wraith`, `spectre`. Pick anything.
|
||||
|
||||
## 4. Key Technical Decisions & Unknowns Analysis
|
||||
|
||||
### 4.1 TUN Interface — SOLVED
|
||||
|
||||
**Library: `tun-rs` (v2, formerly `tun` crate)**
|
||||
|
||||
- Supports Linux, macOS, Windows (via wintun.dll), FreeBSD, OpenBSD, NetBSD, Android, iOS
|
||||
- Async API with `tokio` feature: `DeviceBuilder::new().build_async()`
|
||||
- Clean `recv()` / `send()` API — read IP packets, write IP packets
|
||||
- Already used in production by tun2proxy and similar projects
|
||||
- Supports hardware offload (TSO/GSO) on Linux for performance
|
||||
- No `CAP_NET_ADMIN` needed on some platforms when using `--unshare` namespace approach (tun2proxy pattern)
|
||||
|
||||
**This is a solved problem.** The `tun-rs` crate is mature, cross-platform, and async-native with tokio. The implementation is straightforward:
|
||||
|
||||
```rust
let dev = DeviceBuilder::new()
    .ipv4("10.0.0.1", 24, None)
    .mtu(1400)
    .build_async()?;

let mut buf = vec![0u8; 65536];
loop {
    let len = dev.recv(&mut buf).await?;
    // Parse IP header, determine destination
    // Open SSH channel to destination
    // Write response back to TUN
}
```
|
||||
|
||||
**Key consideration**: On Linux, creating a TUN device requires `CAP_NET_ADMIN` or root. The tun2proxy approach of using network namespaces (`--unshare`) is worth adopting for unprivileged operation.
|
||||
|
||||
### 4.2 SSH over TLS — SOLVED (architecturally)
|
||||
|
||||
**Approach: Layer TLS beneath SSH using russh's `connect_stream` / `run_stream`**
|
||||
|
||||
This is the critical insight. russh already decouples transport from protocol:
|
||||
|
||||
- `client::connect_stream(config, stream, handler)` — accepts any `AsyncRead + AsyncWrite + Unpin + Send`
|
||||
- `server::run_stream(config, stream, handler)` — same for server
|
||||
|
||||
This means:
|
||||
|
||||
```rust
// Client side
// `tls_connector` is a tokio_rustls::TlsConnector; `server_domain` is the
// rustls server name used for SNI and certificate verification.
let tcp_stream = TcpStream::connect((server_addr, server_port)).await?;
let tls_stream = tls_connector.connect(server_domain, tcp_stream).await?;
let handle = client::connect_stream(config, tls_stream, handler).await?;

// Server side
// `tls_acceptor` is a tokio_rustls::TlsAcceptor.
let (tcp_stream, addr) = tcp_listener.accept().await?;
let tls_stream = tls_acceptor.accept(tcp_stream).await?;
server::run_stream(config, tls_stream, handler).await?;
```
|
||||
|
||||
**No modification to russh is needed.** This is a clean layering.
|
||||
|
||||
**For HTTPS stealth**: The server can:
|
||||
1. Accept connections on port 443
|
||||
2. Present a valid TLS certificate (self-signed or Let's Encrypt via ACME)
|
||||
3. Non-SSH clients making HTTP requests get a normal-looking 404 response
|
||||
4. SSH clients speak SSH protocol directly after TLS handshake
|
||||
5. DPI sees standard HTTPS traffic since the TLS handshake is normal
|
||||
|
||||
The https_proxy project demonstrates this pattern well — stealth proxy returning fake nginx 404s to probes.
|
||||
|
||||
### 4.3 IP Packet Handling — NEEDS DESIGN
|
||||
|
||||
When using TUN mode, we're receiving raw IP packets. We need to:
|
||||
|
||||
1. **Parse IP headers** to determine destination IP and port
|
||||
2. **Track connection state** — map `(src_ip, src_port, dst_ip, dst_port)` to SSH channels
|
||||
3. **TCP reassembly** — handle segmentation, retransmission, etc.
|
||||
4. **ICMP handling** — respond to pings, handle unreachable destinations
|
||||
5. **DNS interception** — handle DNS queries that arrive at the TUN interface
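Steps 1 and 2 of the list above can be sketched with the standard library alone. The `FlowKey` type and `parse_ipv4_tcp` function are hypothetical names for this illustration; the sketch only handles unfragmented IPv4 packets carrying TCP and does no checksum validation, so it is a starting point rather than a replacement for a real stack such as smoltcp:

```rust
use std::net::{Ipv4Addr, SocketAddrV4};

/// Connection 4-tuple, usable as a HashMap key to map a flow to an SSH channel.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct FlowKey {
    pub src: SocketAddrV4,
    pub dst: SocketAddrV4,
}

/// Extract the 4-tuple from an IPv4 packet carrying TCP.
/// Returns None for non-IPv4, non-TCP, or truncated packets.
/// Assumption: the packet is not a non-initial fragment.
pub fn parse_ipv4_tcp(packet: &[u8]) -> Option<FlowKey> {
    // Minimum IPv4 header is 20 bytes; the high nibble of byte 0 is the version.
    if packet.len() < 20 || packet[0] >> 4 != 4 {
        return None;
    }
    // The low nibble is the header length in 32-bit words.
    let ihl = usize::from(packet[0] & 0x0f) * 4;
    if ihl < 20 || packet.len() < ihl + 4 {
        return None;
    }
    // Byte 9 is the protocol number; 6 means TCP.
    if packet[9] != 6 {
        return None;
    }
    let src_ip = Ipv4Addr::new(packet[12], packet[13], packet[14], packet[15]);
    let dst_ip = Ipv4Addr::new(packet[16], packet[17], packet[18], packet[19]);
    // The TCP header starts right after the IP header: source port, then destination port.
    let src_port = u16::from_be_bytes([packet[ihl], packet[ihl + 1]]);
    let dst_port = u16::from_be_bytes([packet[ihl + 2], packet[ihl + 3]]);
    Some(FlowKey {
        src: SocketAddrV4::new(src_ip, src_port),
        dst: SocketAddrV4::new(dst_ip, dst_port),
    })
}
```

A `HashMap<FlowKey, ChannelId>` (with whatever channel handle type the client ends up using) would then cover the connection-tracking step; reassembly, ICMP, and DNS remain the hard parts.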
|
||||
|
||||
This is the most complex part. Options:
|
||||
|
||||
**Option A: Use a userspace TCP/IP stack (smoltcp)**
|
||||
- Parse packets, but let a userspace stack handle TCP
|
||||
- Heavier dependency, but proven approach (what tun2proxy does with its own stack)
|
||||
- `smoltcp` is well-maintained, used in embedded and networking projects
|
||||
|
||||
**Option B: Raw packet forwarding with NAT**
|
||||
- Simpler conceptually — just NAT the packets, forward them through the SSH channel
|
||||
- Requires handling TCP state at the IP level (seq/ack manipulation, checksum recalculation)
|
||||
- More error-prone
|
||||
|
||||
**Option C: SOCKS5 proxy mode only (no TUN)**
|
||||
- Simplest to implement — just a local SOCKS5 server that forwards through SSH
|
||||
- Browsers, curl, and most apps can use SOCKS5
|
||||
- No root/CAP_NET_ADMIN needed
|
||||
- But: doesn't capture all traffic (UDP, DNS leaks, etc.)
|
||||
|
||||
**Recommendation**: Start with Option C (SOCKS5 proxy mode) as the minimal viable product. Add TUN mode (Option A with smoltcp) as an advanced feature. This matches how tun2proxy structures their project and is the pragmatic path.
|
||||
|
||||
### 4.4 SSH Server Authentication — STRAIGHTFORWARD
|
||||
|
||||
The server implementation needs:
|
||||
|
||||
- **Public key authentication** — primary method, matching standard SSH practices
|
||||
- **`authorized_keys` file support** — read `~/.ssh/authorized_keys` or a custom path
|
||||
- **Optional password authentication** — for convenience, but not recommended for production
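For the `authorized_keys` bullet above, a first pass at loading the file could look like the sketch below. It is deliberately simplified: `AuthorizedKey` and `parse_authorized_keys` are hypothetical names, the key material is kept as an undecoded base64 string, and lines that begin with options (such as `command="..."`) are skipped rather than parsed, which a production server would need to handle properly:

```rust
/// One entry from an authorized_keys file (simplified, illustrative type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuthorizedKey {
    pub key_type: String,
    pub key_base64: String,
    pub comment: Option<String>,
}

/// Parse authorized_keys contents, ignoring blank lines, comments,
/// and (in this sketch) any line that starts with an options field.
pub fn parse_authorized_keys(contents: &str) -> Vec<AuthorizedKey> {
    contents
        .lines()
        .filter_map(|line| {
            let line = line.trim();
            if line.is_empty() || line.starts_with('#') {
                return None;
            }
            let mut parts = line.split_whitespace();
            let key_type = parts.next()?;
            let key_base64 = parts.next()?;
            // Only accept lines whose first field looks like a key type.
            let looks_like_key_type = key_type.starts_with("ssh-")
                || key_type.starts_with("ecdsa-")
                || key_type.starts_with("sk-");
            if !looks_like_key_type {
                return None;
            }
            // Everything after the key is treated as a free-form comment.
            let rest: Vec<&str> = parts.collect();
            let comment = if rest.is_empty() { None } else { Some(rest.join(" ")) };
            Some(AuthorizedKey {
                key_type: key_type.to_string(),
                key_base64: key_base64.to_string(),
                comment,
            })
        })
        .collect()
}
```

The parsed entries would still need to be decoded into whatever public-key type the SSH library compares against during authentication.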
|
||||
|
||||
russh's `server::Handler` trait provides `auth_publickey` and `auth_password` callbacks. Implementation is trivial:
|
||||
|
||||
```rust
// russh handler callbacks return a Result, so the decision is wrapped in Ok(..).
async fn auth_publickey(
    &mut self,
    _user: &str,
    public_key: &PublicKey,
) -> Result<Auth, Self::Error> {
    // Accept only keys present in the loaded authorized_keys set.
    if self.authorized_keys.iter().any(|k| k == public_key) {
        Ok(Auth::Accept)
    } else {
        Ok(Auth::Reject { proceed_with_methods: None, partial_success: false })
    }
}
```
|
||||
|
||||
### 4.5 DNS Handling — DESIGN DECISION NEEDED
|
||||
|
||||
In TUN mode, DNS queries need to be routed through the tunnel. Options:
|
||||
|
||||
1. **Virtual DNS** (tun2proxy approach) — intercept DNS packets, map query names to fake IPs from a reserved range (198.18.0.0/15), resolve via the SSH tunnel
|
||||
2. **DNS-over-TCP** — Force DNS through the SSH tunnel
|
||||
3. **Direct DNS** — Don't handle DNS in the tunnel, rely on system resolver
|
||||
4. **SOCKS5 mode** — SOCKS5 supports DOMAIN names natively (SOCKS5h), so DNS resolution happens server-side
|
||||
|
||||
**Recommendation**: SOCKS5 mode handles DNS naturally via SOCKS5h. For TUN mode, adopt the virtual DNS approach from tun2proxy (their `ip-stack` crate handles this).
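The virtual DNS idea can be illustrated with a small allocator over the 198.18.0.0/15 range. This is a sketch of the concept only, not tun2proxy's implementation: `VirtualDns`, `assign`, and `lookup` are hypothetical names, and the sketch never expires or recycles addresses:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// First address of the reserved 198.18.0.0/15 benchmarking range.
const POOL_BASE: u32 = u32::from_be_bytes([198, 18, 0, 0]);
/// A /15 prefix leaves 17 host bits, so the pool holds 2^17 addresses.
const POOL_SIZE: u32 = 1 << 17;

/// Maps queried names to fake IPs and back, so that a later TCP connection
/// to a fake IP can be turned into a by-name connect through the tunnel.
#[derive(Debug)]
pub struct VirtualDns {
    by_name: HashMap<String, Ipv4Addr>,
    by_ip: HashMap<Ipv4Addr, String>,
    next_offset: u32,
}

impl VirtualDns {
    pub fn new() -> Self {
        // Start at offset 1 to skip the network address itself.
        Self { by_name: HashMap::new(), by_ip: HashMap::new(), next_offset: 1 }
    }

    /// Return the fake IP for `name`, allocating one on first use.
    /// Returns None once the pool is exhausted.
    pub fn assign(&mut self, name: &str) -> Option<Ipv4Addr> {
        let name = name.trim_end_matches('.').to_ascii_lowercase();
        if let Some(ip) = self.by_name.get(&name) {
            return Some(*ip);
        }
        if self.next_offset >= POOL_SIZE {
            return None;
        }
        let ip = Ipv4Addr::from(POOL_BASE + self.next_offset);
        self.next_offset += 1;
        self.by_name.insert(name.clone(), ip);
        self.by_ip.insert(ip, name);
        Some(ip)
    }

    /// Reverse lookup used when a packet arrives for a fake IP.
    pub fn lookup(&self, ip: Ipv4Addr) -> Option<&str> {
        self.by_ip.get(&ip).map(String::as_str)
    }
}
```

When the TUN side later sees a connection to a fake address, `lookup` recovers the hostname, which can be passed to the server so that resolution happens on the far side of the tunnel.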
|
||||
|
||||
### 4.6 Connection Multiplexing — ALREADY SOLVED
|
||||
|
||||
russh multiplexes channels over a single SSH connection. No need to manage multiple TCP connections per tunnel. One SSH connection, many channels. This is exactly what we want.
|
||||
|
||||
### 4.7 Keep-Alive and Reconnection — NEEDS DESIGN
|
||||
|
||||
- **SSH keepalive**: russh `Config` has `keepalive_interval` and `keepalive_max`
|
||||
- **Auto-reconnect**: Client should detect disconnection (`is_closed()`) and reconnect with exponential backoff
|
||||
- **TUN continuity**: When SSH reconnects, existing TCP connections through the tunnel will fail, but new ones will work. This is acceptable behavior (same as any VPN).
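The exponential backoff mentioned in the auto-reconnect bullet could be computed as in the sketch below. The `backoff_delay` function and its parameters are assumptions for illustration; jitter is omitted for brevity, although a real client would likely add some to avoid synchronized reconnect storms:

```rust
use std::time::Duration;

/// Delay before reconnect attempt number `attempt` (0-based):
/// base * 2^attempt, capped at `max`.
pub fn backoff_delay(attempt: u32, base: Duration, max: Duration) -> Duration {
    // checked_shl returns None once the shift would exceed the bit width,
    // in which case the factor saturates instead of wrapping.
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    // checked_mul guards against Duration overflow; any overflow means "use the cap".
    base.checked_mul(factor).map_or(max, |delay| delay.min(max))
}
```

With a 1-second base and a 60-second cap, the delays run 1 s, 2 s, 4 s, 8 s and so on until they level off at 60 s; the reconnect loop would reset `attempt` to zero after a successful session.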
|
||||
|
||||
### 4.8 Server-Side Proxy (Outbound) — STRAIGHTFORWARD
|
||||
|
||||
When `--proxy` is specified, the server's `channel_open_direct_tcpip` handler forwards through a local proxy:
|
||||
|
||||
```rust
// Pseudocode: the elided parameters and the `channel` value come from the
// russh handler callback. Exactly one of the three options is used per
// connection, selected by the server's --proxy setting.
async fn channel_open_direct_tcpip(
    &mut self,
    host: &str,
    port: u32,
    ...
) -> Result<Channel<Msg>, Self::Error> {
    // Option 1: Connect directly
    let stream = TcpStream::connect((host, port as u16)).await?;

    // Option 2: Connect through SOCKS5 proxy
    // let stream = connect_socks5(proxy_addr, host, port).await?;

    // Option 3: Connect through HTTP CONNECT proxy
    // let stream = connect_http_proxy(proxy_addr, host, port).await?;

    // Then bidirectional copy between SSH channel and stream
    Ok(channel)
}
```
|
||||
|
||||
SOCKS5 client implementation is simple (5-byte handshake, variable-length connect). HTTP CONNECT is also straightforward. Both can be implemented in a few hundred lines.
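To give a sense of how small the wire formats are, the sketch below builds the client-side bytes for both proxy types. The function names are hypothetical; the byte layout follows RFC 1928 for SOCKS5 with no authentication and a domain-name target. Reading and validating the proxy's replies is left out:

```rust
/// SOCKS5 client greeting: version 5, one method offered, "no authentication".
pub const SOCKS5_GREETING: [u8; 3] = [0x05, 0x01, 0x00];

/// Build a SOCKS5 CONNECT request addressed by domain name (ATYP = 0x03),
/// so the proxy performs the DNS resolution.
/// Returns None if the host is empty or longer than 255 bytes.
pub fn socks5_connect_request(host: &str, port: u16) -> Option<Vec<u8>> {
    if host.is_empty() || host.len() > 255 {
        return None;
    }
    let mut req = Vec::with_capacity(7 + host.len());
    // VER = 5, CMD = 1 (CONNECT), RSV = 0, ATYP = 3 (domain), then length-prefixed host.
    req.extend_from_slice(&[0x05, 0x01, 0x00, 0x03, host.len() as u8]);
    req.extend_from_slice(host.as_bytes());
    // Port in network byte order.
    req.extend_from_slice(&port.to_be_bytes());
    Some(req)
}

/// Build the request head for an HTTP CONNECT proxy (no proxy authentication).
pub fn http_connect_request(host: &str, port: u16) -> String {
    format!("CONNECT {0}:{1} HTTP/1.1\r\nHost: {0}:{1}\r\n\r\n", host, port)
}
```

A `connect_socks5` helper would write `SOCKS5_GREETING`, read the two-byte method reply, write the CONNECT request, and then check the reply code before handing the stream to the bidirectional copy.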
|
||||
|
||||
## 5. Dependency Assessment
|
||||
|
||||
| Dependency | Purpose | Maturity | Risk |
|------------|---------|----------|------|
| `russh` | SSH client & server | High (used in dispatch, well-maintained) | Low — already proven |
| `tun-rs` (v2) | TUN/TAP interface | High (cross-platform, prod-tested, bench'd at 70Gbps) | Low — well-maintained |
| `tokio-rustls` | TLS layer | High (standard Rust TLS) | Low — widely used |
| `rustls` | TLS implementation | High | Low — no ring dependency needed with aws-lc-rs |
| `smoltcp` | Userspace TCP/IP stack (TUN mode) | Medium-High | Medium — complex but well-proven |
| `clap` | CLI args | High | None |
| `tracing` | Structured logging | High | None |
| `anyhow/thiserror` | Error handling | High | None |
| `tokio` | Async runtime | High | None |
|
||||
|
||||
**No immature or risky dependencies.** Every crate is well-established with active maintenance.
|
||||
|
||||
## 6. Risk Assessment
|
||||
|
||||
### 6.1 Technical Risks
|
||||
|
||||
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| TUN mode complexity (TCP state, IP parsing) | Medium | Medium | Start with SOCKS5 mode; TUN is advanced feature |
| Cross-platform TUN differences | Medium | Medium | tun-rs handles most; `--unshare` for Linux privilege separation |
| TLS + SSH interaction edge cases | Low | Low | Both are well-tested; russh's `connect_stream` / `run_stream` abstracts transport |
| Performance under load | Low | Medium | russh multiplexes channels; tun-rs has benchmarked 35+ Gbps async |
| DPI detecting SSH banner over TLS | Medium | High | After TLS, the SSH banner ("SSH-2.0-...") is encrypted. But SNI reveals domain. Use `Config { anonymous: true }` to minimize fingerprint, or configure `client_id` to look like a web server. |
|
||||
|
||||
### 6.2 Protocol-Level Risks
|
||||
|
||||
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| SSH protocol fingerprinting (packet sizes, timing) | Medium | Medium | Pad messages, add random delays. russh doesn't do this natively — would need custom channel wrapping. |
| SNI leaks domain in TLS handshake | High | Low | Use an innocuous domain. Could also explore ECH (Encrypted Client Hello) in rustls if available. |
| Deep packet inspection identifying SSH patterns even over TLS | Low-Medium | Medium | The TLS layer prevents payload inspection. Only traffic analysis (sizes, timing) is possible. Padding and traffic shaping could help. |
| Countries blocking SSH traffic on port 22 | Already happening | N/A | That's the whole point — we run SSH over TLS on port 443 |
|
||||
|
||||
### 6.3 Usability Risks
|
||||
|
||||
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Requires self-hosted server | By design | Medium | Document simple deployment. Provide Docker image. Consider one-command install script. |
| Root/CAP_NET_ADMIN needed for TUN on Linux | High | Medium | Provide `--unshare` mode. SOCKS5 mode needs no privileges. |
| Certificate management for TLS mode | Medium | Low | Support self-signed certs, ACME (Let's Encrypt), or manual cert paths. |
|
||||
|
||||
## 7. Implementation Plan
|
||||
|
||||
### Phase 1: MVP (2-3 days)
|
||||
|
||||
**SOCKS5 proxy mode only. No TUN. Client + server.**
|
||||
|
||||
1. **Server binary** (`ghost serve`)
|
||||
- russh server implementation with public key auth
|
||||
- `channel_open_direct_tcpip` handler: connect to target directly or via outbound proxy
|
||||
- Optional TLS wrapping via `tokio-rustls` + `server::run_stream`
|
||||
- Config: listen address, host key path, authorized keys, TLS options, proxy options
|
||||
|
||||
2. **Client binary** (`ghost connect`)
|
||||
- russh client with public key auth
|
||||
- Local SOCKS5 server that forwards connections through SSH `channel_open_direct_tcpip`
|
||||
- Optional TLS wrapping via `tokio-rustls` + `client::connect_stream`
|
||||
- Config: server address, identity key, TLS options, SOCKS5 listen address
|
||||
|
||||
3. **Testing**
|
||||
- Integration test: client → server → HTTP target
|
||||
- Test with: `curl --socks5-hostname 127.0.0.1:1080 https://example.com`
|
||||
- Test TLS mode against DPI-like inspection
|
||||
|
||||
### Phase 2: Port Forwarding (1 day)
|
||||
|
||||
4. **Client: explicit port forwards** (`--forward local_port:remote_host:remote_port`)
|
||||
- Direct reimplementation of SSH `-L` and `-R`
|
||||
- Uses `channel_open_direct_tcpip` for local forwards
|
||||
- Uses `tcpip_forward` / handler callback for remote forwards
|
||||
|
||||
5. **Client: SOCKS5 with DNS** (SOCKS5h)
|
||||
- Domain names resolved server-side, not client-side
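Parsing the `--forward` argument from item 4 is a small, self-contained piece of the client. The sketch below assumes the `local_port:remote_host:remote_port` shape used in the CLI examples (such as `5432:db.internal:5432`); `ForwardSpec` and `parse_forward` are hypothetical names, and bind addresses and remote (`-R` style) forwards are out of scope here:

```rust
/// A local forward: listen on `local_port`, tunnel to `remote_host:remote_port`.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ForwardSpec {
    pub local_port: u16,
    pub remote_host: String,
    pub remote_port: u16,
}

/// Parse "local_port:remote_host:remote_port".
pub fn parse_forward(spec: &str) -> Result<ForwardSpec, String> {
    // Split off the first and last fields; whatever remains in the middle is the host.
    let (local, rest) = spec
        .split_once(':')
        .ok_or_else(|| format!("missing ':' in {:?}", spec))?;
    let (host, remote) = rest
        .rsplit_once(':')
        .ok_or_else(|| format!("missing remote port in {:?}", spec))?;
    if host.is_empty() {
        return Err(format!("empty remote host in {:?}", spec));
    }
    let local_port = local
        .parse::<u16>()
        .map_err(|e| format!("bad local port {:?}: {}", local, e))?;
    let remote_port = remote
        .parse::<u16>()
        .map_err(|e| format!("bad remote port {:?}: {}", remote, e))?;
    Ok(ForwardSpec { local_port, remote_host: host.to_string(), remote_port })
}
```

The resulting `remote_host` and `remote_port` are the values that would be handed to `channel_open_direct_tcpip` for each accepted local connection.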
|
||||
|
||||
### Phase 3: TUN Mode (2-3 days)
|
||||
|
||||
6. **Client: TUN interface mode** (`--tun`)
|
||||
- Create TUN device via `tun-rs`
|
||||
- IP packet routing through SSH channels
|
||||
- Either: raw packet forwarding (simpler, but fragile) or smoltcp integration (robust, but more code)
|
||||
- Recommend: use tun2proxy's `ip-stack` crate or similar for TCP reconstruction
|
||||
- Virtual DNS for TUN mode
|
||||
|
||||
7. **Privilege separation**
|
||||
- `--unshare` mode for Linux (create network namespace, unshare)
|
||||
- Document CAP_NET_ADMIN requirement
|
||||
|
||||
### Phase 4: Hardening & Polish (1-2 days)
|
||||
|
||||
8. **Obfuscation improvements**
|
||||
- SSH banner customization (`client_id` config)
|
||||
- Random padding in channel data
|
||||
- Traffic shaping / constant-rate padding (optional, advanced)
|
||||
|
||||
9. **Server stealth**
|
||||
- Non-SSH connection detection: serve fake nginx 404 on TLS port
|
||||
- Dual-protocol listener: HTTPS for browsers, SSH for ghost clients
|
||||
|
||||
10. **Auto-reconnect**
|
||||
- Exponential backoff reconnect on SSH session drop
|
||||
- TUN interface survives reconnect (new connections work, in-flight connections fail gracefully)
|
||||
|
||||
### Phase 5: Distribution (1 day)

11. **Build & packaging**
    - Static musl binary for Linux
    - Docker image
    - systemd unit file
    - One-line install script

## 8. Estimated Timeline

| Phase | Duration | Cumulative |
|-------|----------|------------|
| Phase 1: SOCKS5 MVP | 2-3 days | 2-3 days |
| Phase 2: Port Forwarding | 1 day | 3-4 days |
| Phase 3: TUN Mode | 2-3 days | 5-7 days |
| Phase 4: Hardening & Polish | 1-2 days | 6-9 days |
| Phase 5: Distribution | 1 day | 7-10 days |

With LLM-assisted development, the MVP (Phase 1) could realistically be done in 1-2 focused sessions, and the full feature set could follow in under a week.

## 9. Open Questions

1. **Project name** — `ghost`, `wraith`, `shade`, `spectre`, something else? Needs to be catchy, not conflict with existing Rust crates, and suggest stealth/mobility.

2. **TUN vs smoltcp** — Should TUN mode integrate smoltcp for a userspace TCP stack, or try the simpler "just forward packets and let the OS handle TCP" approach? Smoltcp is more work but more robust. tun2proxy's approach (which uses their own `ip-stack`) suggests userspace TCP is the way to go for reliability.

3. **TLS certificate story** — Should the server support ACME/Let's Encrypt auto-provisioning (like https_proxy does), or is manual cert management sufficient? Auto-provisioning is more user-friendly but adds significant complexity and a dependency on the ACME protocol.

4. **Mobile support** — Should we target iOS/Android eventually? tun-rs supports both via platform APIs, but mobile is a much bigger scope. Probably Phase 6+.

5. **Multi-user server** — Should the server support multiple simultaneous clients? russh's server model handles this naturally (each connection gets its own Handler instance), but access control (per-user ACLs, bandwidth limits) would add complexity.

6. **Crate structure** — Single binary with subcommands (`ghost serve`, `ghost connect`), or separate binaries? A single crate with `#[tokio::main]` dispatch seems cleanest for the MVP.

## 10. Conclusion

**This is feasible and straightforward.** The core mechanics — SSH tunnel via russh, TLS wrapping via tokio-rustls, TUN interface via tun-rs — are all solved problems with mature Rust libraries. The dispatch codebase proves russh is production-ready for this kind of work. The `connect_stream` / `run_stream` API in russh makes TLS wrapping a clean layering, not a hack.

The biggest design decision is TUN mode approach (raw packets vs. userspace TCP), and the recommendation is to start with SOCKS5 mode and add TUN later. This gives a working tool in 2-3 days that covers the primary use case (private tunneling that doesn't look like VPN traffic).

The project is well-scoped, the risk profile is low, and the existing tooling (russh, tun-rs, tokio-rustls) handles the hard parts. This is a "few days of focused work" estimate, not a "few weeks."

## 11. iroh Transport — Feasibility Addendum

### 11.1 The Insight

russh's `connect_stream()` and `server::run_stream()` accept **any** `AsyncRead + AsyncWrite + Unpin + Send` stream. iroh provides the pieces for one: a QUIC bidirectional stream (`open_bi()` / `accept_bi()`) whose `SendStream` implements `tokio::io::AsyncWrite` and whose `RecvStream` implements `tokio::io::AsyncRead`.

This means **iroh can serve as a transport layer beneath SSH**, the same way TLS can. The architecture becomes:

```
┌──────────────────────────────────────────────────┐
│                   APPLICATION                    │
│          (SOCKS5 / TUN / port-forward)           │
├──────────────────────────────────────────────────┤
│                   SSH (russh)                    │
│          channel_open_direct_tcpip/etc.          │
├──────────────────────────────────────────────────┤
│           Transport Layer (SWAPPABLE)            │
│                                                  │
│   ┌──────────┐  ┌──────────┐  ┌──────────────┐   │
│   │   TCP    │  │   TLS    │  │     iroh     │   │
│   │ (direct) │  │ (obfusc) │  │  (P2P QUIC)  │   │
│   └──────────┘  └──────────┘  └──────────────┘   │
└──────────────────────────────────────────────────┘
```

### 11.2 Why iroh is Compelling

iroh solves the **biggest deployment problem** with SSH tunnels: the server needs a public IP and open port.

With iroh as transport:

1. **No public IP needed** — Server and client both connect outbound to iroh's relay servers. Hole-punching attempts direct UDP in the background.
2. **No open firewall ports** — The server only needs outbound HTTPS to the relay. No inbound 22 or 443 required.
3. **NAT traversal for free** — iroh's relay + hole-punching means peers behind CGNAT or strict firewalls can still connect.
4. **Ed25519-based addressing** — Peers are identified by public key (EndpointId), no DNS or IP addresses needed.
5. **Built-in address discovery** — pkarr DNS records let you find a peer knowing only their public key.
6. **Still SSH underneath** — All the channel multiplexing, port forwarding, SOCKS5 logic still works. iroh is just the wire.

The use cases multiply:

- **Home server behind NAT**: No reverse proxy, no dynamic DNS, no port forwarding. Just run the server, share the EndpointId.
- **Temporary infrastructure**: Spin up a server anywhere (even behind corporate NAT), connect by public key.
- **Internal services**: Expose Postgres/Redis etc. over an SSH connection that traverses any NAT, no VPN required.
- **Censorship circumvention**: SSH over iroh QUIC via a relay that speaks standard HTTPS. While traffic is relayed, a deep packet inspector sees HTTPS to the relay server, not SSH; once hole-punching succeeds, it sees direct QUIC over UDP instead.

### 11.3 How It Works — The Code

The integration is trivially clean because both primitives implement the right traits:

**Client side:**
```rust
// Create iroh endpoint
let endpoint = Endpoint::builder(presets::N0)
    .alpns(vec![b"ghost-ssh/1".to_vec()])
    .bind()
    .await?;

// Connect to peer (no IP needed — just public key)
let addr = EndpointAddr::from_bytes(peer_id_bytes);
let conn = endpoint.connect(addr, b"ghost-ssh/1").await?;

// Open a bidirectional QUIC stream
let (send_stream, recv_stream) = conn.open_bi().await?;

// Combine into a single AsyncRead+AsyncWrite
let iroh_stream = tokio::io::join(recv_stream, send_stream);
// OR use a custom wrapper that implements AsyncRead+AsyncWrite

// Run SSH client over the iroh stream
let handle = client::connect_stream(
    Arc::new(client_config),
    iroh_stream,
    client_handler
).await?;
```

**Server side:**
```rust
// Create iroh endpoint
let endpoint = Endpoint::builder(presets::N0)
    .alpns(vec![b"ghost-ssh/1".to_vec()])
    .bind()
    .await?;

// Accept incoming connections
// (Sketch: sessions are handled one at a time here; a real server would
// spawn a task per connection and share the config and handler.)
while let Some(incoming) = endpoint.accept().await {
    let conn = incoming.await?;

    // For each connection, accept a bidirectional stream
    let (send_stream, recv_stream) = conn.accept_bi().await?;
    let iroh_stream = tokio::io::join(recv_stream, send_stream);

    // Run SSH server over the iroh stream
    server::run_stream(
        Arc::new(server_config),
        iroh_stream,
        server_handler
    ).await?;
}
```

**Or using iroh's Router + ProtocolHandler pattern:**
```rust
struct GhostSshProtocol;

impl ProtocolHandler for GhostSshProtocol {
    async fn accept(&self, connection: Connection) -> Result<(), AcceptError> {
        // iroh already handled connection acceptance
        // We can accept bi streams on the connection directly
        // Or: each SSH session could be a new bi stream on the same connection

        let (send, recv) = connection.accept_bi().await
            .map_err(AcceptError::from_err)?;
        // Same duplex combination as in the client and server examples
        let stream = tokio::io::join(recv, send);

        server::run_stream(server_config, stream, GhostHandler).await
            .map_err(AcceptError::from_err)
    }
}

let endpoint = Endpoint::builder(presets::N0).bind().await?;
let router = Router::builder(endpoint)
    .accept(b"ghost-ssh/1", GhostSshProtocol)
    .spawn();
```

### 11.4 Design Decision: One Stream per Session vs. One Connection with Multiple Streams

There are two ways to layer SSH over iroh:

**Option A: One QUIC bi-stream per SSH session**
- Each SSH session opens a new `open_bi()` stream under a single iroh `Connection`
- The iroh Connection itself persists (one QUIC connection per peer pair)
- Simpler: `open_bi()` gives you a stream, you feed it to `connect_stream()`
- Pro: Connection setup cost amortized. If SSH disconnects, `open_bi()` again is cheap.
- Con: Need to combine `RecvStream` + `SendStream` into a single `AsyncRead+AsyncWrite`

**Option B: One iroh Connection per SSH session (new QUIC connection each time)**
- Each SSH session = one `endpoint.connect()` + the whole connection
- Wasteful: QUIC handshake + iroh relay discovery each time
- Not recommended

**Recommendation: Option A.** One iroh `Connection` per peer pair, one `open_bi()` stream per SSH session. The connection is long-lived; SSH sessions can be re-established cheaply on the same QUIC connection.

### 11.5 Combining `RecvStream + SendStream` into `AsyncRead + AsyncWrite`

QUIC splits streams into separate send and receive halves. russh needs a single duplex stream. Two approaches:

**Approach 1: `tokio::io::join()` (simplest)**
```rust
use tokio::io::{self, AsyncRead, AsyncWrite};

fn join_iroh_stream(
    recv: iroh::endpoint::RecvStream,
    send: iroh::endpoint::SendStream,
) -> impl AsyncRead + AsyncWrite + Unpin + Send {
    // Reads are served by `recv`, writes by `send`
    io::join(recv, send)
}
```
`tokio::io::join` returns a `Join<A, B>` that implements both `AsyncRead` (from the first) and `AsyncWrite` (from the second). Since `RecvStream: AsyncRead` and `SendStream: AsyncWrite`, this works directly.

**Approach 2: Custom wrapper (more control)**
```rust
struct IrohStream {
    recv: iroh::endpoint::RecvStream,
    send: iroh::endpoint::SendStream,
}

impl AsyncRead for IrohStream { /* delegate to recv */ }
impl AsyncWrite for IrohStream { /* delegate to send */ }
```

**Recommendation: Start with `tokio::io::join`.** It's one line and has the right trait implementations. Only switch to a custom wrapper if profiling shows overhead (unlikely).

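The delegation that Approach 2 describes can be illustrated without any async runtime. The sketch below is a synchronous analog built on `std::io::Read` and `std::io::Write`; it is not the tokio or iroh API, and the `Duplex` type is a hypothetical name used only here. An async wrapper would follow the same shape, forwarding `poll_read` to the receive half and `poll_write`, `poll_flush`, and `poll_shutdown` to the send half.

```rust
use std::io::{self, Read, Write};

/// Duplex adapter: reads come from `R`, writes go to `W`.
pub struct Duplex<R, W> {
    reader: R,
    writer: W,
}

impl<R, W> Duplex<R, W> {
    pub fn new(reader: R, writer: W) -> Self {
        Self { reader, writer }
    }

    /// Give both halves back, for example to inspect what was written.
    pub fn into_parts(self) -> (R, W) {
        (self.reader, self.writer)
    }
}

impl<R: Read, W> Read for Duplex<R, W> {
    // Every read is forwarded to the receive half.
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.reader.read(buf)
    }
}

impl<R, W: Write> Write for Duplex<R, W> {
    // Every write and flush is forwarded to the send half.
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.writer.write(buf)
    }

    fn flush(&mut self) -> io::Result<()> {
        self.writer.flush()
    }
}
```

The two halves never share state, which is why the combined type needs no locking and why the one-line `join` is usually enough.
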
### 11.6 Relay Considerations

iroh provides two relay options:

1. **Default n0 relay servers** (`https://use1-1.relay.n0.iroh.network.`) — free, operated by n0. Good for getting started and testing.
2. **Self-hosted relay** (`iroh-relay` crate) — The relay server is part of the iroh project. Can be self-hosted for complete independence.

For this project:

- **Development/quick start**: Use n0 relays (they're free and reliable)
- **Production/privacy**: Self-host the relay server. It's a single binary (`iroh-relay`) that can run on any VPS. The relay sees only encrypted QUIC packets — it cannot read SSH traffic.
- **Paranoid**: Disable relay entirely. Both peers must have direct network connectivity. No third-party dependency.

The `RelayMode` enum handles this:
```rust
// Default n0 relays
let endpoint = Endpoint::builder(presets::N0).bind().await?;

// Self-hosted relay
let relay_map = RelayMap::from([(relay_url, Some(direct_addr))]);
let endpoint = Endpoint::builder(presets::Custom(relay_map)).bind().await?;

// No relay (direct only)
let endpoint = Endpoint::builder(presets::RelayDisabled).bind().await?;
```

### 11.7 Updated Architecture with iroh Transport

```
┌───────────────────────────────────────────────────────────┐
│                          CLIENT                           │
│                                                           │
│  ┌──────────┐    ┌───────────┐    ┌────────────────────┐  │
│  │ TUN /    │    │    SSH    │    │  Transport         │  │
│  │ SOCKS5 / │───▶│  Client   │───▶│  (selectable)      │  │
│  │ Port-    │    │  (russh)  │    │                    │  │
│  │ Forward  │    │           │    │ ┌────────────────┐ │  │
│  └──────────┘    └───────────┘    │ │ TCP direct     │ │  │
│                                   │ │ TLS (rustls)   │ │  │
│                                   │ │ iroh (QUIC)    │ │  │
│                                   │ └────────────────┘ │  │
│                                   └────────────────────┘  │
└───────────────────────────────────────────────────────────┘

┌───────────────────────────────────────────────────────────┐
│                          SERVER                           │
│                                                           │
│  ┌──────────┐    ┌───────────┐    ┌────────────────────┐  │
│  │ Outbound │    │    SSH    │    │  Transport         │  │
│  │ Proxy /  │◀───│  Server   │◀───│  (selectable)      │  │
│  │ Direct   │    │  (russh)  │    │                    │  │
│  │ Forward  │    │           │    │ ┌────────────────┐ │  │
│  └──────────┘    └───────────┘    │ │ TCP listener   │ │  │
│                                   │ │ TLS (rustls)   │ │  │
│                                   │ │ iroh (QUIC)    │ │  │
│                                   │ └────────────────┘ │  │
│                                   └────────────────────┘  │
└───────────────────────────────────────────────────────────┘

                    ┌──────────────┐
                    │  iroh Relay  │  (optional, for NAT)
                    │  (self-host  │
                    │   or n0)     │
                    └──────────────┘

Transport modes:
  --transport tcp       Direct TCP (default, simplest)
  --transport tls       TCP + TLS (obfuscation)
  --transport iroh      iroh QUIC (NAT traversal, no public IP)
  --transport iroh+tls  iroh QUIC + TLS (NAT traversal + obfuscation)
```

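The four `--transport` values listed above map naturally onto an enum. The following is a minimal sketch, assuming the CLI passes the flag value through as a plain string; `TransportKind` and `needs_public_address` are hypothetical names for this illustration and do not come from the project.

```rust
use std::str::FromStr;

/// One variant per `--transport` value from the mode list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TransportKind {
    Tcp,
    Tls,
    Iroh,
    IrohTls,
}

impl FromStr for TransportKind {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "tcp" => Ok(Self::Tcp),
            "tls" => Ok(Self::Tls),
            "iroh" => Ok(Self::Iroh),
            "iroh+tls" => Ok(Self::IrohTls),
            other => Err(format!(
                "unknown transport '{other}' (expected tcp, tls, iroh, or iroh+tls)"
            )),
        }
    }
}

impl TransportKind {
    /// The TCP-based modes need a reachable server address; the iroh modes
    /// address the peer by public key instead.
    pub fn needs_public_address(self) -> bool {
        matches!(self, Self::Tcp | Self::Tls)
    }
}
```

Parsing the flag once at startup, for example with `"iroh+tls".parse::<TransportKind>()`, keeps the rest of the code free of string comparisons and lets the compiler check that every mode is handled.
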
### 11.8 iroh Transport — Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| iroh API instability (it's v0.x) | Medium | Medium | Pin version; iroh's core stream API is stable (it's just QUIC) |
| Relay dependency for initial connectivity | Low | Low | Self-host relay; or direct-only mode for LAN |
| QUIC stream vs TCP semantics differences | Low | Medium | QUIC streams are reliable ordered byte streams, same semantics as TCP. russh won't know the difference. |
| Performance overhead of QUIC + SSH | Low | Low | QUIC is fast. Note that one SSH session rides a single QUIC stream, which is delivered in order just like TCP, so QUIC's cross-stream freedom from head-of-line blocking does not speed up an individual session. |
| iroh crate size / compile time | Low | Low | iroh pulls in quinn + rustls + lots of networking. But we already need rustls for TLS mode. The incremental cost is the QUIC stack. |

**Key observation**: Within a single stream, QUIC provides the same reliable, ordered byte-stream guarantees as TCP. russh's `connect_stream()` / `run_stream()` should therefore work over iroh QUIC streams without modification.

### 11.9 Updated CLI Sketch with iroh

```bash
# Server — iroh mode (no public IP needed!)
ghost serve --key ~/.ssh/id_ed25519 --transport iroh
# Prints endpoint ID: e.g., "abc123..."
# Clients connect using this ID

# Server — iroh mode with self-hosted relay
ghost serve --key ~/.ssh/id_ed25519 --transport iroh \
  --iroh-relay https://my-relay.example.com

# Client — connect via iroh (no IP needed!)
ghost connect --peer abc123def456... --transport iroh --socks5 1080

# Client — connect via iroh with TUN
ghost connect --peer abc123def456... --transport iroh --tun

# Client — traditional TCP mode (still works)
ghost connect --server 1.2.3.4:443 --transport tls --socks5 1080
```

### 11.10 Implementation Impact

Adding iroh as a transport option is **incremental** — it doesn't change the SSH layer at all:

1. **Transport trait**: Define a `Transport` trait that produces `Box<dyn AsyncRead + AsyncWrite + Unpin + Send>`:
   ```rust
   trait Transport {
       async fn connect(&self) -> Result<Box<dyn AsyncRead + AsyncWrite + Unpin + Send>>;
   }
   ```

2. **Three implementations**:
   - `TcpTransport` — plain TCP
   - `TlsTransport` — TCP + tokio-rustls
   - `IrohTransport` — iroh endpoint + `open_bi()` + `tokio::io::join(recv, send)`

3. **Server side**: Same trait, different direction:
   ```rust
   trait TransportAcceptor {
       async fn accept(&self) -> Result<Box<dyn AsyncRead + AsyncWrite + Unpin + Send>>;
   }
   ```

4. **The SSH layer never changes.** russh's `connect_stream()` / `run_stream()` takes the transport stream, and everything else stays the same.

### 11.11 Dependency Impact

| Dependency | Added? | Size concern |
|------------|--------|-------------|
| `iroh` (includes iroh-base) | Yes, feature-gated | Yes — pulls in QUIC stack, DNS, relay client |
| `n0-error` | Yes (small) | No |
| `tokio` | Already present | No |
| `rustls` | Already present (for TLS mode) | No |

**Recommendation**: Make iroh a feature flag (`--features iroh`) so the base install stays lean. Users who want P2P capability opt in:

```toml
[features]
default = ["tls"]
tls = ["tokio-rustls", "rustls-pemfile"]
iroh = ["dep:iroh"]
tun = ["dep:tun-rs", "dep:smoltcp"]
```

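On the Rust side, the feature flags above would gate which transports are compiled in. A minimal sketch, assuming the feature names `tls` and `iroh` from the manifest fragment; the function `compiled_transports` is a hypothetical helper, for example to back a `--version` or `--help` listing:

```rust
/// Names of the transports compiled into this binary.
pub fn compiled_transports() -> Vec<&'static str> {
    // Plain TCP has no optional dependency, so it is always available.
    let mut names = vec!["tcp"];
    // `cfg!` evaluates to a compile-time boolean for each Cargo feature.
    if cfg!(feature = "tls") {
        names.push("tls");
    }
    if cfg!(feature = "iroh") {
        names.push("iroh");
    }
    names
}
```

Code that actually references iroh types would sit behind `#[cfg(feature = "iroh")]` rather than `cfg!`, so that the crate still compiles when the dependency is absent.
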
### 11.12 The Compelling Narrative

With iroh as a transport option, this tool becomes something genuinely new:

- **Not just a VPN alternative** — it's a VPN alternative that doesn't need port forwarding, public IPs, or DNS records.
- **Not just SSH tunneling** — it's SSH tunneling that works between any two machines on the internet, regardless of NAT configuration.
- **Not just for censorship circumvention** — it's how you securely expose internal services (Postgres, Redis, admin panels) from machines behind corporate firewalls or home networks.

The "ghetto VPN" becomes a **zero-config mesh VPN**. Spin up `ghost serve` on any machine, share the public key, connect from anywhere. The relay server is optional (self-host or n0's free tier). And underneath it's just SSH, doing what SSH does best.

This isn't theoretical: the API compatibility is exact. iroh's `RecvStream` and `SendStream` implement `AsyncRead` and `AsyncWrite` respectively, and russh's `connect_stream` / `run_stream` accept any `AsyncRead + AsyncWrite` stream. One call to `tokio::io::join(recv, send)` and you have a transport stream that russh can use.

56
docs/research/ops/certbot.md
Normal file
@@ -0,0 +1,56 @@
# Certbot — dev1

## Overview

Let's Encrypt SSL certificates managed by certbot. Used by nginx for HTTPS.

## Installed

certbot (snap package on Ubuntu 24.04)

## Certificates

| Domain | Expiry | Path |
|--------|--------|------|
| git.alk.dev | 2026-06-18 | /etc/letsencrypt/live/git.alk.dev/ |

## File Locations

```
/etc/letsencrypt/live/git.alk.dev/
├── fullchain.pem    # Server cert + chain
├── privkey.pem      # Private key
├── cert.pem         # Server cert only
├── chain.pem        # Chain only
└── README
```

Renewal config: `/etc/letsencrypt/renewal/git.alk.dev.conf`

## Renewal

Certbot auto-renews via systemd timer. Certificates renew when <30 days remaining.

```bash
# Check certificates and expiry
sudo certbot certificates

# Dry run renewal
sudo certbot renew --dry-run

# Force renewal (if needed)
sudo certbot renew --force-renewal

# Reload nginx after renewal
sudo systemctl reload nginx
```

## Initial Certificate

If adding a new domain, obtain the cert with the standalone plugin, which runs its own temporary web server on port 80 (so nginx must not be listening on port 80 while it runs):

```bash
sudo certbot certonly --standalone -d <domain> --agree-tos -m <email>
```

Port 80 must be open for the ACME challenge. The api.alk.dev UFW rule allows HTTP for this purpose.

106
docs/research/ops/fail2ban.md
Normal file
@@ -0,0 +1,106 @@
# Fail2ban — dev1

## Status

Active. 7 jails. Uses `nftables` backend with `systemd` journal.

## Active Jails

| Jail | Port | Filter | Max Retry | Find Time | Ban Time | Log Source |
|------|------|--------|-----------|-----------|----------|------------|
| sshd | ssh | sshd | default (5) | default (10m) | default (10m) | systemd journal |
| gitea | ssh | gitea | 5 | 10m | 1h | journald (CONTAINER_NAME=gitea) |
| nginx-badbots | http,https | nginx-badbots | 5 | 10m | 1h | /var/log/nginx/access.log |
| nginx-botsearch | http,https | nginx-botsearch | default | default | default | /var/log/nginx/access.log |
| nginx-limit-req | http,https | nginx-limit-req | default | default | default | /var/log/nginx/error.log |
| nginx-401 | http,https | nginx-401 | 5 | 10m | 1h | /var/log/nginx/access.log |
| nginx-403 | http,https | nginx-403 | 10 | 10m | 30m | /var/log/nginx/access.log |

## Configuration

Default settings in `/etc/fail2ban/jail.d/defaults-debian.conf`:

```ini
[DEFAULT]
banaction = nftables
banaction_allports = nftables[type=allports]
backend = systemd
```

Jail configs in `/etc/fail2ban/jail.d/`:
- `gitea.conf` — Gitea jail with Docker journald log driver
- `nginx.conf` — nginx-related jails

## Gitea Jail Details

Gitea runs in Docker with the `journald` log driver. The fail2ban filter uses `journalmatch` to read only Gitea container logs:

```ini
[gitea]
enabled = true
port = ssh
filter = gitea
backend = systemd
journalmatch = CONTAINER_NAME=gitea
maxretry = 5
findtime = 10m
bantime = 1h
action = iptables-allports[chain="DOCKER-USER"]
```

The `DOCKER-USER` chain ensures bans affect Docker traffic.

## Custom Filters

Default install includes `gitea.conf`, `nginx-401.conf`, `nginx-403.conf` in `/etc/fail2ban/filter.d/`. Custom filter:

### nginx-badbots (`/etc/fail2ban/filter.d/nginx-badbots.conf`)

Catches malicious requests that the other nginx jails miss: `.env`/`.git` probes, PROPFIND/CONNECT abuse, common exploit paths (`/actuator`, `/cgi-bin`, `/ecp`, `/SDK`), and binary/garbage requests. Matches 400/404/405/413 status codes for known-bad path patterns only — legitimate 404s (e.g. wrong Gitea repo name) are not matched.

## Lesson Learned: Default Filters Miss Most Scanner Traffic

The default fail2ban nginx filters (`nginx-botsearch`, `nginx-401`, `nginx-403`, `nginx-limit-req`) only catch a narrow subset of malicious requests:

- **nginx-botsearch** only matches `<webmail|phpmyadmin|wordpress|cgi-bin|mysqladmin>` paths returning **404**. Misses `.env`, `.git/config`, `/actuator`, `/SDK`, `/ecp`, crypto mining RPC, PROPFIND/CONNECT abuse, and binary garbage — all of which return 400/405 instead of 404.
- **nginx-401/403** only trigger on those specific status codes. Most scanners get 400 or 405.
- **nginx-limit-req** only triggers when the rate limiter in nginx actually rejects a request.

**Result**: A site with heavy scanner traffic can show zero bans from all four default jails. The `nginx-badbots` custom filter closes this gap by matching known-bad path patterns across the 400/404/405/413 responses that scanners actually receive.

### Verifying Jail Coverage

When setting up fail2ban on a new host:

1. Install jails and filters first
2. Let traffic flow for a few hours
3. Run `sudo fail2ban-regex /var/log/nginx/access.log /etc/fail2ban/filter.d/<filter>.conf` to verify each filter matches expected lines
4. Check `sudo fail2ban-client status` to confirm jails show `Total failed > 0` — if any jail stays at 0 for hours on a public-facing host, the filter likely has a gap
5. Inspect logs manually: `awk '$9>=400' /var/log/nginx/access.log | awk '{print $9}' | sort | uniq -c | sort -rn` shows which status codes scanners are hitting

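The manual log inspection in step 5 can also be expressed in code, which may help when the check has to run inside a tool rather than a shell. The sketch below assumes nginx's default "combined" log format, where the status code is the ninth whitespace-separated field (the same assumption the `awk '$9>=400'` pipeline makes); `tally_error_statuses` is a hypothetical name for this illustration.

```rust
use std::collections::BTreeMap;

/// Count responses with status >= 400, most frequent first.
/// Mirrors `awk '$9>=400' | awk '{print $9}' | sort | uniq -c | sort -rn`.
pub fn tally_error_statuses<'a>(lines: impl IntoIterator<Item = &'a str>) -> Vec<(u16, usize)> {
    let mut counts: BTreeMap<u16, usize> = BTreeMap::new();
    for line in lines {
        // Field 9 (index 8) is the status code in the combined log format.
        let status = line
            .split_whitespace()
            .nth(8)
            .and_then(|field| field.parse::<u16>().ok());
        if let Some(code) = status.filter(|code| *code >= 400) {
            *counts.entry(code).or_insert(0) += 1;
        }
    }
    let mut sorted: Vec<(u16, usize)> = counts.into_iter().collect();
    // Highest count first; ties are broken by ascending status code.
    sorted.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    sorted
}
```

Lines that do not follow the expected layout, such as binary garbage requests whose fields shift, are skipped here rather than counted, so the totals are a lower bound.
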
### Adding the nginx-badbots Filter to a New Host

1. Copy `/etc/fail2ban/filter.d/nginx-badbots.conf` to the new host
2. Append the jail config to `/etc/fail2ban/jail.d/nginx.conf`:

   ```ini
   [nginx-badbots]
   enabled = true
   port = http,https
   filter = nginx-badbots
   logpath = /var/log/nginx/access.log
   maxretry = 5
   findtime = 10m
   bantime = 1h
   ```

3. `sudo fail2ban-client reload`

## Commands

```bash
sudo fail2ban-client status
sudo fail2ban-client status gitea
sudo fail2ban-client set gitea unbanip <IP>
sudo journalctl -u fail2ban -f
```