Add architecture specification for Rust/axum reverse proxy

Phase 1 architecture docs covering proxy handler, TLS termination (ACME + manual), TOML config with static/dynamic split (ArcSwap), and operations (rate limiting, logging, health check, systemd, graceful shutdown). Nine ADRs documenting key decisions: Rust/axum, custom proxy handler, TOML config, rustls-acme for cert management, tokio-rustls direct, token bucket rate limiting, custom log format for fail2ban, static/dynamic config split, and signal handling strategy. Includes threat landscape research documenting the nginx CVEs motivating this project.
2026-06-11 07:25:50 +00:00
parent 5c54a28822
commit 8ee6284b62
17 changed files with 1819 additions and 0 deletions
--- a/docs/architecture/decisions/001-rust-axum.md
+++ b/docs/architecture/decisions/001-rust-axum.md
@@ -0,0 +1,61 @@
+# ADR-001: Rust with Axum
+
+## Status
+
+Accepted
+
+## Context
+
+Our current nginx 1.24.0 installation is vulnerable to multiple actively exploited
+CVEs, most critically CVE-2026-42945 (CVSS 9.2, unauthenticated RCE via
+`ngx_http_rewrite_module`). Six of seven recent nginx CVEs are memory corruption
+bugs (buffer overflow, use-after-free, buffer over-read), the class of
+vulnerabilities that safe Rust rules out: such accesses are rejected at compile
+time or stopped by a runtime bounds check.
+
+The threat landscape is worsening: LLM-assisted fuzzing is accelerating bug
+discovery in nginx's C codebase, and security researchers report additional
+undisclosed vulnerabilities.
+
+We need to replace nginx with a memory-safe alternative that can handle:
+- TLS termination
+- HTTP reverse proxying to backend services
+- Rate limiting with fail2ban-compatible logging
+- Operational simplicity (single binary, systemd integration)
+
+## Decision
+
+Use Rust with the axum web framework for the reverse proxy implementation.
+
+**Rust** provides:
+- Memory safety in safe code (no buffer overflows, use-after-free, or
+  double-free; an out-of-bounds access panics instead of corrupting memory)
+- rustls (a TLS stack written in Rust) avoids the OpenSSL dependency and its
+  CVE history
+- Single static binary deployment with no runtime dependencies
+- Excellent async I/O support via tokio
+
+**axum** provides:
+- Ergonomic handler definitions with extractors
+- Tower middleware ecosystem (Service trait, layers)
+- Type-safe routing and state management
+- Well-maintained, widely used, good documentation
+
+## Consequences
+
+**Positive:**
+- Eliminates the entire class of memory corruption vulnerabilities affecting
+  nginx
+- Single binary deployment simplifies operations
+- Rust's type system catches many errors at compile time
+- axum + tower provides composable middleware
+
+**Negative:**
+- Smaller ecosystem than nginx for HTTP proxy features (but our use case is
+  simple)
+- We maintain the code (vs. using a battle-tested C project)
+- Less granular control over HTTP/2 and connection pooling compared to nginx
+- Team needs Rust expertise (already available)
+
+## References
+
+- [threat-landscape.md](../../research/threat-landscape.md)
+- [overview.md](../overview.md)
--- a/docs/architecture/decisions/002-custom-proxy-handler.md
+++ b/docs/architecture/decisions/002-custom-proxy-handler.md
@@ -0,0 +1,56 @@
+# ADR-002: Custom Proxy Handler
+
+## Status
+
+Accepted
+
+## Context
+
+We need to implement HTTP reverse proxying — receiving requests and forwarding
+them to an upstream service (Gitea on localhost:3000). Two approaches are
+available:
+
+1. **`axum-reverse-proxy` crate**: Provides path-based routing, header
+   forwarding, round-robin load balancing, TLS support, retry mechanisms, and
+   RFC 9110 compliance.
+2. **Custom handler** (Felix Knorr pattern): Build a handler using hyper's
+   `Client` (in hyper 1.x, `hyper_util::client::legacy::Client`) to forward
+   requests. ~50-100 lines of Rust for our needs.
+
+Our use case is minimal: single upstream per domain, single domain, no load
+balancing, no retry, no HTTP/2 proxying.
+
+## Decision
+
+Implement a custom proxy handler using hyper's `Client` for request forwarding,
+following the pattern demonstrated by Felix Knorr and used in the alknet
+project's channel proxy.
+
+## Rationale
+
+- `axum-reverse-proxy` adds complexity we don't need (load balancing, retry,
+  path-based routing to multiple backends)
+- Our proxy case is the simplest possible: match a Host header, forward the
+  entire request to a single upstream, stream the response back
+- The Felix Knorr pattern is proven, idiomatic, and ~50-100 lines
+- We maintain full control over header injection, error handling, and upstream
+  connection behavior
+- If requirements grow, we can adopt `axum-reverse-proxy` later
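+
+The host-matching step can be sketched with the standard library alone.
+`normalize_host`, `select_upstream`, and the `git.example.com` hostname are
+hypothetical names used for illustration; the forwarding itself (hyper client,
+response streaming) is omitted.
+
+```rust
+use std::collections::HashMap;
+
+/// Lowercases a Host header value and strips an optional `:port` suffix.
+fn normalize_host(host: &str) -> String {
+    let host = host.trim();
+    let without_port = if host.starts_with('[') {
+        // IPv6 literal such as "[::1]:443": keep everything up to the ']'.
+        match host.find(']') {
+            Some(end) => &host[..=end],
+            None => host,
+        }
+    } else {
+        match host.rsplit_once(':') {
+            Some((name, _port)) => name,
+            None => host,
+        }
+    };
+    without_port.to_ascii_lowercase()
+}
+
+/// Returns the upstream address for a Host header, or None for unknown hosts.
+fn select_upstream<'a>(
+    sites: &'a HashMap<String, String>,
+    host_header: &str,
+) -> Option<&'a str> {
+    sites.get(&normalize_host(host_header)).map(String::as_str)
+}
+```
+
+A request whose Host header maps to no configured site yields `None`, which
+the handler can turn into an error response instead of forwarding.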
+
+## Consequences
+
+**Positive:**
+- Minimal dependencies
+- Full control over proxy behavior
+- Easy to understand and audit (~100 lines of proxy code)
+- No unnecessary abstraction layers
+
+**Negative:**
+- We implement and maintain proxy logic ourselves (but it's trivial for our
+  use case)
+- If requirements grow to load balancing or retry, we'd need to add that
+  ourselves or switch to `axum-reverse-proxy`
+
+## References
+
+- [proxy.md](../proxy.md)
+- Felix Knorr, "Replacing nginx with axum" (felix-knorr.net/posts/2024-10-13-replacing-nginx-with-axum.html)
--- a/docs/architecture/decisions/003-toml-config.md
+++ b/docs/architecture/decisions/003-toml-config.md
@@ -0,0 +1,44 @@
+# ADR-003: TOML Configuration Format
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy needs a configuration file format for defining sites, TLS settings,
+bind addresses, and rate limits. Options include TOML, YAML, JSON, and custom
+binary formats.
+
+## Decision
+
+Use TOML as the configuration file format.
+
+## Rationale
+
+- **Rust-native**: TOML is the configuration format for Cargo (Rust's package
+  manager). The Rust ecosystem has first-class TOML support via `serde` +
+  `toml` crate.
+- **Unambiguous**: TOML values are explicitly typed (strings must be quoted,
+  booleans are only `true`/`false`), unlike YAML, which has surprising type
+  coercion rules (e.g., unquoted `no` → boolean, unquoted `1.0` → float).
+- **Human-friendly**: TOML is easy to read and write for simple configurations
+  like ours. It supports sections (tables), arrays, and inline tables.
+- **Good error messages**: The `toml` crate provides clear deserialization
+  error messages pointing to the exact field that failed.
+
+## Consequences
+
+**Positive:**
+- Familiar to Rust developers (Cargo.toml)
+- Clear, unambiguous syntax
+- Excellent serde integration with detailed error reporting
+- No type coercion surprises
+
+**Negative:**
+- Not as widely used for config outside Rust (but our audience is ourselves)
+- No `#include` or file composition (each config file is self-contained)
+
+## References
+
+- [config.md](../config.md)
--- a/docs/architecture/decisions/004-rustls-acme.md
+++ b/docs/architecture/decisions/004-rustls-acme.md
@@ -0,0 +1,67 @@
+# ADR-004: ACME-Primary Certificate Management
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy needs TLS certificates for HTTPS. Two approaches are available:
+
+1. **certbot (external ACME client)**: Run certbot as a cron job or systemd
+   timer to obtain and renew certificates. The proxy loads certificates from
+   files on disk. Renewal requires either SIGHUP/restart or inotify file
+   watching to pick up new certs.
+
+2. **rustls-acme (built-in ACME client)**: The proxy handles ACME
+   certificate provisioning and renewal internally as a background task. No
+   external certbot dependency. The `ResolvesServerCertAcme` cert resolver
+   automatically serves the correct certificate and updates when renewed.
+
+The alknet project has successfully implemented the rustls-acme approach, and
+its patterns are directly reusable.
+
+## Decision
+
+Use `rustls-acme` as the primary certificate management mode, with manual
+certificate paths as a fallback mode for testing, self-signed certs, and
+corporate CA environments.
+
+## Rationale
+
+- **Eliminates certbot dependency**: No external cron job, no deploy hooks, no
+  certbot package to install and maintain. The proxy is self-contained.
+- **Automatic renewal**: `rustls-acme` runs as a background tokio task that
+  handles certificate provisioning and renewal automatically (~30 days before
+  expiry).
+- **No restart needed**: When `rustls-acme` provisions a new certificate, the
+  `ResolvesServerCertAcme` resolver updates atomically. No SIGHUP, no restart,
+  no file watching.
+- **Proven pattern**: alknet uses the same approach successfully.
+- **Cache persistence**: `DirCache` persists ACME state between restarts,
+  avoiding re-provisioning.
+- **Fallback mode**: Manual cert paths are still supported for environments
+  where ACME is not possible.
+
+## Consequences
+
+**Positive:**
+- Single binary deployment (no certbot dependency)
+- Zero-downtime certificate renewal
+- Simpler operational model (no certbot cron, no deploy hooks)
+- Proven in alknet
+
+**Negative:**
+- `rustls-acme` is an additional dependency
+- ACME challenges require either port 80 (HTTP-01) or TLS-ALPN-01 on port 443,
+  which our proxy already listens on
+- Less control over certificate issuance compared to certbot (e.g., no DNS-01
+  challenge support, though rustls-acme supports TLS-ALPN-01 which is sufficient
+  for our use case)
+- Manual mode requires restart for cert changes (acceptable for fallback)
+
+## References
+
+- [tls.md](../tls.md)
+- alknet ADR-008: ACME/Let's Encrypt decision
+- `rustls-acme` crate: https://github.com/FlorianUekermann/rustls-acme
--- a/docs/architecture/decisions/005-tokio-rustls-direct.md
+++ b/docs/architecture/decisions/005-tokio-rustls-direct.md
@@ -0,0 +1,65 @@
+# ADR-005: tokio-rustls Directly, Not axum-server
+
+## Status
+
+Accepted
+
+## Context
+
+We need to serve HTTPS (TLS) traffic through axum. Two approaches exist for
+integrating TLS with axum:
+
+1. **`axum-server`**: A wrapper that provides TLS support for axum via its
+   `tls-rustls` feature. Handles TCP binding, TLS accept, and passing TLS
+   streams to axum. Simple API but limited control over the TLS configuration.
+
+2. **`tokio-rustls` directly**: Bind TCP manually, perform TLS handshake with
+   `TlsAcceptor`, then serve the TLS stream to axum/hyper. More code but full
+   control over `ServerConfig`, cipher suites, ALPN protocols, and cert
+   resolvers.
+
+The alknet project uses tokio-rustls directly and has proven this pattern for
+both manual and ACME certificate management.
+
+## Decision
+
+Use `tokio-rustls` directly for TLS termination, with `hyper` serving the
+resulting TLS streams to axum. Do not use `axum-server`.
+
+## Rationale
+
+- **ACME integration**: The `rustls-acme` `ResolvesServerCertAcme` resolver
+  needs to be set as the certificate resolver on `ServerConfig` via
+  `with_cert_resolver()`. `axum-server` does not expose this level of control
+  over the `ServerConfig`.
+- **Cipher suite control**: We may need to configure cipher suites beyond the
+  defaults (see OQ-01). `axum-server` wraps the `ServerConfig` construction
+  and may not expose `CryptoProvider` configuration. Direct `tokio-rustls`
+  usage gives us full control.
+- **ALPN configuration**: ACME TLS-ALPN-01 challenge requires adding
+  `acme-tls/1` to the ALPN protocol list. This is only possible with direct
+  `ServerConfig` access.
+- **Proven pattern**: alknet uses exactly this approach (`TlsAcceptor` wrapping
+  `tokio-rustls`, with manual or ACME `ServerConfig` construction).
+- **No abstraction cost**: The code to bind TCP, accept TLS, and serve to
+  axum/hyper is ~50 lines. `axum-server` saves little for our simple case.
+
+## Consequences
+
+**Positive:**
+- Full control over TLS configuration
+- Direct `rustls-acme` integration
+- Ability to add ALPN protocols for ACME challenges
+- Proven pattern from alknet
+
+**Negative:**
+- Slightly more code than `axum-server` (~50 lines for the TLS acceptor loop)
+- Need to manage the TCP listener and TLS accept explicitly
+- Must handle the `TlsStream<TcpStream>` → `hyper::service_fn` → axum
+  integration manually (well-documented pattern from Felix Knorr's blog and
+  alknet)
+
+## References
+
+- [tls.md](../tls.md)
+- alknet transport layer (`alknet-core/src/transport/tls.rs`, `alknet-core/src/transport/acme.rs`)
--- a/docs/architecture/decisions/006-rate-limiting-approach.md
+++ b/docs/architecture/decisions/006-rate-limiting-approach.md
@@ -0,0 +1,77 @@
+# ADR-006: Token Bucket Rate Limiting with In-Memory State
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy must enforce request rate limits per client IP address, replacing
+nginx's `limit_req_zone` directive. Rate limiting is critical for preventing
+abuse and for fail2ban integration (rate-limited requests trigger fail2ban
+actions).
+
+Several rate limiting approaches exist:
+- **Token bucket**: Tokens accumulate at a fixed rate; each request consumes a
+  token. Allows short bursts up to the bucket capacity.
+- **Leaky bucket**: Requests are processed at a fixed rate; excess requests
+  queue or are rejected. No burst allowance.
+- **Fixed window**: Count requests in fixed time windows (e.g., per minute).
+  Allows burst at window boundaries.
+- **Sliding window**: Count requests in a rolling time window. More accurate
+  than fixed window but more complex.
+
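+The current nginx config uses `limit_req zone=gitea_limit burst=20 nodelay`.
+nginx documents `limit_req` as a leaky bucket; with `burst=20 nodelay` it
+admits bursts immediately, which behaves much like a token bucket with a
+burst allowance.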
+
+For state storage:
+- **In-memory HashMap**: Fast, no external dependencies, lost on restart.
+- **External store (Redis, etc.)**: Shared across instances, persists across
+  restarts. Adds operational complexity.
+- **tower-governor crate**: Pre-built rate limiting middleware. Uses the
+  generic cell rate algorithm (GCRA). Adds dependency.
+
+## Decision
+
+Use a token bucket algorithm with in-memory `HashMap<IpAddr, TokenBucket>`
+state, protected by `tokio::sync::Mutex`. Rate limiting runs as axum middleware
+before the proxy handler.
+
+Rate limits are global per-IP (not per-site) in Phase 1. Per-site rate limits
+may be added in Phase 2 as the config model evolves.
+
+Stale entries in the HashMap are cleaned up periodically. A background task
+scans the HashMap at a configurable interval (default: 60 seconds) and removes
+entries that haven't been accessed within the cleanup interval.
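+
+A minimal sketch of the bucket logic described above, using only the standard
+library. `RateLimiter`, `check`, and `cleanup` are hypothetical names, a new
+bucket is assumed to start full, and the current time is passed in explicitly
+so the logic can be tested. The real middleware would wrap this state in the
+mutex and read its limits from the config.
+
+```rust
+use std::collections::HashMap;
+use std::net::IpAddr;
+use std::time::{Duration, Instant};
+
+/// Per-client state: tokens refill at `rate` per second, capped at `burst`.
+struct TokenBucket {
+    tokens: f64,
+    last_seen: Instant,
+}
+
+/// Hypothetical limiter holding one bucket per client IP.
+struct RateLimiter {
+    rate: f64,
+    burst: f64,
+    buckets: HashMap<IpAddr, TokenBucket>,
+}
+
+impl RateLimiter {
+    fn new(rate: f64, burst: f64) -> Self {
+        Self { rate, burst, buckets: HashMap::new() }
+    }
+
+    /// Returns true if the request is admitted, false if it should get a 429.
+    fn check(&mut self, ip: IpAddr, now: Instant) -> bool {
+        let (rate, burst) = (self.rate, self.burst);
+        let bucket = self
+            .buckets
+            .entry(ip)
+            .or_insert(TokenBucket { tokens: burst, last_seen: now });
+        // Refill for the time elapsed since the last request, up to the cap.
+        let elapsed = now.saturating_duration_since(bucket.last_seen).as_secs_f64();
+        bucket.tokens = (bucket.tokens + elapsed * rate).min(burst);
+        bucket.last_seen = now;
+        if bucket.tokens >= 1.0 {
+            bucket.tokens -= 1.0;
+            true
+        } else {
+            false
+        }
+    }
+
+    /// Drops buckets that have been idle for at least `max_idle`.
+    fn cleanup(&mut self, now: Instant, max_idle: Duration) {
+        self.buckets
+            .retain(|_, b| now.saturating_duration_since(b.last_seen) < max_idle);
+    }
+}
+```
+
+With `rate = 10.0` and `burst = 20.0`, a new client can send 20 requests
+back-to-back, after which requests are admitted at 10 per second.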
+
+## Rationale
+
+- Token bucket closely approximates nginx's `limit_req burst` behavior with
+  `nodelay`, keeping client-visible limits similar during migration.
+- In-memory state is sufficient for a single-instance proxy (no shared state
+  needed).
+- `tokio::sync::Mutex` (not `std::sync::Mutex`) yields to the runtime instead
+  of blocking the worker thread while waiting, and its guard can be held
+  across await points if the critical section ever needs to await.
+- Custom implementation gives full control over logging output for fail2ban
+  integration (ADR-007).
+- State loss on restart is acceptable — rate limit state is inherently
+  ephemeral.
+
+## Consequences
+
+**Positive:**
+- Behavioral compatibility with nginx rate limiting
+- Full control over fail2ban log format
+- No external dependencies (Redis, etc.)
+- Simple implementation (~100 lines)
+
+**Negative:**
+- Rate limit state is lost on restart (acceptable for single-instance deploy)
+- Not suitable for multi-instance deployments without external state store
+  (Phase 1 is single-instance)
+- HashMap grows between cleanup passes, so a flood of distinct source IPs can
+  inflate memory until the next pass (mitigated by periodic cleanup)
+
+## References
+
+- [operations.md](../operations.md)
+- nginx `limit_req` documentation
--- a/docs/architecture/decisions/007-custom-log-format.md
+++ b/docs/architecture/decisions/007-custom-log-format.md
@@ -0,0 +1,67 @@
+# ADR-007: Custom Structured Log Format for Fail2ban
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy needs to produce log output that fail2ban can parse to detect and ban
+abusive IP addresses. The current nginx setup uses nginx's default log format
+with standard fail2ban filters.
+
+Options for fail2ban integration:
+- **nginx-compatible format**: Replicate nginx's log format so existing
+  fail2ban filters work unchanged. Couples us to nginx's format.
+- **Custom structured format**: Design a clean, parseable format with a
+  corresponding custom fail2ban filter. Gives us control and clarity.
+- **JSON format**: Machine-readable but harder for fail2ban regex matching.
+
+## Decision
+
+Use a custom structured log format with a corresponding custom fail2ban filter.
+
+The format for rate-limited requests:
+
+```
+RATE_LIMIT client_ip=<IP> host=<host> path=<path> status=429
+```
+
+The format for general access logs:
+
+```
+REQUEST client_ip=<IP> host=<host> method=<METHOD> path=<path> status=<code> upstream=<addr> duration_ms=<ms>
+```
+
+A corresponding fail2ban filter (`/etc/fail2ban/filter.d/reverse-proxy.conf`)
+uses regex matching on the `RATE_LIMIT` prefix and `client_ip=<HOST>` field.
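+
+The two line formats above could be produced by helpers along these lines.
+This is a sketch: `sanitize`, `rate_limit_line`, and `request_line` are
+hypothetical names, and replacing whitespace and control characters with `_`
+is an assumption added here so that client-supplied values cannot break the
+`key=value` layout.
+
+```rust
+use std::net::IpAddr;
+
+/// Replaces whitespace and control characters (assumed policy) so a value
+/// cannot split a field or inject an extra log line.
+fn sanitize(value: &str) -> String {
+    value
+        .chars()
+        .map(|c| if c.is_whitespace() || c.is_control() { '_' } else { c })
+        .collect()
+}
+
+/// Formats the line that the fail2ban filter matches on.
+fn rate_limit_line(client_ip: IpAddr, host: &str, path: &str) -> String {
+    format!(
+        "RATE_LIMIT client_ip={} host={} path={} status=429",
+        client_ip,
+        sanitize(host),
+        sanitize(path)
+    )
+}
+
+/// Formats a general access log line.
+fn request_line(
+    client_ip: IpAddr,
+    host: &str,
+    method: &str,
+    path: &str,
+    status: u16,
+    upstream: &str,
+    duration_ms: u64,
+) -> String {
+    format!(
+        "REQUEST client_ip={} host={} method={} path={} status={} upstream={} duration_ms={}",
+        client_ip,
+        sanitize(host),
+        sanitize(method),
+        sanitize(path),
+        status,
+        upstream,
+        duration_ms
+    )
+}
+```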
+
+## Rationale
+
+- Custom format is clear, unambiguous, and self-documenting
+- No coupling to nginx's format, which may change or include fields we don't
+  produce
+- `key=value` pairs are easy to parse with regex and easy to extend
+- The `RATE_LIMIT` prefix makes it trivial to distinguish rate-limit events
+  from other logs
+- Writing a custom fail2ban filter is straightforward (5 lines of config)
+- We control both sides (the proxy and the filter), so compatibility is
+  guaranteed
+
+## Consequences
+
+**Positive:**
+- Clean, purpose-built format
+- Easy to extend with new fields
+- No dependency on nginx log format
+- Custom fail2ban filter is simple to maintain
+
+**Negative:**
+- Cannot reuse existing nginx fail2ban filters (trivial to write our own)
+- Existing fail2ban configurations need updating (acceptable since we're
+  replacing nginx entirely)
+
+## References
+
+- [operations.md](../operations.md)
+- [open-questions.md](../open-questions.md) OQ-02 (now resolved)
--- a/docs/architecture/decisions/008-static-dynamic-config-split.md
+++ b/docs/architecture/decisions/008-static-dynamic-config-split.md
@@ -0,0 +1,76 @@
+# ADR-008: Static/Dynamic Configuration Split with ArcSwap
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy needs configuration that can be partially reloaded at runtime (site
+definitions, rate limits) without restarting the process and dropping active
+connections. However, some configuration (bind addresses, TLS mode) fundamentally
+requires creating new listeners and cannot be changed at runtime.
+
+Two approaches:
+- **Full restart for all config changes**: Simple, but requires dropping
+  active connections for every change, including trivial rate limit adjustments.
+- **Static/dynamic split**: Immutable parameters (bind address, TLS mode) in a
+  `StaticConfig` that requires restart; runtime-adjustable parameters (sites,
+  rate limits) in a `DynamicConfig` that can be atomically swapped via
+  `Arc<ArcSwap<DynamicConfig>>` without dropping connections.
+
+This pattern is proven in the alknet project, which uses the same
+`ArcSwap<DynamicConfig>` approach for auth policy, forwarding rules, and rate
+limits.
+
+## Decision
+
+Split configuration into `StaticConfig` (immutable after startup) and
+`DynamicConfig` (hot-reloadable via `ArcSwap`). The split is:
+
+**StaticConfig** (restart required):
+- Bind address, HTTP port, HTTPS port
+- TLS mode (ACME vs. manual), cert paths, ACME settings
+- Log level and format
+
+**DynamicConfig** (hot-reloadable via SIGHUP):
+- Site definitions (hostname → upstream mappings)
+- Rate limits (requests per second, burst)
+- Body size limits
+
+`ConfigReloadHandle` provides a `reload(DynamicConfig)` method that atomically
+swaps the entire config. All request handlers read `DynamicConfig` via
+`ArcSwap::load()` — a lock-free operation.
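+
+The validate-then-swap flow can be sketched as follows. To stay within the
+standard library, this sketch uses `RwLock<Arc<DynamicConfig>>` as a stand-in
+for `ArcSwap`, so unlike the real design its reads are not lock-free. The
+field names, the `validate` rules, and the `Result<(), String>` return type
+are assumptions.
+
+```rust
+use std::collections::HashMap;
+use std::sync::{Arc, RwLock};
+
+#[derive(Debug)]
+struct DynamicConfig {
+    sites: HashMap<String, String>, // hostname -> upstream address
+    requests_per_second: u32,
+    burst: u32,
+}
+
+/// Stand-in for the ArcSwap-backed handle; `arc_swap` would replace the RwLock.
+struct ConfigReloadHandle {
+    current: RwLock<Arc<DynamicConfig>>,
+}
+
+/// Hypothetical validation rules; the real checks are not specified here.
+fn validate(cfg: &DynamicConfig) -> Result<(), String> {
+    if cfg.sites.is_empty() {
+        return Err("no sites defined".to_string());
+    }
+    if cfg.requests_per_second == 0 || cfg.burst == 0 {
+        return Err("rate limits must be non-zero".to_string());
+    }
+    Ok(())
+}
+
+impl ConfigReloadHandle {
+    fn new(initial: DynamicConfig) -> Self {
+        Self { current: RwLock::new(Arc::new(initial)) }
+    }
+
+    /// Returns a snapshot; a request keeps it even if a reload happens later.
+    fn load(&self) -> Arc<DynamicConfig> {
+        self.current.read().unwrap().clone()
+    }
+
+    /// Swaps in `next` only if it validates; otherwise the old config stays.
+    fn reload(&self, next: DynamicConfig) -> Result<(), String> {
+        validate(&next)?;
+        *self.current.write().unwrap() = Arc::new(next);
+        Ok(())
+    }
+}
+```
+
+Because handlers hold an `Arc` snapshot, an in-flight request sees one
+consistent config from start to finish, whichever side of the swap it began on.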
+
+## Rationale
+
+- Rate limits and site definitions change more frequently than bind addresses
+  and TLS settings. Hot-reload avoids unnecessary downtime.
+- `ArcSwap` provides lock-free reads and atomic writes — no partial updates,
+  no lock contention on the hot path.
+- Proven pattern from alknet, where it's used for auth policy, forwarding
+  rules, and rate limits.
+- SIGHUP trigger is simple, well-understood, and compatible with systemd and
+  process supervisors.
+- The entire config is swapped at once, preventing inconsistent states where
+  some sites use the old config and others use the new one.
+
+## Consequences
+
+**Positive:**
+- Zero-downtime config reload for sites and rate limits
+- Lock-free reads on the request hot path
+- Atomic config updates — no partial states
+- Proven pattern from alknet
+
+**Negative:**
+- Two config types add conceptual complexity
+- SIGHUP reload requires reading the config file from disk (need to handle
+  file read errors gracefully)
+- Must validate DynamicConfig before swapping (invalid config must not replace
+  valid config)
+
+## References
+
+- [config.md](../config.md)
+- alknet ADR-030 (static/dynamic config split)
--- a/docs/architecture/decisions/009-signal-handling.md
+++ b/docs/architecture/decisions/009-signal-handling.md
@@ -0,0 +1,62 @@
+# ADR-009: Signal Handling Strategy
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy needs to handle Unix signals for:
+- **Graceful shutdown**: SIGTERM and SIGINT should stop accepting new
+  connections, drain in-flight requests, then exit.
+- **Config reload**: SIGHUP should trigger a DynamicConfig reload from disk.
+
+Two approaches for signal handling:
+- **`tokio::signal`**: Built into tokio. `ctrl_c()` handles SIGINT only;
+  SIGTERM and SIGHUP require the Unix-only `tokio::signal::unix` API.
+- **`signal-hook`**: External crate. Handles all Unix signals including SIGHUP.
+  More flexible but adds a dependency.
+
+## Decision
+
+Use `signal-hook` for all signal handling. Specifically:
+- `signal-hook::flag` to set termination flags on SIGTERM/SIGINT
+- `signal-hook` to register a SIGHUP handler that triggers config reload
+
+`tokio::signal::ctrl_c()` is registered as a secondary shutdown trigger; both
+mechanisms converge on the same shutdown path. This is a belt-and-suspenders
+approach: `signal-hook` handles all signals including SIGHUP, while
+`ctrl_c()` provides a fallback for environments where signal handling may not
+be fully wired (e.g., container runtimes).
+
+Signal behavior:
+1. On SIGTERM or SIGINT: stop accepting new connections, wait up to 30 seconds
+   for in-flight requests to complete, then exit with code 0.
+2. On SIGHUP: re-read config file, validate, and swap DynamicConfig if valid.
+   Log the result.
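+
+A flag-driven version of this control flow might look like the sketch below.
+It is synchronous and uses only the standard library; `Action`, `next_action`,
+and `drain` are hypothetical names, and the real proxy would integrate the
+same logic with the tokio runtime. The flags are assumed to be the
+`AtomicBool` values that `signal_hook::flag::register` sets.
+
+```rust
+use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
+use std::thread;
+use std::time::{Duration, Instant};
+
+#[derive(Debug, PartialEq)]
+enum Action {
+    Continue,
+    Reload,
+    Shutdown,
+}
+
+/// Decides what the main loop should do next. Shutdown wins over reload, and
+/// the reload flag is cleared so one SIGHUP causes exactly one reload.
+fn next_action(shutdown: &AtomicBool, reload: &AtomicBool) -> Action {
+    if shutdown.load(Ordering::Acquire) {
+        return Action::Shutdown;
+    }
+    if reload.swap(false, Ordering::AcqRel) {
+        return Action::Reload;
+    }
+    Action::Continue
+}
+
+/// Waits until no requests are in flight or `grace` has elapsed.
+/// Returns true if everything drained before the deadline.
+fn drain(in_flight: &AtomicUsize, grace: Duration) -> bool {
+    let deadline = Instant::now() + grace;
+    while in_flight.load(Ordering::Acquire) > 0 {
+        if Instant::now() >= deadline {
+            return false;
+        }
+        thread::sleep(Duration::from_millis(10));
+    }
+    true
+}
+```
+
+On `Action::Shutdown` the caller would stop accepting connections and call
+`drain` with a 30-second grace period before exiting.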
+
+## Rationale
+
+- SIGHUP handling is required for config reload, and
+  `tokio::signal::ctrl_c()` covers only SIGINT.
+- `signal-hook` is well-maintained, widely used, and handles all Unix signals.
+- Routing every signal through one primary mechanism (`signal-hook`), with
+  `ctrl_c()` only as a fallback feeding the same shutdown path, is simpler
+  than splitting signals between two APIs and avoids edge cases.
+- `signal-hook::flag` is a minimal, safe API for signal-triggered flags.
+
+## Consequences
+
+**Positive:**
+- SIGHUP for config reload is simple and well-understood
+- Single signal handling mechanism for all signals
+- Compatible with systemd (SIGTERM for shutdown) and standard Unix conventions
+
+**Negative:**
+- `signal-hook` is an additional dependency (but a well-established one)
+- Signal handling requires careful coordination with the tokio runtime (async
+  signal receivers must be properly integrated)
+
+## References
+
+- [operations.md](../operations.md)
+- [config.md](../config.md)