Add architecture specification for Rust/axum reverse proxy

Phase 1 architecture docs covering proxy handler, TLS termination (ACME +
manual), TOML config with static/dynamic split (ArcSwap), and operations
(rate limiting, logging, health check, systemd, graceful shutdown).

Nine ADRs documenting key decisions: Rust/axum, custom proxy handler,
TOML config, rustls-acme for cert management, tokio-rustls direct,
token bucket rate limiting, custom log format for fail2ban,
static/dynamic config split, and signal handling strategy.

Includes threat landscape research documenting the nginx CVEs motivating
this project.
2026-06-11 07:25:50 +00:00
parent 5c54a28822
commit 8ee6284b62
17 changed files with 1819 additions and 0 deletions

@@ -0,0 +1,61 @@
# ADR-001: Rust with Axum
## Status
Accepted
## Context
Our current nginx 1.24.0 installation is vulnerable to multiple actively-exploited
CVEs, most critically CVE-2026-42945 (CVSS 9.2, unauthenticated RCE via
`ngx_http_rewrite_module`). Six of seven recent nginx CVEs are memory corruption
bugs (buffer overflow, use-after-free, buffer over-read), the class of
vulnerabilities that safe Rust rules out by construction.
The threat landscape is worsening: LLM-assisted fuzzing is accelerating bug
discovery in nginx's C codebase, and security researchers report additional
undisclosed vulnerabilities.
We need to replace nginx with a memory-safe alternative that can handle:
- TLS termination
- HTTP reverse proxying to backend services
- Rate limiting with fail2ban-compatible logging
- Operational simplicity (single binary, systemd integration)
## Decision
Use Rust with the axum web framework for the reverse proxy implementation.
**Rust** provides:
- Memory safety by construction in safe code (no buffer overflows,
use-after-free, or double-free)
- rustls, a TLS implementation written in Rust that avoids the OpenSSL
dependency and its CVE history
- Single static binary deployment with no runtime dependencies
- Excellent async I/O support via tokio
**axum** provides:
- Ergonomic handler definitions with extractors
- Tower middleware ecosystem (Service trait, layers)
- Type-safe routing and state management
- Well-maintained, widely used, good documentation
## Consequences
**Positive:**
- Eliminates the entire class of memory corruption vulnerabilities affecting
nginx
- Single binary deployment simplifies operations
- Rust's type system catches many errors at compile time
- axum + tower provides composable middleware
**Negative:**
- Smaller ecosystem than nginx for HTTP proxy features (but our use case is
simple)
- We maintain the code (vs. using a battle-tested C project)
- Less granular control over HTTP/2 and connection pooling compared to nginx
- Team needs Rust expertise (already available)
## References
- [threat-landscape.md](../../research/threat-landscape.md)
- [overview.md](../overview.md)

@@ -0,0 +1,56 @@
# ADR-002: Custom Proxy Handler
## Status
Accepted
## Context
We need to implement HTTP reverse proxying — receiving requests and forwarding
them to an upstream service (Gitea on localhost:3000). Two approaches are
available:
1. **`axum-reverse-proxy` crate**: Provides path-based routing, header
forwarding, round-robin load balancing, TLS support, retry mechanisms, and
RFC 9110 compliance.
2. **Custom handler** (Felix Knorr pattern): Build a handler using hyper's
`Client` to forward requests. ~50-100 lines of Rust for our needs.
Our use case is minimal: single upstream per domain, single domain, no load
balancing, no retry, no HTTP/2 proxying.
## Decision
Implement a custom proxy handler using hyper's `Client` for request forwarding,
following the pattern demonstrated by Felix Knorr and used in the alknet
project's channel proxy.
## Rationale
- `axum-reverse-proxy` adds complexity we don't need (load balancing, retry,
path-based routing to multiple backends)
- Our proxy case is the simplest possible: match a Host header, forward the
entire request to a single upstream, stream the response back
- The Felix Knorr pattern is proven, idiomatic, and ~50-100 lines
- We maintain full control over header injection, error handling, and upstream
connection behavior
- If requirements grow, we can adopt `axum-reverse-proxy` later
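
The host-matching step described above can be sketched with the standard library alone. This is a minimal illustration, not the handler itself: the `Routes` type, `strip_port`, and `upstream_uri` are hypothetical names, the hostname `git.example.com` is a placeholder, and the actual forwarding through hyper's `Client` is left out.

```rust
use std::collections::HashMap;

/// Hypothetical routing table: lowercase hostname -> upstream address.
pub struct Routes {
    upstreams: HashMap<String, String>,
}

impl Routes {
    pub fn new() -> Self {
        Routes { upstreams: HashMap::new() }
    }

    /// Register one site. Hostnames are stored lowercase because
    /// Host header matching is case-insensitive.
    pub fn add(&mut self, host: &str, upstream: &str) {
        self.upstreams
            .insert(host.to_ascii_lowercase(), upstream.to_string());
    }

    /// Resolve a raw Host header value to an upstream address,
    /// ignoring an optional `:port` suffix and ASCII case.
    pub fn resolve(&self, host_header: &str) -> Option<&str> {
        let host = strip_port(host_header.trim()).to_ascii_lowercase();
        self.upstreams.get(&host).map(String::as_str)
    }
}

/// Remove a trailing `:port` from a Host header value.
fn strip_port(host: &str) -> &str {
    // Bracketed IPv6 literal such as "[::1]:443": keep up to and including ']'.
    if host.starts_with('[') {
        return match host.find(']') {
            Some(end) => &host[..=end],
            None => host,
        };
    }
    match host.rsplit_once(':') {
        Some((name, port))
            if !port.is_empty() && port.bytes().all(|b| b.is_ascii_digit()) =>
        {
            name
        }
        _ => host,
    }
}

/// Build the upstream URI by prepending the upstream origin to the
/// original path and query, so the whole request is forwarded unchanged.
pub fn upstream_uri(upstream: &str, path_and_query: &str) -> String {
    format!("http://{}{}", upstream, path_and_query)
}
```

With a single entry mapping `git.example.com` to `localhost:3000`, `resolve("Git.Example.com:443")` yields the upstream and any other host yields `None`, which the handler would presumably turn into an error response.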
## Consequences
**Positive:**
- Minimal dependencies
- Full control over proxy behavior
- Easy to understand and audit (~100 lines of proxy code)
- No unnecessary abstraction layers
**Negative:**
- We implement and maintain proxy logic ourselves (but it's trivial for our
use case)
- If requirements grow to load balancing or retry, we'd need to add that
ourselves or switch to `axum-reverse-proxy`
## References
- [proxy.md](../proxy.md)
- Felix Knorr, "Replacing nginx with axum" (felix-knorr.net/posts/2024-10-13-replacing-nginx-with-axum.html)

@@ -0,0 +1,44 @@
# ADR-003: TOML Configuration Format
## Status
Accepted
## Context
The proxy needs a configuration file format for defining sites, TLS settings,
bind addresses, and rate limits. Options include TOML, YAML, JSON, and custom
binary formats.
## Decision
Use TOML as the configuration file format.
## Rationale
- **Rust-native**: TOML is the configuration format for Cargo (Rust's package
manager). The Rust ecosystem has first-class TOML support via `serde` +
`toml` crate.
- **Unambiguous**: TOML values are explicitly typed (strings are always
quoted), unlike YAML, which has surprising type coercion rules (e.g., an
unquoted `no` becomes a boolean and an unquoted `1.0` becomes a float).
- **Human-friendly**: TOML is easy to read and write for simple configurations
like ours. It supports sections (tables), arrays, and inline tables.
- **Good error messages**: The `toml` crate provides clear deserialization
error messages pointing to the exact field that failed.
## Consequences
**Positive:**
- Familiar to Rust developers (Cargo.toml)
- Clear, unambiguous syntax
- Excellent serde integration with detailed error reporting
- No type coercion surprises
**Negative:**
- Not as widely used for config outside Rust (but our audience is ourselves)
- No `#include` or file composition (each config file is self-contained)
## References
- [config.md](../config.md)

@@ -0,0 +1,67 @@
# ADR-004: ACME-Primary Certificate Management
## Status
Accepted
## Context
The proxy needs TLS certificates for HTTPS. Two approaches are available:
1. **certbot (external ACME client)**: Run certbot as a cron job or systemd
timer to obtain and renew certificates. The proxy loads certificates from
files on disk. Renewal requires either SIGHUP/restart or inotify file
watching to pick up new certs.
2. **rustls-acme (built-in ACME client)**: The proxy handles ACME
certificate provisioning and renewal internally as a background task. No
external certbot dependency. The `ResolvesServerCertAcme` cert resolver
automatically serves the correct certificate and updates when renewed.
The alknet project has successfully implemented the rustls-acme approach, and
its patterns are directly reusable.
## Decision
Use `rustls-acme` as the primary certificate management mode, with manual
certificate paths as a fallback mode for testing, self-signed certs, and
corporate CA environments.
## Rationale
- **Eliminates certbot dependency**: No external cron job, no deploy hooks, no
certbot package to install and maintain. The proxy is self-contained.
- **Automatic renewal**: `rustls-acme` runs as a background tokio task that
handles certificate provisioning and renewal automatically (~30 days before
expiry).
- **No restart needed**: When `rustls-acme` provisions a new certificate, the
`ResolvesServerCertAcme` resolver updates atomically. No SIGHUP, no restart,
no file watching.
- **Proven pattern**: alknet uses the same approach successfully.
- **Cache persistence**: `DirCache` persists ACME state between restarts,
avoiding re-provisioning.
- **Fallback mode**: Manual cert paths are still supported for environments
where ACME is not possible.
## Consequences
**Positive:**
- Single binary deployment (no certbot dependency)
- Zero-downtime certificate renewal
- Simpler operational model (no certbot cron, no deploy hooks)
- Proven in alknet
**Negative:**
- `rustls-acme` is an additional dependency
- ACME challenges require either port 80 (HTTP-01) or TLS-ALPN-01 on port 443,
which our proxy already listens on
- Less control over certificate issuance compared to certbot (e.g., no DNS-01
challenge support, though rustls-acme supports TLS-ALPN-01 which is sufficient
for our use case)
- Manual mode requires restart for cert changes (acceptable for fallback)
## References
- [tls.md](../tls.md)
- alknet ADR-008: ACME/Let's Encrypt decision
- `rustls-acme` crate: https://github.com/FlorianUekermann/rustls-acme

@@ -0,0 +1,65 @@
# ADR-005: tokio-rustls Directly, Not axum-server
## Status
Accepted
## Context
We need to serve HTTPS (TLS) traffic through axum. Two approaches exist for
integrating TLS with axum:
1. **`axum-server`**: A wrapper that provides TLS support for axum via
`tls_rustls` feature. Handles TCP binding, TLS accept, and passing TLS
streams to axum. Simple API but limited control over the TLS configuration.
2. **`tokio-rustls` directly**: Bind TCP manually, perform TLS handshake with
`TlsAcceptor`, then serve the TLS stream to axum/hyper. More code but full
control over `ServerConfig`, cipher suites, ALPN protocols, and cert
resolvers.
The alknet project uses tokio-rustls directly and has proven this pattern for
both manual and ACME certificate management.
## Decision
Use `tokio-rustls` directly for TLS termination, with `hyper` serving the
resulting TLS streams to axum. Do not use `axum-server`.
## Rationale
- **ACME integration**: The `rustls-acme` `ResolvesServerCertAcme` resolver
needs to be set as the certificate resolver on `ServerConfig` via
`with_cert_resolver()`. `axum-server` does not expose this level of control
over the `ServerConfig`.
- **Cipher suite control**: We may need to configure cipher suites beyond the
defaults (see OQ-01). `axum-server` wraps the `ServerConfig` construction
and may not expose `CryptoProvider` configuration. Direct `tokio-rustls`
usage gives us full control.
- **ALPN configuration**: ACME TLS-ALPN-01 challenge requires adding
`acme-tls/1` to the ALPN protocol list. This is only possible with direct
`ServerConfig` access.
- **Proven pattern**: alknet uses exactly this approach (`TlsAcceptor` wrapping
`tokio-rustls`, with manual or ACME `ServerConfig` construction).
- **No abstraction cost**: The code to bind TCP, accept TLS, and serve to
axum/hyper is ~50 lines. `axum-server` saves little for our simple case.
## Consequences
**Positive:**
- Full control over TLS configuration
- Direct `rustls-acme` integration
- Ability to add ALPN protocols for ACME challenges
- Proven pattern from alknet
**Negative:**
- Slightly more code than `axum-server` (~50 lines for the TLS acceptor loop)
- Need to manage the TCP listener and TLS accept explicitly
- Must handle the `TlsStream<TcpStream>` → `hyper::service_fn` → axum
integration manually (well-documented pattern from Felix Knorr's blog and
alknet)
## References
- [tls.md](../tls.md)
- alknet transport layer (`alknet-core/src/transport/tls.rs`, `alknet-core/src/transport/acme.rs`)

@@ -0,0 +1,77 @@
# ADR-006: Token Bucket Rate Limiting with In-Memory State
## Status
Accepted
## Context
The proxy must enforce request rate limits per client IP address, replacing
nginx's `limit_req_zone` directive. Rate limiting is critical for preventing
abuse and for fail2ban integration (rate-limited requests trigger fail2ban
actions).
Several rate limiting approaches exist:
- **Token bucket**: Tokens accumulate at a fixed rate; each request consumes a
token. Allows short bursts up to the bucket capacity.
- **Leaky bucket**: Requests are processed at a fixed rate; excess requests
queue or are rejected. No burst allowance.
- **Fixed window**: Count requests in fixed time windows (e.g., per minute).
Allows burst at window boundaries.
- **Sliding window**: Count requests in a rolling time window. More accurate
than fixed window but more complex.
The current nginx config uses `limit_req zone=gitea_limit burst=20 nodelay`.
nginx documents `limit_req` as a leaky bucket, but with `burst` and `nodelay`
it behaves like a token bucket with burst allowance.
For state storage:
- **In-memory HashMap**: Fast, no external dependencies, lost on restart.
- **External store (Redis, etc.)**: Shared across instances, persists across
restarts. Adds operational complexity.
- **tower-governor crate**: Pre-built rate limiting middleware. Uses the
generic cell rate algorithm (GCRA). Adds dependency.
## Decision
Use a token bucket algorithm with in-memory `HashMap<IpAddr, TokenBucket>`
state, protected by `tokio::sync::Mutex`. Rate limiting runs as axum middleware
before the proxy handler.
Rate limits are global per-IP (not per-site) in Phase 1. Per-site rate limits
may be added in Phase 2 as the config model evolves.
Stale entries in the HashMap are cleaned up periodically. A background task
scans the HashMap at a configurable interval (default: 60 seconds) and removes
entries that haven't been accessed within the cleanup interval.
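
A minimal sketch of the token bucket and cleanup described above, using only the standard library. The type and field names (`RateLimiter`, `rate_per_sec`, `burst`, `check`, `cleanup`) are assumptions for illustration. The sketch leaves out the `tokio::sync::Mutex` wrapper and the axum middleware layer, and it takes `now` as a parameter so the refill arithmetic is deterministic.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Per-client bucket state.
struct TokenBucket {
    tokens: f64,
    /// Time of the last request; used for both refill and staleness.
    last_update: Instant,
}

/// Hypothetical limiter; names are illustrative, not taken from the codebase.
pub struct RateLimiter {
    rate_per_sec: f64, // tokens added per second
    burst: f64,        // bucket capacity
    buckets: HashMap<IpAddr, TokenBucket>,
}

impl RateLimiter {
    pub fn new(rate_per_sec: f64, burst: u32) -> Self {
        RateLimiter {
            rate_per_sec,
            burst: f64::from(burst),
            buckets: HashMap::new(),
        }
    }

    /// Returns true if the request from `ip` is allowed at time `now`.
    pub fn check(&mut self, ip: IpAddr, now: Instant) -> bool {
        let burst = self.burst;
        // A client seen for the first time starts with a full bucket.
        let bucket = self.buckets.entry(ip).or_insert(TokenBucket {
            tokens: burst,
            last_update: now,
        });
        // Refill in proportion to elapsed time, capped at the capacity.
        let elapsed = now
            .saturating_duration_since(bucket.last_update)
            .as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * self.rate_per_sec).min(burst);
        bucket.last_update = now;
        // Each allowed request consumes one token.
        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            true
        } else {
            false
        }
    }

    /// Remove entries that have not been touched within `max_idle`.
    pub fn cleanup(&mut self, now: Instant, max_idle: Duration) {
        self.buckets
            .retain(|_, b| now.saturating_duration_since(b.last_update) <= max_idle);
    }

    /// Number of client IPs currently tracked.
    pub fn tracked_ips(&self) -> usize {
        self.buckets.len()
    }
}
```

With a rate of 1 request per second and a burst of 2, two back-to-back requests pass, the third is rejected, and one more is allowed a second later. A background task would call `cleanup` at the configured interval.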
## Rationale
- Token bucket matches nginx's `limit_req burst` semantics, ensuring
behavioral compatibility during migration.
- In-memory state is sufficient for a single-instance proxy (no shared state
needed).
- `tokio::sync::Mutex` (not `std::sync::Mutex`) yields to the async runtime
while waiting for the lock instead of blocking a worker thread, and its guard
can safely be held across await points.
- Custom implementation gives full control over logging output for fail2ban
integration (ADR-007).
- State loss on restart is acceptable — rate limit state is inherently
ephemeral.
## Consequences
**Positive:**
- Behavioral compatibility with nginx rate limiting
- Full control over fail2ban log format
- No external dependencies (Redis, etc.)
- Simple implementation (~100 lines)
**Negative:**
- Rate limit state is lost on restart (acceptable for single-instance deploy)
- Not suitable for multi-instance deployments without external state store
(Phase 1 is single-instance)
- HashMap grows over time without eviction (mitigated by periodic cleanup)
## References
- [operations.md](../operations.md)
- nginx `limit_req` documentation

@@ -0,0 +1,67 @@
# ADR-007: Custom Structured Log Format for Fail2ban
## Status
Accepted
## Context
The proxy needs to produce log output that fail2ban can parse to detect and ban
abusive IP addresses. The current nginx setup uses nginx's default log format
with standard fail2ban filters.
Options for fail2ban integration:
- **nginx-compatible format**: Replicate nginx's log format so existing
fail2ban filters work unchanged. Couples us to nginx's format.
- **Custom structured format**: Design a clean, parseable format with a
corresponding custom fail2ban filter. Gives us control and clarity.
- **JSON format**: Machine-readable but harder for fail2ban regex matching.
## Decision
Use a custom structured log format with a corresponding custom fail2ban filter.
The format for rate-limited requests:
```
RATE_LIMIT client_ip=<IP> host=<host> path=<path> status=429
```
The format for general access logs:
```
REQUEST client_ip=<IP> host=<host> method=<METHOD> path=<path> status=<code> upstream=<addr> duration_ms=<ms>
```
A corresponding fail2ban filter (`/etc/fail2ban/filter.d/reverse-proxy.conf`)
uses regex matching on the `RATE_LIMIT` prefix and `client_ip=<HOST>` field.
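
The two line formats above can be sketched as plain formatting functions. The function names and the `sanitize` helper are assumptions: the ADR does not say how whitespace in attacker-controlled fields is handled, so replacing it with `_` is only one possible choice. `parse_rate_limit_ip` mimics what the fail2ban filter extracts and assumes the line starts with the `RATE_LIMIT` prefix (no timestamp in front).

```rust
use std::net::IpAddr;

/// Replace whitespace and control characters so that a client-supplied
/// value cannot inject extra `key=value` fields or line breaks.
/// (Assumption: the ADR does not specify an escaping rule.)
pub fn sanitize(value: &str) -> String {
    value
        .chars()
        .map(|c| if c.is_whitespace() || c.is_control() { '_' } else { c })
        .collect()
}

/// Format a rate-limit event in the RATE_LIMIT line format.
pub fn rate_limit_line(client_ip: IpAddr, host: &str, path: &str) -> String {
    format!(
        "RATE_LIMIT client_ip={} host={} path={} status=429",
        client_ip,
        sanitize(host),
        sanitize(path)
    )
}

/// Format a general access log entry in the REQUEST line format.
pub fn request_line(
    client_ip: IpAddr,
    host: &str,
    method: &str,
    path: &str,
    status: u16,
    upstream: &str,
    duration_ms: u64,
) -> String {
    format!(
        "REQUEST client_ip={} host={} method={} path={} status={} upstream={} duration_ms={}",
        client_ip,
        sanitize(host),
        sanitize(method),
        sanitize(path),
        status,
        upstream,
        duration_ms
    )
}

/// Extract the client IP from a RATE_LIMIT line, roughly what the
/// fail2ban filter does with its regex. Returns None for other lines.
pub fn parse_rate_limit_ip(line: &str) -> Option<IpAddr> {
    let rest = line.strip_prefix("RATE_LIMIT ")?;
    rest.split(' ')
        .find_map(|field| field.strip_prefix("client_ip="))
        .and_then(|ip| ip.parse().ok())
}
```

Because `client_ip` is always the first field and is rendered from a parsed `IpAddr`, a filter keyed on the prefix and that field cannot be confused by the contents of `host` or `path`.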
## Rationale
- Custom format is clear, unambiguous, and self-documenting
- No coupling to nginx's format, which may change or include fields we don't
produce
- `key=value` pairs are easy to parse with regex and easy to extend
- The `RATE_LIMIT` prefix makes it trivial to distinguish rate-limit events
from other logs
- Writing a custom fail2ban filter is straightforward (5 lines of config)
- We control both sides (the proxy and the filter), so compatibility is
guaranteed
## Consequences
**Positive:**
- Clean, purpose-built format
- Easy to extend with new fields
- No dependency on nginx log format
- Custom fail2ban filter is simple to maintain
**Negative:**
- Cannot reuse existing nginx fail2ban filters (trivial to write our own)
- Existing fail2ban configurations need updating (acceptable since we're
replacing nginx entirely)
## References
- [operations.md](../operations.md)
- [open-questions.md](../open-questions.md) OQ-02 (now resolved)

@@ -0,0 +1,76 @@
# ADR-008: Static/Dynamic Configuration Split with ArcSwap
## Status
Accepted
## Context
The proxy needs configuration that can be partially reloaded at runtime (site
definitions, rate limits) without restarting the process and dropping active
connections. However, some configuration (bind addresses, TLS mode) fundamentally
requires creating new listeners and cannot be changed at runtime.
Two approaches:
- **Full restart for all config changes**: Simple, but requires dropping
active connections for every change, including trivial rate limit adjustments.
- **Static/dynamic split**: Immutable parameters (bind address, TLS mode) in a
`StaticConfig` that requires restart; runtime-adjustable parameters (sites,
rate limits) in a `DynamicConfig` that can be atomically swapped via
`Arc<ArcSwap<DynamicConfig>>` without dropping connections.
This pattern is proven in the alknet project, which uses the same
`ArcSwap<DynamicConfig>` approach for auth policy, forwarding rules, and rate
limits.
## Decision
Split configuration into `StaticConfig` (immutable after startup) and
`DynamicConfig` (hot-reloadable via `ArcSwap`). The split is:
**StaticConfig** (restart required):
- Bind address, HTTP port, HTTPS port
- TLS mode (ACME vs. manual), cert paths, ACME settings
- Log level and format
**DynamicConfig** (hot-reloadable via SIGHUP):
- Site definitions (hostname → upstream mappings)
- Rate limits (requests per second, burst)
- Body size limits
`ConfigReloadHandle` provides a `reload(DynamicConfig)` method that atomically
swaps the entire config. All request handlers read `DynamicConfig` via
`ArcSwap::load()` — a lock-free operation.
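
The validate-then-swap behaviour can be sketched as follows. To stay within the standard library, this sketch substitutes `RwLock<Arc<DynamicConfig>>` for `ArcSwap`, so reads here take a brief read lock rather than being lock-free. The `DynamicConfig` fields and the validation rules are assumptions for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hot-reloadable settings. Field names are illustrative.
#[derive(Debug, Clone, PartialEq)]
pub struct DynamicConfig {
    pub sites: HashMap<String, String>, // hostname -> upstream address
    pub requests_per_second: u32,
    pub burst: u32,
    pub max_body_bytes: u64,
}

impl DynamicConfig {
    /// Hypothetical validation rules; the real checks may differ.
    pub fn validate(&self) -> Result<(), String> {
        if self.sites.is_empty() {
            return Err("no sites defined".to_string());
        }
        if self.requests_per_second == 0 {
            return Err("requests_per_second must be greater than 0".to_string());
        }
        Ok(())
    }
}

/// Stand-in for the ArcSwap-backed handle: RwLock<Arc<_>> is used here
/// only so the sketch compiles without external crates.
#[derive(Clone)]
pub struct ConfigReloadHandle {
    current: Arc<RwLock<Arc<DynamicConfig>>>,
}

impl ConfigReloadHandle {
    pub fn new(initial: DynamicConfig) -> Self {
        ConfigReloadHandle {
            current: Arc::new(RwLock::new(Arc::new(initial))),
        }
    }

    /// Take a snapshot for the duration of one request. The snapshot
    /// stays valid and unchanged even if a reload happens meanwhile.
    pub fn load(&self) -> Arc<DynamicConfig> {
        let guard = self.current.read().expect("config lock poisoned");
        Arc::clone(&*guard)
    }

    /// Validate first, then replace the whole config in one step.
    /// An invalid config never replaces the valid one.
    pub fn reload(&self, next: DynamicConfig) -> Result<(), String> {
        next.validate()?;
        let mut guard = self.current.write().expect("config lock poisoned");
        *guard = Arc::new(next);
        Ok(())
    }
}
```

A handler that called `load()` before a reload keeps its old snapshot until it finishes, while later requests see the new config. A rejected reload leaves the current config untouched.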
## Rationale
- Rate limits and site definitions change more frequently than bind addresses
and TLS settings. Hot-reload avoids unnecessary downtime.
- `ArcSwap` provides lock-free reads and atomic writes — no partial updates,
no lock contention on the hot path.
- Proven pattern from alknet, where it's used for auth policy, forwarding
rules, and rate limits.
- SIGHUP trigger is simple, well-understood, and compatible with systemd and
process supervisors.
- The entire config is swapped at once, preventing inconsistent states where
some sites use the old config and others use the new one.
## Consequences
**Positive:**
- Zero-downtime config reload for sites and rate limits
- Lock-free reads on the request hot path
- Atomic config updates — no partial states
- Proven pattern from alknet
**Negative:**
- Two config types add conceptual complexity
- SIGHUP reload requires reading the config file from disk (need to handle
file read errors gracefully)
- Must validate DynamicConfig before swapping (invalid config must not replace
valid config)
## References
- [config.md](../config.md)
- alknet ADR-030 (static/dynamic config split)

@@ -0,0 +1,62 @@
# ADR-009: Signal Handling Strategy
## Status
Accepted
## Context
The proxy needs to handle Unix signals for:
- **Graceful shutdown**: SIGTERM and SIGINT should stop accepting new
connections, drain in-flight requests, then exit.
- **Config reload**: SIGHUP should trigger a DynamicConfig reload from disk.
Two approaches for signal handling:
- **`tokio::signal`**: Built into tokio. `ctrl_c()` covers SIGINT only; SIGTERM
and SIGHUP require the Unix-specific `tokio::signal::unix::signal` API with
the matching `SignalKind`.
- **`signal-hook`**: External crate. Handles all Unix signals including SIGHUP.
More flexible but adds a dependency.
## Decision
Use `signal-hook` for all signal handling. Specifically:
- `signal-hook::flag` to set termination flags on SIGTERM/SIGINT
- `signal-hook` to register a SIGHUP handler that triggers config reload
`tokio::signal::ctrl_c()` is registered as a secondary shutdown trigger; both
mechanisms converge on the same shutdown path. This is a belt-and-suspenders
approach: `signal-hook` handles all signals including SIGHUP, while
`ctrl_c()` provides a fallback for environments where signal handling may not
be fully wired (e.g., container runtimes).
The shutdown sequence:
1. On SIGTERM or SIGINT: stop accepting new connections, wait up to 30 seconds
for in-flight requests to complete, then exit with code 0.
2. On SIGHUP: re-read config file, validate, and swap DynamicConfig if valid.
Log the result.
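
The shutdown half of this sequence can be sketched with atomics from the standard library. The `Shutdown` type and its method names are hypothetical. In the proxy, the flag would be set by the signal handler rather than by hand, and the polling loop here is a simple stand-in for whatever async wait the tokio-based implementation uses.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Shared shutdown state. Names are illustrative.
pub struct Shutdown {
    /// Set once SIGTERM or SIGINT has been received.
    pub requested: AtomicBool,
    /// Number of requests currently being processed.
    pub in_flight: AtomicUsize,
}

impl Shutdown {
    pub fn new() -> Self {
        Shutdown {
            requested: AtomicBool::new(false),
            in_flight: AtomicUsize::new(0),
        }
    }

    /// The accept loop checks this before taking a new connection.
    pub fn should_accept(&self) -> bool {
        !self.requested.load(Ordering::SeqCst)
    }

    /// Called from the signal path to start the shutdown.
    pub fn request(&self) {
        self.requested.store(true, Ordering::SeqCst);
    }

    /// Wait until no requests are in flight or `grace` has elapsed.
    /// Returns true if everything drained, false on timeout.
    pub fn drain(&self, grace: Duration) -> bool {
        let deadline = Instant::now() + grace;
        loop {
            if self.in_flight.load(Ordering::SeqCst) == 0 {
                return true;
            }
            if Instant::now() >= deadline {
                return false;
            }
            // Poll at a short interval; a real implementation would
            // more likely wait on a notification instead.
            thread::sleep(Duration::from_millis(10));
        }
    }
}
```

With the 30 second limit from step 1, the main task would call `drain(Duration::from_secs(30))` after the flag is set and then exit, logging whether the drain completed or timed out.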
## Rationale
- SIGHUP handling is required for config reload, and `tokio::signal::ctrl_c()`
does not cover SIGHUP (that would need the separate `tokio::signal::unix`
API).
- `signal-hook` is well-maintained, widely used, and handles all Unix signals.
- Routing every signal through `signal-hook` as the primary mechanism, with
`ctrl_c()` only as a fallback shutdown trigger, keeps the handling logic in
one place and avoids edge cases.
- `signal-hook::flag` is a minimal, safe API for signal-triggered flags.
## Consequences
**Positive:**
- SIGHUP for config reload is simple and well-understood
- Single primary signal handling mechanism for all signals
- Compatible with systemd (SIGTERM for shutdown) and standard Unix conventions
**Negative:**
- `signal-hook` is an additional dependency (but a well-established one)
- Signal handling requires careful coordination with the tokio runtime (async
signal receivers must be properly integrated)
## References
- [operations.md](../operations.md)
- [config.md](../config.md)