Add architecture specification for Rust/axum reverse proxy

Phase 1 architecture docs covering proxy handler, TLS termination (ACME +
manual), TOML config with static/dynamic split (ArcSwap), and operations
(rate limiting, logging, health check, systemd, graceful shutdown).

Nine ADRs documenting key decisions: Rust/axum, custom proxy handler,
TOML config, rustls-acme for cert management, tokio-rustls direct,
token bucket rate limiting, custom log format for fail2ban,
static/dynamic config split, and signal handling strategy.

Includes threat landscape research documenting the nginx CVEs motivating
this project.
2026-06-11 07:25:50 +00:00
parent 5c54a28822
commit 8ee6284b62
17 changed files with 1819 additions and 0 deletions

@@ -0,0 +1,61 @@
---
status: draft
last_updated: 2026-06-11
---
# Reverse Proxy — Architecture
## Current State
**Phase 0 (Exploration) — Complete.** Phase 1 (Architecture) — In progress.
This project replaces our vulnerable nginx 1.24.0 installation with a
memory-safe Rust/axum reverse proxy. The primary motivation is CVE-2026-42945
(unauthenticated RCE in nginx's rewrite module) and the broader pattern of
memory corruption bugs in nginx's C codebase.
## Architecture Documents
| Document | Status | Description |
|----------|--------|-------------|
| [overview.md](overview.md) | Draft | Vision, scope, crate dependencies, exports |
| [proxy.md](proxy.md) | Draft | Reverse proxy handler, request flow, header injection |
| [tls.md](tls.md) | Draft | TLS termination, ACME, manual certs, SNI |
| [config.md](config.md) | Draft | TOML config format, static/dynamic split, ArcSwap reload |
| [operations.md](operations.md) | Draft | Rate limiting, logging, health check, systemd, shutdown |
## ADR Table
| ADR | Title | Status |
|-----|-------|--------|
| [001](decisions/001-rust-axum.md) | Rust with Axum | Accepted |
| [002](decisions/002-custom-proxy-handler.md) | Custom Proxy Handler | Accepted |
| [003](decisions/003-toml-config.md) | TOML Configuration Format | Accepted |
| [004](decisions/004-rustls-acme.md) | ACME-Primary Certificate Management | Accepted |
| [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls Directly, Not axum-server | Accepted |
| [006](decisions/006-rate-limiting-approach.md) | Token Bucket Rate Limiting | Accepted |
| [007](decisions/007-custom-log-format.md) | Custom Structured Log Format | Accepted |
| [008](decisions/008-static-dynamic-config-split.md) | Static/Dynamic Config Split with ArcSwap | Accepted |
| [009](decisions/009-signal-handling.md) | Signal Handling Strategy | Accepted |
## Open Questions
See [open-questions.md](open-questions.md) for the full tracker.
| OQ | Question | Priority | Status |
|----|----------|----------|--------|
| OQ-01 | Should cipher suites be restricted beyond rustls defaults? | medium | open |
| ~~OQ-02~~ | ~~What log format should fail2ban consume?~~ | ~~high~~ | **resolved** (ADR-007) |
| OQ-03 | Should the health check endpoint be on a separate port? | low | open |
| OQ-04 | Config reload: SIGHUP only or also Unix socket API? | low | open |
| OQ-05 | Should the proxy bind to multiple addresses? | low | open |
| OQ-06 | Should upstream timeouts be configurable per-site? | low | open |
## Document Lifecycle
| Status | Meaning | Transitions |
|--------|---------|-------------|
| `draft` | Under active development. May change significantly. | → `reviewed` when open questions are resolved |
| `reviewed` | Architecture is final. Implementation may begin. | → `stable` when implementation is complete |
| `stable` | Locked. Changes require review and may warrant an ADR. | → `deprecated` when superseded |
| `deprecated` | Superseded. Kept for reference. | Removed when no longer referenced |

docs/architecture/config.md Normal file
@@ -0,0 +1,206 @@
---
status: draft
last_updated: 2026-06-11
---
# Configuration
## What It Is
The configuration system defines how the proxy is configured, how configuration
is loaded, and how dynamic configuration can be reloaded without restarting the
process.
## Why It Exists
The proxy needs to be configurable without hard-coding domains, upstream
addresses, or TLS settings. The configuration system separates immutable
startup parameters (bind addresses, TLS mode) from runtime-adjustable
parameters (site definitions, rate limits) using the `ArcSwap` pattern proven
in the alknet project.
## Architecture
```
                   config.toml
                        │
                        ▼
             ┌──────────────────────┐
             │  serde::Deserialize  │
             │  (TOML → Config)     │
             └──────────┬───────────┘
                        │
           ┌────────────┴────────────┐
           ▼                         ▼
┌──────────────────────┐  ┌──────────────────────┐
│ StaticConfig         │  │ DynamicConfig        │
│ (immutable)          │  │ (hot-reloadable)     │
│                      │  │                      │
│ bind_addr            │  │ sites[]              │
│ http_port            │  │ rate_limit           │
│ https_port           │  │ body_limit           │
│ tls.mode             │  │ proxy_headers        │
│ tls.acme_domain      │  │                      │
│ tls.acme_cache_dir   │  │ ← ArcSwap →          │
│ tls.acme_directory   │  │ ConfigReloadHandle   │
│ tls.cert_path        │  │ .reload(new_config)  │
│ tls.key_path         │  │                      │
│ log_level            │  └──────────────────────┘
│ log_format           │
└──────────────────────┘
```
## Static vs Dynamic Configuration
This split follows the pattern established in alknet (ADR-030) and adapted
for our simpler use case.
### StaticConfig
Immutable after startup. Changes require a process restart.
| Field | Type | Description |
|-------|------|-------------|
| `bind_addr` | `String` | IP address to bind to (e.g., `"15.235.125.95"`) |
| `http_port` | `u16` | Port for HTTP→HTTPS redirect (default: `80`; set to `0` to disable) |
| `https_port` | `u16` | Port for TLS listener (default: `443`) |
| `tls.mode` | `"acme"` or `"manual"` | Certificate provisioning mode |
| `tls.acme_domain` | `String` | Domain for ACME (ACME mode only) |
| `tls.acme_cache_dir` | `String` | ACME state cache directory |
| `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory |
| `tls.cert_path` | `String` | Certificate file path (manual mode only) |
| `tls.key_path` | `String` | Private key file path (manual mode only) |
| `log_level` | `"trace"`, `"debug"`, `"info"`, `"warn"`, `"error"` | Logging verbosity |
| `log_format` | `"text"` or `"json"` | Log output format |
**Why these are static:** Changing bind addresses, ports, or TLS mode requires
creating new listeners and TLS configurations — operations that fundamentally
require a restart. There's no safe way to change these at runtime.
### DynamicConfig
Hot-reloadable at runtime via `ArcSwap`. Changes take effect for new
requests immediately.
| Field | Type | Description |
|-------|------|-------------|
| `sites` | `Vec<SiteConfig>` | Site definitions (hostname → upstream mapping) |
| `rate_limit.requests_per_second` | `u32` | Rate limit per IP (global in Phase 1) |
| `rate_limit.burst` | `u32` | Burst capacity (global in Phase 1) |
| `body_limit_bytes` | `u64` | Max request body size in bytes (global in Phase 1) |
**SiteConfig:**
| Field | Type | Description |
|-------|------|-------------|
| `host` | `String` | Hostname to match (e.g., `"git.alk.dev"`) |
| `upstream` | `String` | Upstream address (e.g., `"127.0.0.1:3000"`) |
| `upstream_scheme` | `"http"` or `"https"` | Protocol for upstream connection (default: `"http"`) |
**Why these are dynamic:** Site definitions and rate limits are per-request
concerns. Adding a site or changing a rate limit should not require restarting
the proxy and dropping active connections. Rate limits and body limits are
global settings in Phase 1; per-site configuration for these may be added in
Phase 2.
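
As a rough illustration of the tables above, the dynamic half of the config could be modeled with plain structs like the following. This is a sketch using only the standard library, not the project's actual types: the struct and field names mirror the tables, while `find_site` and its matching rules (ignoring a numeric `:port` suffix, case-insensitive comparison) are assumptions that the document does not specify.

```rust
/// Hypothetical in-memory shape of one `[[sites]]` entry.
#[derive(Debug, Clone, PartialEq)]
pub struct SiteConfig {
    pub host: String,            // hostname to match, e.g. "git.alk.dev"
    pub upstream: String,        // upstream address, e.g. "127.0.0.1:3000"
    pub upstream_scheme: String, // "http" (default) or "https"
}

#[derive(Debug, Clone, PartialEq)]
pub struct RateLimit {
    pub requests_per_second: u32,
    pub burst: u32,
}

#[derive(Debug, Clone, PartialEq)]
pub struct DynamicConfig {
    pub sites: Vec<SiteConfig>,
    pub rate_limit: RateLimit,
    pub body_limit_bytes: u64,
}

impl DynamicConfig {
    /// Find the site for a request's Host header value.
    /// Assumption: a trailing ":<digits>" port is ignored and the
    /// comparison is ASCII case-insensitive.
    pub fn find_site(&self, host_header: &str) -> Option<&SiteConfig> {
        let host = match host_header.rsplit_once(':') {
            Some((h, port)) if !port.is_empty() && port.bytes().all(|b| b.is_ascii_digit()) => h,
            _ => host_header,
        };
        self.sites.iter().find(|s| s.host.eq_ignore_ascii_case(host))
    }
}
```

A handler would call `find_site` with the incoming Host header and answer with an error status when it returns `None`.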
## Config Reload
### ArcSwap Pattern
`DynamicConfig` is wrapped in `Arc<ArcSwap<DynamicConfig>>`. This provides:
- **Lock-free reads**: Every handler reads the current config via
`ArcSwap::load()`, so there is no lock contention on the request hot path.
- **Atomic writes**: `ConfigReloadHandle::reload(new_config)` swaps the entire
config atomically. All new requests see the new config immediately.
- **No partial updates**: The entire config is swapped at once. There's no risk
of reading a half-updated config.
See [ADR-008](decisions/008-static-dynamic-config-split.md) for the rationale
behind this split.
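
The whole-config swap can be sketched without external crates. The stand-in below uses `RwLock<Arc<T>>` from the standard library purely to show the snapshot semantics; it is not lock-free, and the design described here uses `ArcSwap` from the `arc-swap` crate instead. The `reload` method name follows the text; `new` and `load` are assumed names.

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for `Arc<ArcSwap<T>>` built only on the standard library.
/// NOTE: `RwLock` is NOT lock-free; only the "swap the whole Arc" idea
/// is illustrated here.
pub struct ConfigReloadHandle<T> {
    current: Arc<RwLock<Arc<T>>>,
}

impl<T> Clone for ConfigReloadHandle<T> {
    fn clone(&self) -> Self {
        Self { current: Arc::clone(&self.current) }
    }
}

impl<T> ConfigReloadHandle<T> {
    pub fn new(initial: T) -> Self {
        Self { current: Arc::new(RwLock::new(Arc::new(initial))) }
    }

    /// Take a cheap snapshot; a handler keeps it for the whole request.
    pub fn load(&self) -> Arc<T> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Replace the entire config at once; snapshots already handed out
    /// keep pointing at the old value until they are dropped.
    pub fn reload(&self, new_config: T) {
        *self.current.write().unwrap() = Arc::new(new_config);
    }
}
```

Because readers hold an `Arc` to a complete config value, a request that started before a reload never observes a mix of old and new settings.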
### Reload Trigger
The initial implementation uses SIGHUP as the reload trigger. When the process
receives SIGHUP:
1. Re-read the config file from disk
2. Deserialize into `DynamicConfig`
3. Validate (an upstream reachability check is optional)
4. Call `ConfigReloadHandle::reload(new_config)`
Future implementations could add a Unix domain socket API or HTTP endpoint for
config reload, but SIGHUP is sufficient for Phase 1.
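
The four reload steps can be sketched as a single function that replaces the active config only when every step succeeds. The names `reload_steps` and `ReloadError` are hypothetical, the read, parse, and validate steps are passed in as closures to keep the sketch dependency-free, and a plain `&mut Arc<C>` stands in for the `ArcSwap` handle.

```rust
use std::sync::Arc;

/// Hypothetical error type: one variant per step that can fail.
#[derive(Debug, PartialEq)]
pub enum ReloadError {
    Read(String),
    Parse(String),
    Invalid(String),
}

/// Run the reload steps in order. `current` is replaced only if all of
/// them succeed, so a failed reload leaves the old config active.
pub fn reload_steps<C>(
    read: impl FnOnce() -> std::io::Result<String>,
    parse: impl FnOnce(&str) -> Result<C, String>,
    validate: impl FnOnce(&C) -> Result<(), String>,
    current: &mut Arc<C>,
) -> Result<(), ReloadError> {
    // 1. Re-read the config file from disk.
    let text = read().map_err(|e| ReloadError::Read(e.to_string()))?;
    // 2. Deserialize into the config type.
    let candidate = parse(&text).map_err(ReloadError::Parse)?;
    // 3. Validate before touching the active config.
    validate(&candidate).map_err(ReloadError::Invalid)?;
    // 4. Swap in the new config.
    *current = Arc::new(candidate);
    Ok(())
}
```

The caller would log the returned error and carry on serving with the previous config.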
## TOML Config Format
```toml
# reverse-proxy config
[server]
bind_addr = "15.235.125.95"
http_port = 80
https_port = 443
[server.tls]
mode = "acme" # "acme" or "manual"
acme_domain = "git.alk.dev"
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
acme_directory = "production" # "production" or "staging"
# Manual mode (uncomment and comment out ACME settings)
# mode = "manual"
# cert_path = "/etc/letsencrypt/live/git.alk.dev/fullchain.pem"
# key_path = "/etc/letsencrypt/live/git.alk.dev/privkey.pem"
[server.logging]
level = "info"
format = "text" # "text" or "json"
[rate_limit]
requests_per_second = 10
burst = 20
[body]
limit_bytes = 104857600 # 100 MB
[[sites]]
host = "git.alk.dev"
upstream = "127.0.0.1:3000"
upstream_scheme = "http"
```
### Validation
On startup, the config is validated:
1. `bind_addr` is not `0.0.0.0` (must be explicit)
2. In ACME mode, `acme_domain` must be set
3. In manual mode, `cert_path` and `key_path` must both be set and the files
must be readable
4. Each site must have a `host` and `upstream`
5. `rate_limit.requests_per_second` must be > 0
6. `body.limit_bytes` must be > 0
On SIGHUP reload, the same validation applies. If the new config fails
validation, the reload is rejected and the old config remains active. An error
is logged.
**On startup**: If config validation fails, the process exits with a non-zero
code and logs the validation errors. The proxy will not start with an invalid
configuration.
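
The validation rules could be expressed roughly as follows. This is a sketch under stated assumptions: the flattened `Config`, `TlsMode`, and `Site` types are hypothetical simplifications of the real config, the wildcard check also rejects the IPv6 unspecified address, and rejecting an unparseable `bind_addr` is an extra check not listed above.

```rust
use std::fs::File;
use std::net::IpAddr;

pub enum TlsMode {
    Acme { domain: Option<String> },
    Manual { cert_path: Option<String>, key_path: Option<String> },
}

pub struct Site {
    pub host: String,
    pub upstream: String,
}

pub struct Config {
    pub bind_addr: String,
    pub tls: TlsMode,
    pub sites: Vec<Site>,
    pub requests_per_second: u32,
    pub body_limit_bytes: u64,
}

/// Returns every rule violation found; an empty list means the config is valid.
pub fn validate(cfg: &Config) -> Vec<String> {
    let mut errors: Vec<String> = Vec::new();
    // Rule 1: bind_addr must be explicit, not a wildcard address.
    match cfg.bind_addr.parse::<IpAddr>() {
        Ok(ip) if ip.is_unspecified() => errors.push("bind_addr must not be a wildcard".to_string()),
        Ok(_) => {}
        Err(_) => errors.push("bind_addr is not a valid IP address".to_string()),
    }
    match &cfg.tls {
        // Rule 2: ACME mode needs a domain.
        TlsMode::Acme { domain } => {
            if domain.as_deref().unwrap_or("").is_empty() {
                errors.push("acme_domain must be set in ACME mode".to_string());
            }
        }
        // Rule 3: manual mode needs both paths, and both files must be readable.
        TlsMode::Manual { cert_path, key_path } => {
            for (name, path) in [("cert_path", cert_path), ("key_path", key_path)] {
                match path {
                    None => errors.push(format!("{name} must be set in manual mode")),
                    Some(p) if File::open(p).is_err() => errors.push(format!("{name} is not readable: {p}")),
                    Some(_) => {}
                }
            }
        }
    }
    // Rule 4: every site needs a host and an upstream.
    for (i, site) in cfg.sites.iter().enumerate() {
        if site.host.is_empty() || site.upstream.is_empty() {
            errors.push(format!("sites[{i}] needs both host and upstream"));
        }
    }
    // Rules 5 and 6: limits must be positive.
    if cfg.requests_per_second == 0 {
        errors.push("rate_limit.requests_per_second must be > 0".to_string());
    }
    if cfg.body_limit_bytes == 0 {
        errors.push("body.limit_bytes must be > 0".to_string());
    }
    errors
}
```

Collecting all violations instead of stopping at the first one lets a single startup or reload attempt report everything that needs fixing.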
## Design Decisions
All design decisions are documented as ADRs in [decisions/](decisions/).
| ADR | Decision | Summary |
|-----|----------|---------|
| [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
| [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config split | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-04**: Should config reload support a Unix domain socket API in addition
to SIGHUP? (open)

@@ -0,0 +1,61 @@
# ADR-001: Rust with Axum
## Status
Accepted
## Context
Our current nginx 1.24.0 installation is vulnerable to multiple actively-exploited
CVEs, most critically CVE-2026-42945 (CVSS 9.2, unauthenticated RCE via
`ngx_http_rewrite_module`). Six of seven recent nginx CVEs are memory corruption
bugs (buffer overflow, use-after-free, buffer overread) — the exact class of
vulnerabilities that Rust eliminates by construction.
The threat landscape is worsening: LLM-assisted fuzzing is accelerating bug
discovery in nginx's C codebase, and security researchers report additional
undisclosed vulnerabilities.
We need to replace nginx with a memory-safe alternative that can handle:
- TLS termination
- HTTP reverse proxying to backend services
- Rate limiting with fail2ban-compatible logging
- Operational simplicity (single binary, systemd integration)
## Decision
Use Rust with the axum web framework for the reverse proxy implementation.
**Rust** provides:
- Memory safety by construction (no buffer overflows, use-after-free, or
double-free at runtime)
- rustls (pure Rust TLS) avoids OpenSSL dependency and its CVE history
- Single static binary deployment with no runtime dependencies
- Excellent async I/O support via tokio
**axum** provides:
- Ergonomic handler definitions with extractors
- Tower middleware ecosystem (Service trait, layers)
- Type-safe routing and state management
- Well-maintained, widely used, good documentation
## Consequences
**Positive:**
- Eliminates the entire class of memory corruption vulnerabilities affecting
nginx
- Single binary deployment simplifies operations
- Rust's type system catches many errors at compile time
- axum + tower provides composable middleware
**Negative:**
- Smaller ecosystem than nginx for HTTP proxy features (but our use case is
simple)
- We maintain the code (vs. using a battle-tested C project)
- Less granular control over HTTP/2 and connection pooling compared to nginx
- Team needs Rust expertise (already available)
## References
- [threat-landscape.md](../../research/threat-landscape.md)
- [overview.md](../overview.md)

@@ -0,0 +1,56 @@
# ADR-002: Custom Proxy Handler
## Status
Accepted
## Context
We need to implement HTTP reverse proxying — receiving requests and forwarding
them to an upstream service (Gitea on localhost:3000). Two approaches are
available:
1. **`axum-reverse-proxy` crate**: Provides path-based routing, header
forwarding, round-robin load balancing, TLS support, retry mechanisms, and
RFC 9110 compliance.
2. **Custom handler** (Felix Knorr pattern): Build a handler using hyper's
`Client` to forward requests. ~50-100 lines of Rust for our needs.
Our use case is minimal: single upstream per domain, single domain, no load
balancing, no retry, no HTTP/2 proxying.
## Decision
Implement a custom proxy handler using hyper's `Client` for request forwarding,
following the pattern demonstrated by Felix Knorr and used in the alknet
project's channel proxy.
## Rationale
- `axum-reverse-proxy` adds complexity we don't need (load balancing, retry,
path-based routing to multiple backends)
- Our proxy case is the simplest possible: match a Host header, forward the
entire request to a single upstream, stream the response back
- The Felix Knorr pattern is proven, idiomatic, and ~50-100 lines
- We maintain full control over header injection, error handling, and upstream
connection behavior
- If requirements grow, we can adopt `axum-reverse-proxy` later
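
To make the "forward the entire request to a single upstream" step concrete, the URI rewrite at the heart of such a handler might look like the sketch below. The function name `upstream_uri` is hypothetical, and treating an empty request target as `/` is an assumption; the actual forwarding with hyper is not shown.

```rust
/// Build the URI to request from the upstream for an incoming request.
/// `path_and_query` is the request target as received, for example
/// "/user/repo?tab=1"; it is passed through unchanged.
pub fn upstream_uri(scheme: &str, upstream: &str, path_and_query: &str) -> String {
    // Assumption: an empty target is forwarded as "/".
    let target = if path_and_query.is_empty() { "/" } else { path_and_query };
    format!("{scheme}://{upstream}{target}")
}
```

With the example site from the config, a request for `/user/repo?tab=1` would be sent to `http://127.0.0.1:3000/user/repo?tab=1`.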
## Consequences
**Positive:**
- Minimal dependencies
- Full control over proxy behavior
- Easy to understand and audit (~50-100 lines of proxy code)
- No unnecessary abstraction layers
**Negative:**
- We implement and maintain proxy logic ourselves (but it's trivial for our
use case)
- If requirements grow to load balancing or retry, we'd need to add that
ourselves or switch to `axum-reverse-proxy`
## References
- [proxy.md](../proxy.md)
- Felix Knorr, "Replacing nginx with axum" (felix-knorr.net/posts/2024-10-13-replacing-nginx-with-axum.html)

@@ -0,0 +1,44 @@
# ADR-003: TOML Configuration Format
## Status
Accepted
## Context
The proxy needs a configuration file format for defining sites, TLS settings,
bind addresses, and rate limits. Options include TOML, YAML, JSON, and custom
binary formats.
## Decision
Use TOML as the configuration file format.
## Rationale
- **Rust-native**: TOML is the configuration format for Cargo (Rust's package
manager). The Rust ecosystem has first-class TOML support via `serde` +
`toml` crate.
- **Unambiguous**: Every TOML document maps to exactly one data structure with
explicit types, unlike YAML, whose implicit typing rules produce surprising
coercions (e.g., unquoted `no` → boolean, `1.0` → float).
- **Human-friendly**: TOML is easy to read and write for simple configurations
like ours. It supports sections (tables), arrays, and inline tables.
- **Good error messages**: The `toml` crate provides clear deserialization
error messages pointing to the exact field that failed.
## Consequences
**Positive:**
- Familiar to Rust developers (Cargo.toml)
- Clear, unambiguous syntax
- Excellent serde integration with detailed error reporting
- No type coercion surprises
**Negative:**
- Not as widely used for config outside Rust (but our audience is ourselves)
- No `#include` or file composition (each config file is self-contained)
## References
- [config.md](../config.md)

@@ -0,0 +1,67 @@
# ADR-004: ACME-Primary Certificate Management
## Status
Accepted
## Context
The proxy needs TLS certificates for HTTPS. Two approaches are available:
1. **certbot (external ACME client)**: Run certbot as a cron job or systemd
timer to obtain and renew certificates. The proxy loads certificates from
files on disk. Renewal requires either SIGHUP/restart or inotify file
watching to pick up new certs.
2. **rustls-acme (built-in ACME client)**: The proxy handles ACME
certificate provisioning and renewal internally as a background task. No
external certbot dependency. The `ResolvesServerCertAcme` cert resolver
automatically serves the correct certificate and updates when renewed.
The alknet project has successfully implemented the rustls-acme approach, and
its patterns are directly reusable.
## Decision
Use `rustls-acme` as the primary certificate management mode, with manual
certificate paths as a fallback mode for testing, self-signed certs, and
corporate CA environments.
## Rationale
- **Eliminates certbot dependency**: No external cron job, no deploy hooks, no
certbot package to install and maintain. The proxy is self-contained.
- **Automatic renewal**: `rustls-acme` runs as a background tokio task that
handles certificate provisioning and renewal automatically (~30 days before
expiry).
- **No restart needed**: When `rustls-acme` provisions a new certificate, the
`ResolvesServerCertAcme` resolver updates atomically. No SIGHUP, no restart,
no file watching.
- **Proven pattern**: alknet uses the same approach successfully.
- **Cache persistence**: `DirCache` persists ACME state between restarts,
avoiding re-provisioning.
- **Fallback mode**: Manual cert paths are still supported for environments
where ACME is not possible.
## Consequences
**Positive:**
- Single binary deployment (no certbot dependency)
- Zero-downtime certificate renewal
- Simpler operational model (no certbot cron, no deploy hooks)
- Proven in alknet
**Negative:**
- `rustls-acme` is an additional dependency
- ACME challenges require either port 80 (HTTP-01) or TLS-ALPN-01 on port 443,
which our proxy already listens on
- Less control over certificate issuance compared to certbot (e.g., no DNS-01
challenge support, though rustls-acme supports TLS-ALPN-01 which is sufficient
for our use case)
- Manual mode requires restart for cert changes (acceptable for fallback)
## References
- [tls.md](../tls.md)
- alknet ADR-008: ACME/Let's Encrypt decision
- `rustls-acme` crate: https://github.com/FlorianUekermann/rustls-acme

@@ -0,0 +1,65 @@
# ADR-005: tokio-rustls Directly, Not axum-server
## Status
Accepted
## Context
We need to serve HTTPS (TLS) traffic through axum. Two approaches exist for
integrating TLS with axum:
1. **`axum-server`**: A wrapper that provides TLS support for axum via
`tls_rustls` feature. Handles TCP binding, TLS accept, and passing TLS
streams to axum. Simple API but limited control over the TLS configuration.
2. **`tokio-rustls` directly**: Bind TCP manually, perform TLS handshake with
`TlsAcceptor`, then serve the TLS stream to axum/hyper. More code but full
control over `ServerConfig`, cipher suites, ALPN protocols, and cert
resolvers.
The alknet project uses tokio-rustls directly and has proven this pattern for
both manual and ACME certificate management.
## Decision
Use `tokio-rustls` directly for TLS termination, with `hyper` serving the
resulting TLS streams to axum. Do not use `axum-server`.
## Rationale
- **ACME integration**: The `rustls-acme` `ResolvesServerCertAcme` resolver
needs to be set as the certificate resolver on `ServerConfig` via
`with_cert_resolver()`. `axum-server` does not expose this level of control
over the `ServerConfig`.
- **Cipher suite control**: We may need to configure cipher suites beyond the
defaults (see OQ-01). `axum-server` wraps the `ServerConfig` construction
and may not expose `CryptoProvider` configuration. Direct `tokio-rustls`
usage gives us full control.
- **ALPN configuration**: ACME TLS-ALPN-01 challenge requires adding
`acme-tls/1` to the ALPN protocol list. This is only possible with direct
`ServerConfig` access.
- **Proven pattern**: alknet uses exactly this approach (`TlsAcceptor` wrapping
`tokio-rustls`, with manual or ACME `ServerConfig` construction).
- **No abstraction cost**: The code to bind TCP, accept TLS, and serve to
axum/hyper is ~50 lines. `axum-server` saves little for our simple case.
## Consequences
**Positive:**
- Full control over TLS configuration
- Direct `rustls-acme` integration
- Ability to add ALPN protocols for ACME challenges
- Proven pattern from alknet
**Negative:**
- Slightly more code than `axum-server` (~50 lines for the TLS acceptor loop)
- Need to manage the TCP listener and TLS accept explicitly
- Must handle the `TlsStream<TcpStream>` → `hyper::service_fn` → axum
integration manually (well-documented pattern from Felix Knorr's blog and
alknet)
## References
- [tls.md](../tls.md)
- alknet transport layer (`alknet-core/src/transport/tls.rs`, `alknet-core/src/transport/acme.rs`)

@@ -0,0 +1,77 @@
# ADR-006: Token Bucket Rate Limiting with In-Memory State
## Status
Accepted
## Context
The proxy must enforce request rate limits per client IP address, replacing
nginx's `limit_req_zone` directive. Rate limiting is critical for preventing
abuse and for fail2ban integration (rate-limited requests trigger fail2ban
actions).
Several rate limiting approaches exist:
- **Token bucket**: Tokens accumulate at a fixed rate; each request consumes a
token. Allows short bursts up to the bucket capacity.
- **Leaky bucket**: Requests are processed at a fixed rate; excess requests
queue or are rejected. No burst allowance.
- **Fixed window**: Count requests in fixed time windows (e.g., per minute).
Allows burst at window boundaries.
- **Sliding window**: Count requests in a rolling time window. More accurate
than fixed window but more complex.
The current nginx config uses `limit_req zone=gitea_limit burst=20 nodelay`.
nginx implements `limit_req` as a leaky bucket, but with `burst` and `nodelay`
it behaves like a token bucket with a burst allowance.
For state storage:
- **In-memory HashMap**: Fast, no external dependencies, lost on restart.
- **External store (Redis, etc.)**: Shared across instances, persists across
restarts. Adds operational complexity.
- **tower-governor crate**: Pre-built rate limiting middleware. Uses the
Generic Cell Rate Algorithm (GCRA). Adds a dependency.
## Decision
Use a token bucket algorithm with in-memory `HashMap<IpAddr, TokenBucket>`
state, protected by `tokio::sync::Mutex`. Rate limiting runs as axum middleware
before the proxy handler.
Rate limits are global per-IP (not per-site) in Phase 1. Per-site rate limits
may be added in Phase 2 as the config model evolves.
Stale entries in the HashMap are cleaned up periodically. A background task
scans the HashMap at a configurable interval (default: 60 seconds) and removes
entries that haven't been accessed within the cleanup interval.
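
A minimal token bucket along these lines, using only the standard library, might look like this. It is a sketch, not the planned implementation: locking is left to the caller (the decision above wraps the map in `tokio::sync::Mutex`), `burst` is assumed to map directly to the bucket capacity, and `RateLimiter`, `check`, `cleanup`, and `tracked_clients` are hypothetical names.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Per-client state: remaining tokens and the time of the last request.
struct TokenBucket {
    tokens: f64,
    last_seen: Instant,
}

/// In-memory limiter; the caller is expected to wrap it in a mutex.
pub struct RateLimiter {
    rate_per_sec: f64, // refill rate (rate_limit.requests_per_second)
    capacity: f64,     // maximum stored tokens (assumed equal to rate_limit.burst)
    buckets: HashMap<IpAddr, TokenBucket>,
}

impl RateLimiter {
    pub fn new(requests_per_second: u32, burst: u32) -> Self {
        Self {
            rate_per_sec: f64::from(requests_per_second),
            capacity: f64::from(burst),
            buckets: HashMap::new(),
        }
    }

    /// Returns true if the request is allowed, false if it should get a 429.
    /// `now` is a parameter so the logic can be tested without sleeping.
    pub fn check(&mut self, ip: IpAddr, now: Instant) -> bool {
        let capacity = self.capacity;
        let bucket = self
            .buckets
            .entry(ip)
            .or_insert(TokenBucket { tokens: capacity, last_seen: now });
        // Refill for the time since the last request, capped at capacity.
        let elapsed = now.saturating_duration_since(bucket.last_seen).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * self.rate_per_sec).min(capacity);
        bucket.last_seen = now;
        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            true
        } else {
            false
        }
    }

    /// Drop buckets that have been idle for at least `max_idle`.
    /// Intended to be called periodically by a background task.
    pub fn cleanup(&mut self, now: Instant, max_idle: Duration) {
        self.buckets
            .retain(|_, b| now.saturating_duration_since(b.last_seen) < max_idle);
    }

    pub fn tracked_clients(&self) -> usize {
        self.buckets.len()
    }
}
```

Passing `now` explicitly keeps the refill arithmetic deterministic in tests; production code would pass `Instant::now()`.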
## Rationale
- Token bucket matches nginx's `limit_req burst` semantics, ensuring
behavioral compatibility during migration.
- In-memory state is sufficient for a single-instance proxy (no shared state
needed).
- `tokio::sync::Mutex` (not `std::sync::Mutex`) can be held across await
points without blocking a runtime worker thread, and it integrates with the
async runtime.
- Custom implementation gives full control over logging output for fail2ban
integration (ADR-007).
- State loss on restart is acceptable — rate limit state is inherently
ephemeral.
## Consequences
**Positive:**
- Behavioral compatibility with nginx rate limiting
- Full control over fail2ban log format
- No external dependencies (Redis, etc.)
- Simple implementation (~100 lines)
**Negative:**
- Rate limit state is lost on restart (acceptable for single-instance deploy)
- Not suitable for multi-instance deployments without external state store
(Phase 1 is single-instance)
- HashMap grows over time without eviction (mitigated by periodic cleanup)
## References
- [operations.md](../operations.md)
- nginx `limit_req` documentation

@@ -0,0 +1,67 @@
# ADR-007: Custom Structured Log Format for Fail2ban
## Status
Accepted
## Context
The proxy needs to produce log output that fail2ban can parse to detect and ban
abusive IP addresses. The current nginx setup uses nginx's default log format
with standard fail2ban filters.
Options for fail2ban integration:
- **nginx-compatible format**: Replicate nginx's log format so existing
fail2ban filters work unchanged. Couples us to nginx's format.
- **Custom structured format**: Design a clean, parseable format with a
corresponding custom fail2ban filter. Gives us control and clarity.
- **JSON format**: Machine-readable but harder for fail2ban regex matching.
## Decision
Use a custom structured log format with a corresponding custom fail2ban filter.
The format for rate-limited requests:
```
RATE_LIMIT client_ip=<IP> host=<host> path=<path> status=429
```
The format for general access logs:
```
REQUEST client_ip=<IP> host=<host> method=<METHOD> path=<path> status=<code> upstream=<addr> duration_ms=<ms>
```
A corresponding fail2ban filter (`/etc/fail2ban/filter.d/reverse-proxy.conf`)
uses regex matching on the `RATE_LIMIT` prefix and `client_ip=<HOST>` field.
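
Producing these two line formats could look like the sketch below. The function names are hypothetical, and the `sanitize` step (replacing whitespace and control characters with `_` so that a crafted host or path cannot break the `key=value` layout or forge a second log line) is an assumption that goes beyond what this ADR specifies.

```rust
use std::net::IpAddr;

/// Assumption: neutralize whitespace and control characters in
/// client-controlled values so each event stays on one parseable line.
fn sanitize(value: &str) -> String {
    value
        .chars()
        .map(|c| if c.is_whitespace() || c.is_control() { '_' } else { c })
        .collect()
}

/// Line emitted when the rate limiter rejects a request.
pub fn rate_limit_line(client_ip: IpAddr, host: &str, path: &str) -> String {
    let (host, path) = (sanitize(host), sanitize(path));
    format!("RATE_LIMIT client_ip={client_ip} host={host} path={path} status=429")
}

/// Line emitted for every proxied request.
pub fn request_line(
    client_ip: IpAddr,
    host: &str,
    method: &str,
    path: &str,
    status: u16,
    upstream: &str,
    duration_ms: u128,
) -> String {
    let (host, method, path) = (sanitize(host), sanitize(method), sanitize(path));
    format!(
        "REQUEST client_ip={client_ip} host={host} method={method} path={path} \
         status={status} upstream={upstream} duration_ms={duration_ms}"
    )
}
```

Taking `client_ip` as an `IpAddr` rather than a string means the field that fail2ban acts on can only ever contain a well-formed address.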
## Rationale
- Custom format is clear, unambiguous, and self-documenting
- No coupling to nginx's format, which may change or include fields we don't
produce
- `key=value` pairs are easy to parse with regex and easy to extend
- The `RATE_LIMIT` prefix makes it trivial to distinguish rate-limit events
from other logs
- Writing a custom fail2ban filter is straightforward (5 lines of config)
- We control both sides (the proxy and the filter), so compatibility is
guaranteed
## Consequences
**Positive:**
- Clean, purpose-built format
- Easy to extend with new fields
- No dependency on nginx log format
- Custom fail2ban filter is simple to maintain
**Negative:**
- Cannot reuse existing nginx fail2ban filters (trivial to write our own)
- Existing fail2ban configurations need updating (acceptable since we're
replacing nginx entirely)
## References
- [operations.md](../operations.md)
- [open-questions.md](../open-questions.md) OQ-02 (now resolved)

@@ -0,0 +1,76 @@
# ADR-008: Static/Dynamic Configuration Split with ArcSwap
## Status
Accepted
## Context
The proxy needs configuration that can be partially reloaded at runtime (site
definitions, rate limits) without restarting the process and dropping active
connections. However, some configuration (bind addresses, TLS mode) fundamentally
requires creating new listeners and cannot be changed at runtime.
Two approaches:
- **Full restart for all config changes**: Simple, but requires dropping
active connections for every change, including trivial rate limit adjustments.
- **Static/dynamic split**: Immutable parameters (bind address, TLS mode) in a
`StaticConfig` that requires restart; runtime-adjustable parameters (sites,
rate limits) in a `DynamicConfig` that can be atomically swapped via
`Arc<ArcSwap<DynamicConfig>>` without dropping connections.
This pattern is proven in the alknet project, which uses the same
`ArcSwap<DynamicConfig>` approach for auth policy, forwarding rules, and rate
limits.
## Decision
Split configuration into `StaticConfig` (immutable after startup) and
`DynamicConfig` (hot-reloadable via `ArcSwap`). The split is:
**StaticConfig** (restart required):
- Bind address, HTTP port, HTTPS port
- TLS mode (ACME vs. manual), cert paths, ACME settings
- Log level and format
**DynamicConfig** (hot-reloadable via SIGHUP):
- Site definitions (hostname → upstream mappings)
- Rate limits (requests per second, burst)
- Body size limits
`ConfigReloadHandle` provides a `reload(DynamicConfig)` method that atomically
swaps the entire config. All request handlers read `DynamicConfig` via
`ArcSwap::load()` — a lock-free operation.
## Rationale
- Rate limits and site definitions change more frequently than bind addresses
and TLS settings. Hot-reload avoids unnecessary downtime.
- `ArcSwap` provides lock-free reads and atomic writes — no partial updates,
no lock contention on the hot path.
- Proven pattern from alknet, where it's used for auth policy, forwarding
rules, and rate limits.
- SIGHUP trigger is simple, well-understood, and compatible with systemd and
process supervisors.
- The entire config is swapped at once, preventing inconsistent states where
some sites use the old config and others use the new one.
## Consequences
**Positive:**
- Zero-downtime config reload for sites and rate limits
- Lock-free reads on the request hot path
- Atomic config updates — no partial states
- Proven pattern from alknet
**Negative:**
- Two config types add conceptual complexity
- SIGHUP reload requires reading the config file from disk (need to handle
file read errors gracefully)
- Must validate DynamicConfig before swapping (invalid config must not replace
valid config)
## References
- [config.md](../config.md)
- alknet ADR-030 (static/dynamic config split)

@@ -0,0 +1,62 @@
# ADR-009: Signal Handling Strategy
## Status
Accepted
## Context
The proxy needs to handle Unix signals for:
- **Graceful shutdown**: SIGTERM and SIGINT should stop accepting new
connections, drain in-flight requests, then exit.
- **Config reload**: SIGHUP should trigger a DynamicConfig reload from disk.
Two approaches for signal handling:
- **`tokio::signal`**: Built into tokio. `ctrl_c()` covers SIGINT only; SIGTERM
and SIGHUP require the Unix-specific `tokio::signal::unix::signal` API.
- **`signal-hook`**: External crate. Handles all Unix signals including SIGHUP.
More flexible but adds a dependency.
## Decision
Use `signal-hook` for all signal handling. Specifically:
- `signal_hook::flag` to set termination flags on SIGTERM/SIGINT
- `signal-hook` to register a SIGHUP handler that triggers config reload
`tokio::signal::ctrl_c()` is registered as a secondary shutdown trigger; both
mechanisms converge on the same shutdown path. This is a belt-and-suspenders
approach: `signal-hook` handles all signals including SIGHUP, while
`ctrl_c()` provides a fallback for environments where signal handling may not
be fully wired (e.g., container runtimes).
Signal handling behavior:
1. On SIGTERM or SIGINT: stop accepting new connections, wait up to 30 seconds
for in-flight requests to complete, then exit with code 0.
2. On SIGHUP: re-read config file, validate, and swap DynamicConfig if valid.
Log the result.
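
The shutdown half of this sequence can be sketched with atomics from the standard library. The `Shutdown` type and its methods are hypothetical; in the real proxy the flag would be set from a signal handler (for example one registered with `signal_hook::flag::register`), and the wait would use async timers rather than the blocking `thread::sleep` used here for simplicity.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Shared shutdown state. Set by hand here so the sketch stays
/// dependency-free; a signal handler would set `requested` in practice.
#[derive(Default)]
pub struct Shutdown {
    pub requested: AtomicBool,
    pub in_flight: AtomicUsize,
}

impl Shutdown {
    /// Checked by the accept loop: stop taking new connections once true.
    pub fn is_requested(&self) -> bool {
        self.requested.load(Ordering::Relaxed)
    }

    /// Wait until all in-flight requests finish or `grace` elapses.
    /// Returns true if the drain completed, false if it timed out.
    pub fn drain(&self, grace: Duration) -> bool {
        let deadline = Instant::now() + grace;
        while self.in_flight.load(Ordering::Acquire) > 0 {
            if Instant::now() >= deadline {
                return false;
            }
            thread::sleep(Duration::from_millis(5));
        }
        true
    }
}
```

With the 30-second grace period described above, the process would call `drain(Duration::from_secs(30))` after the accept loop stops and exit afterwards regardless of the result.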
## Rationale
- SIGHUP handling is required for config reload, and `tokio::signal::ctrl_c()`
does not cover it (tokio would need its separate `tokio::signal::unix` API).
- `signal-hook` is well-maintained, widely used, and handles all Unix signals.
- Routing every signal through `signal-hook` as the primary mechanism, with
`ctrl_c()` kept only as a redundant shutdown trigger, keeps the handling logic
in one place and avoids edge cases.
- `signal_hook::flag` is a minimal, safe API for signal-triggered flags.
## Consequences
**Positive:**
- SIGHUP for config reload is simple and well-understood
- One primary signal handling mechanism (`signal-hook`) for all signals
- Compatible with systemd (SIGTERM for shutdown) and standard Unix conventions
**Negative:**
- `signal-hook` is an additional dependency (but a well-established one)
- Signal handling requires careful coordination with the tokio runtime (async
signal receivers must be properly integrated)
## References
- [operations.md](../operations.md)
- [config.md](../config.md)

@@ -0,0 +1,86 @@
---
status: draft
last_updated: 2026-06-11
---
# Open Questions
## TLS
### OQ-01: Should cipher suites be restricted beyond rustls defaults?
- **Origin**: [tls.md](tls.md)
- **Status**: open
- **Priority**: medium
- **Context**: Our current nginx config explicitly restricts cipher suites to
four ECDHE-AES-GCM suites. rustls 0.23 with `aws_lc_rs` defaults to a
conservative set that excludes all weak ciphers (no SHA-1, no 3DES, no RC4,
no CBC-mode suites, no RSA key exchange). The defaults include TLS 1.3 suites
which nginx also allows. Restricting further would reduce compatibility with
older clients; not restricting means accepting a wider (but still safe) set
than the current nginx config.
- **Cross-references**: ADR-005
## Logging and Monitoring
### ~~OQ-02: What log format should fail2ban consume?~~
- **Origin**: [operations.md](operations.md), [proxy.md](proxy.md)
- **Status**: resolved
- **Priority**: high
- **Resolution**: Custom structured log format with `key=value` pairs and
`RATE_LIMIT` prefix. A corresponding custom fail2ban filter will be provided.
See ADR-007.
- **Cross-references**: ADR-007
### OQ-03: Should the health check endpoint be on a separate port?
- **Origin**: [operations.md](operations.md)
- **Status**: open
- **Priority**: low
- **Context**: Currently the health check is on the main HTTPS listener at
`/health`. Alternatives: (a) separate unencrypted port for health checks
(simpler for load balancers but less secure), (b) admin port with its own
listener (more complex but isolates operational traffic), (c) on the main
listener (simplest, proposed approach). For a single-server deployment behind
no external load balancer, the main listener is fine.
- **Cross-references**: None
## Configuration
### OQ-04: Should config reload support a Unix domain socket API in addition to SIGHUP?
- **Origin**: [config.md](config.md)
- **Status**: open
- **Priority**: low
- **Context**: Phase 1 uses SIGHUP for config reload, which is simple and proven.
A Unix domain socket API would allow programmatic reload (e.g., from an admin
tool or CI/CD pipeline) and could return success/failure status. This adds
complexity and is not needed for Phase 1.
- **Cross-references**: None
## Deployment
### OQ-05: Should the proxy bind to multiple addresses or just one?
- **Origin**: [overview.md](overview.md)
- **Status**: open
- **Priority**: low
- **Context**: Current nginx config binds to a specific IP (`15.235.125.95`).
The proposed config uses `bind_addr` which could be any IP. For Phase 1, the
config will specify a single IP address. Multi-address binding (listening on
multiple IPs) is not needed but could be added as an array of addresses.
- **Cross-references**: None
## Proxy
### OQ-06: Should upstream timeouts be configurable per-site?
- **Origin**: [proxy.md](proxy.md)
- **Status**: open
- **Priority**: low
- **Context**: Phase 1 uses global defaults (5s connect timeout, 60s request
timeout) for all upstream connections. Per-site timeout configuration would
allow tuning for different upstream services (e.g., a slow database-backed
API vs. a fast static site). Not needed for Phase 1 with a single upstream.
- **Cross-references**: None


@@ -0,0 +1,250 @@
---
status: draft
last_updated: 2026-06-11
---
# Operations
## What It Is
The operations component covers everything related to running the proxy in
production: rate limiting, logging (fail2ban integration), health checks,
systemd integration, and graceful shutdown.
## Why It Exists
A reverse proxy that can't be monitored, rate-limited, or gracefully restarted
is not production-ready. These concerns are cross-cutting — they affect the
proxy handler, the TLS layer, and the config system.
## Rate Limiting
### Requirements
- Limit requests per IP address (replacing nginx's `limit_req_zone`)
- Default: 10 requests/second with burst of 20 (matching current nginx config)
- Configurable via DynamicConfig (no restart needed)
- Must produce logs that fail2ban can consume
### Design
The rate limiter runs as axum middleware before the proxy handler. It uses a
token bucket algorithm per client IP, matching nginx's `limit_req burst`
semantics.
Rate limits are global per-IP in Phase 1 (not per-site). A request from IP
address X counts against the same bucket regardless of which site it targets.
Per-site rate limits may be added in Phase 2.
When a request exceeds the rate limit, the middleware returns `429 Too Many
Requests` and logs the event with structured fields.
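A minimal sketch of the per-IP token bucket described above, assuming the bucket capacity equals the configured burst and tokens refill continuously at the configured rate. `RateLimiter`, `Bucket`, and `check` are illustrative names, not the project's actual types, and the current time is passed in explicitly so the logic can be exercised without sleeping.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::Instant;

/// One bucket per client IP. `tokens` may be fractional between refills.
struct Bucket {
    tokens: f64,
    last_seen: Instant,
}

/// Hypothetical limiter type; names and layout are illustrative only.
pub struct RateLimiter {
    rate_per_sec: f64, // steady-state refill rate (10 in the default config)
    burst: f64,        // bucket capacity (20 in the default config)
    buckets: HashMap<IpAddr, Bucket>,
}

impl RateLimiter {
    pub fn new(rate_per_sec: f64, burst: f64) -> Self {
        Self { rate_per_sec, burst, buckets: HashMap::new() }
    }

    /// Returns true if the request is allowed, false if it should get a 429.
    pub fn check(&mut self, ip: IpAddr, now: Instant) -> bool {
        let burst = self.burst;
        // A previously unseen IP starts with a full bucket.
        let bucket = self
            .buckets
            .entry(ip)
            .or_insert(Bucket { tokens: burst, last_seen: now });
        // Refill in proportion to elapsed time, capped at the burst size.
        let elapsed = now.saturating_duration_since(bucket.last_seen).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * self.rate_per_sec).min(burst);
        bucket.last_seen = now;
        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With the defaults, a client can send 20 requests back to back, after which it is held to roughly 10 per second until the bucket refills.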
### State Eviction
The per-IP token bucket state grows over time as new IPs are seen. A
background task runs at a configurable interval (default: 60 seconds) and
removes entries that haven't been accessed within the cleanup interval. This
prevents unbounded memory growth.
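The sweep itself could look roughly like the following. This sketch assumes the limiter tracks a last-seen timestamp per IP; `evict_idle` is a hypothetical helper name, and the background task that calls it on a timer is omitted.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Hypothetical sweep: drop every entry not seen within `max_idle` of `now`.
/// `last_seen` maps each client IP to the time of its latest request.
/// Returns how many entries were removed.
pub fn evict_idle(
    last_seen: &mut HashMap<IpAddr, Instant>,
    now: Instant,
    max_idle: Duration,
) -> usize {
    let before = last_seen.len();
    // Keep only entries that were touched recently enough.
    last_seen.retain(|_, seen| now.saturating_duration_since(*seen) <= max_idle);
    before - last_seen.len()
}
```

Evicting an idle entry does not loosen the limit: with the default rate and burst, a bucket untouched for 60 seconds has long since refilled, so a fresh bucket for the same IP behaves identically.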
### Fail2ban Integration
Rate limit events are logged in a structured format that a custom fail2ban
filter can parse. See [ADR-007](decisions/007-custom-log-format.md) for the
format decision.
The log format uses `key=value` pairs with a `RATE_LIMIT` prefix:
```
RATE_LIMIT client_ip=X.X.X.X host=Y.Z path=/W status=429
```
A corresponding fail2ban filter and jail configuration are provided as part
of the deployment documentation.
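To make the format concrete, here is a small std-only sketch that produces the line and extracts a field from it the way a filter would. `rate_limit_line` and `field` are hypothetical helpers; the real proxy would emit the event through `tracing`, and the sketch assumes values never contain spaces.

```rust
use std::net::IpAddr;

/// Hypothetical formatter for the rate-limit event line.
pub fn rate_limit_line(client_ip: IpAddr, host: &str, path: &str) -> String {
    format!("RATE_LIMIT client_ip={client_ip} host={host} path={path} status=429")
}

/// Extracts the value of `key` from a `key=value` line.
/// Assumption: values contain no whitespace, so splitting on spaces is safe.
pub fn field<'a>(line: &'a str, key: &str) -> Option<&'a str> {
    line.split_whitespace()
        .filter_map(|tok| tok.split_once('='))
        .find(|(k, _)| *k == key)
        .map(|(_, v)| v)
}
```

Because `host` and `path` come from the client, an implementation would likely need to escape or reject whitespace and control characters before logging, so that a crafted request cannot forge a `client_ip=` field.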
## Logging
### Structure
All logs use `tracing` with structured fields. The proxy outputs two types of
log entries:
1. **Access logs**: Every proxied request is logged at `info` level with
structured fields.
```
REQUEST client_ip=1.2.3.4 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
```
2. **Event logs**: Rate limits, TLS errors, upstream failures, config reloads,
etc.
```
RATE_LIMIT client_ip=1.2.3.4 host=git.alk.dev path=/login status=429
UPSTREAM_ERROR host=git.alk.dev upstream=127.0.0.1:3000 error="connection refused"
CONFIG_RELOAD status=success sites=1
```
### Output
Logs are written to:
- **stdout/stderr**: For systemd/journald integration
- **File** (optional): For fail2ban consumption at
`/var/log/reverse-proxy/access.log`
The `tracing-subscriber` layer configuration supports both simultaneously via
`Layer` composition.
### Log Levels
| Level | Use |
|-------|-----|
| `error` | Unrecoverable failures (TLS handshake failure, config validation) |
| `warn` | Rate limit exceeded, upstream unreachable, upstream timeout |
| `info` | Access logs, config reloads, ACME events, startup/shutdown |
| `debug` | Request/response headers, connection details |
| `trace` | Detailed protocol-level information |
Configurable via `log_level` in StaticConfig.
## Health Check
### Endpoint
```
GET /health → 200 OK (empty body)
```
The health check endpoint is accessible on the main HTTPS listener. It returns
200 if the process is alive and serving requests.
**Limitation**: Because `/health` is served over TLS, a failed probe cannot
distinguish a dead process from a TLS configuration error that prevents the
handshake from completing. External monitoring should also check TCP
connectivity to port 443 independently to tell the two apart.
### What It Checks
- Process is running and the tokio runtime is responsive
- TLS listener is accepting connections
- Config is loaded (StaticConfig and DynamicConfig are initialized)
It does **not** check upstream reachability. The health check answers "is the
proxy process healthy?", not "is the upstream reachable?" — upstream health is
a separate concern that would produce 502/504 responses in the proxy handler.
### Future Extensions
- `/health/ready` — readiness check that includes upstream reachability
- Prometheus metrics at `/metrics`
## Systemd Integration
### Unit File
```ini
[Unit]
Description=Reverse Proxy
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
NotifyAccess=all
ExecStart=/usr/local/bin/reverse-proxy --config /etc/reverse-proxy/config.toml
Restart=on-failure
RestartSec=5
# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ReadWritePaths=/var/lib/reverse-proxy /var/log/reverse-proxy
# ACME challenge cache directory
StateDirectory=reverse-proxy
[Install]
WantedBy=multi-user.target
```
The proxy signals readiness to systemd via `sd_notify` after binding listeners
and completing the initial configuration load.
## Graceful Shutdown
### Signal Handling
The proxy handles three signals via `signal-hook` (see [ADR-009](decisions/009-signal-handling.md)):
- **SIGTERM / SIGINT**: Graceful shutdown. Stop accepting new connections, wait
for in-flight requests to complete (up to a configurable timeout), then exit.
- **SIGHUP**: Config reload. Re-read the config file, validate, and swap
DynamicConfig if valid.
### SIGHUP for Config Reload
SIGHUP triggers config reload (see [config.md](config.md) for details). The
process does not exit on SIGHUP.
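The validate-then-swap behavior can be sketched as follows. This is a std-only illustration in which `RwLock<Arc<_>>` stands in for `ArcSwap`; the `DynamicConfig` fields and the validation rules shown are assumptions made for the example, not the project's actual schema.

```rust
use std::sync::{Arc, RwLock};

/// Illustrative stand-in for the real DynamicConfig.
#[derive(Debug, Clone, PartialEq)]
pub struct DynamicConfig {
    pub sites: Vec<(String, String)>, // (hostname, upstream address)
}

/// Hypothetical validation rule: at least one site, no empty fields.
fn validate(cfg: &DynamicConfig) -> Result<(), String> {
    if cfg.sites.is_empty() {
        return Err("no sites defined".into());
    }
    if cfg.sites.iter().any(|(h, u)| h.is_empty() || u.is_empty()) {
        return Err("site with empty host or upstream".into());
    }
    Ok(())
}

/// Swap in `candidate` only if it validates; otherwise keep the old config.
/// Readers that already cloned the old `Arc` keep using it until they finish.
pub fn reload(
    current: &RwLock<Arc<DynamicConfig>>,
    candidate: DynamicConfig,
) -> Result<(), String> {
    validate(&candidate)?;
    *current.write().unwrap() = Arc::new(candidate);
    Ok(())
}
```

The important property is that an invalid file leaves the running configuration untouched, so a typo followed by SIGHUP cannot take the proxy down.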
### Timeout
In-flight requests have a configurable shutdown timeout (default: 30 seconds).
After the timeout, remaining connections are forcefully closed and the process
exits.
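A minimal sketch of the drain step, assuming the proxy keeps a counter of in-flight requests. `drain` is a hypothetical helper; the real implementation would await this inside the tokio runtime, and a blocking poll is used here only to keep the example std-only.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical drain loop: wait until `in_flight` reaches zero or the
/// timeout expires. Returns true if all requests finished in time, false
/// if the caller should forcefully close the remaining connections.
pub fn drain(in_flight: &AtomicUsize, timeout: Duration, poll: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if in_flight.load(Ordering::SeqCst) == 0 {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(poll);
    }
}
```

With the default, the call would be `drain(&in_flight, Duration::from_secs(30), Duration::from_millis(100))`, where the poll interval is an arbitrary choice for the sketch.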
## Deployment
### Binary
Single self-contained binary with no runtime dependencies beyond libc:
```bash
cargo build --release
# Produces: target/release/reverse-proxy
```
The binary is self-contained — no system libraries beyond libc for DNS
resolution. The `aws_lc_rs` crypto provider is statically linked.
### Configuration
```bash
# Config file
/etc/reverse-proxy/config.toml
# ACME cache directory
/var/lib/reverse-proxy/acme-cache/
# Log directory (optional, for fail2ban)
/var/log/reverse-proxy/
```
### CLI
```bash
reverse-proxy [OPTIONS]
Options:
--config <PATH> Path to config file [default: /etc/reverse-proxy/config.toml]
--validate Validate config and exit
--help Show help
--version Show version
```
## Design Decisions
All design decisions are documented as ADRs in [decisions/](decisions/).
| ADR | Decision | Summary |
|-----|----------|---------|
| [001](decisions/001-rust-axum.md) | Rust with axum | Memory safety; single binary deployment |
| [006](decisions/006-rate-limiting-approach.md) | Token bucket rate limiting | In-memory per-IP token bucket matching nginx burst semantics |
| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
| [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-03**: Should the health check endpoint be on a separate port? (open)


@@ -0,0 +1,166 @@
---
status: draft
last_updated: 2026-06-11
---
# Overview
## Vision
A memory-safe, minimal reverse proxy that replaces our vulnerable nginx instance
for reverse-proxying to backend services. The proxy terminates TLS, injects
standard proxy headers, enforces rate limits, and forwards requests to upstream
services — with operational feature parity for our current single-domain Gitea
setup.
## Why This Exists
Our nginx 1.24.0 installation is vulnerable to multiple actively-exploited
CVEs, including CVE-2026-42945 (unauthenticated RCE via `rewrite`/`set`
directives). The broader threat landscape is worsening: LLM-assisted fuzzing
is accelerating bug discovery in nginx's C codebase, and security researchers
report additional undisclosed vulnerabilities. Upgrading nginx patches known
CVEs but does not address the structural problem — memory corruption bugs are
endemic to C, and the discovery rate is accelerating.
Rust's memory safety eliminates the entire class of buffer overflow,
use-after-free, and double-free bugs that constitute 6 of 7 recent nginx CVEs.
Combined with rustls (pure Rust TLS, no OpenSSL dependency), this provides a
fundamentally safer baseline.
See [threat-landscape.md](../research/threat-landscape.md) for full vulnerability
details.
## Scope
### In Scope
- **Phase 1**: Replace nginx for `git.alk.dev` with feature parity
- TLS termination with ACME (Let's Encrypt) certificate management
- Manual certificate paths as fallback mode
- HTTP → HTTPS redirect
- Reverse proxy to Gitea at `127.0.0.1:3000`
- Proxy header injection (Host, X-Real-IP, X-Forwarded-For, X-Forwarded-Proto)
- Request rate limiting with fail2ban-compatible logging (global per-IP; per-site in Phase 2)
- 100 MB body size limit (global; per-site in Phase 2)
- Configurable bind address (no `0.0.0.0` default)
- Health check endpoint
- Graceful shutdown (SIGTERM handling)
- Systemd unit file
- **Phase 2**: Multi-site support
- SNI-based TLS routing for multiple domains
- Config file for site definitions
- Dynamic config reload (ArcSwap pattern)
- **Phase 3**: Operational hardening
- Metrics endpoint (Prometheus-compatible)
- Connection limits and timeouts
- Log rotation
### Out of Scope
- HTTP/2 or HTTP/3 proxying (services that need these run their own native
Rust servers — e.g., `api.alk.dev`)
- Load balancing or round-robin upstream selection
- WebSocket proxying (can be added later if needed)
- Static file serving
- Access control beyond rate limiting (no auth, no IP allowlists in Phase 1)
- CGI, SCGI, uWSGI, FastCGI
## Architecture
```
                    ┌──────────────────────────────────────┐
                    │ reverse-proxy (Rust/axum)            │
config.toml ──────► │ StaticConfig + DynamicConfig         │
                    │ (ArcSwap for hot-reload)             │
                    │                                      │
bind_addr:80    ──► │ HTTP listener → 301 redirect         │
                    │                 to HTTPS             │
                    │                                      │
bind_addr:443   ──► │ TLS listener (tokio-rustls)          │
                    │ ├─ ACME mode: rustls-acme resolver   │
                    │ │  (auto cert provisioning/renewal)  │
                    │ └─ Manual mode: cert/key file paths  │
                    │                                      │
                    │ axum router                          │
                    │ ├─ Host-based routing                │
                    │ ├─ Rate limiting middleware          │
                    │ ├─ Proxy header injection            │
                    │ ├─ Body size limit (100MB)           │
                    │ └─ Reverse proxy handler             │
                    │    └─ hyper Client → upstream        │
                    │                                      │
                    │ /health → 200 OK                     │
                    └──────────────────────────────────────┘
```
## Crate Dependencies
### Core
| Crate | Version | Purpose | Notes |
|-------|---------|---------|-------|
| `axum` | 0.8 | HTTP framework | Routing, middleware, extractors |
| `tokio` | 1 (full) | Async runtime | Multi-threaded runtime |
| `hyper` | 1 | HTTP protocol | Used via axum; in hyper 1.x the pooled proxy `Client` comes from the companion `hyper-util` crate |
| `tower` | 0.5 | Middleware ecosystem | Service trait, layers |
| `rustls` | 0.23 | TLS implementation | `aws_lc_rs` crypto provider |
| `tokio-rustls` | 0.26 | Async TLS I/O | Wraps TCP with TLS |
| `rustls-acme` | 0.12 | ACME client | Let's Encrypt auto-provisioning and renewal |
### Supporting
| Crate | Version | Purpose | Notes |
|-------|---------|---------|-------|
| `serde` | 1 | Serialization | TOML config deserialization |
| `toml` | 0.8 | Config format | Declarative site definitions |
| `arc-swap` | 1 | Atomic config swap | Lock-free DynamicConfig reload |
| `tracing` | 0.1 | Structured logging | fail2ban-compatible output |
| `tracing-subscriber` | 0.3 | Log output | File + journald support |
| `rustls-pemfile` | 2 | PEM parsing | Manual cert loading |
| `rustls-pki-types` | 1 | TLS types | CertificateDer, PrivateKeyDer |
| `clap` | 4 | CLI arguments | Server startup options |
| `signal-hook` | 0.3 | Signal handling | SIGTERM/SIGINT for shutdown, SIGHUP for config reload |
Versions listed are minimum semver-compatible versions. Exact versions are
pinned by committing `Cargo.lock`, which is standard practice for Rust binaries.
## Exports
This is a single-binary deployment. There are no library exports. The product
is the `reverse-proxy` binary plus a systemd unit file and a config file.
## Dependencies on Other Projects
- **alknet**: The `ArcSwap<DynamicConfig>` pattern, `tokio-rustls` TLS acceptor
construction, `rustls-acme` integration, and `ServerConfig` builder patterns
are adapted from alknet's transport and config layers. These patterns are
referenced as validation that the approaches work in production; all code
in this project is written from scratch.
## Design Decisions
All design decisions are documented as ADRs in [decisions/](decisions/).
| ADR | Decision | Summary |
|-----|----------|---------|
| [001](decisions/001-rust-axum.md) | Rust with axum | Memory safety eliminates the bug class causing nginx CVEs; axum provides ergonomic tower integration |
| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — axum-reverse-proxy adds unnecessary complexity |
| [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
| [004](decisions/004-rustls-acme.md) | ACME-primary certificate management | Eliminates certbot dependency; automatic provisioning and renewal |
| [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly, not axum-server | Full control over TLS config, ACME resolver integration, cipher suite configuration |
| [006](decisions/006-rate-limiting-approach.md) | Token bucket rate limiting | In-memory per-IP token bucket matching nginx burst semantics |
| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
| [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config with ArcSwap | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
| [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
- **OQ-03**: Should the health check endpoint be on a separate port? (open)
- **OQ-05**: Should the proxy bind to multiple addresses or just one? (open)

docs/architecture/proxy.md

@@ -0,0 +1,169 @@
---
status: draft
last_updated: 2026-06-11
---
# Proxy Handler
## What It Is
The proxy handler is the core component that receives an incoming HTTP request
on the TLS-terminated connection, applies middleware (rate limiting, header
injection, body size limits), and forwards it to the upstream service.
## Why It Exists
This component replaces nginx's `proxy_pass` directive. For our use case —
single upstream per domain, no load balancing, no HTTP/2 proxying — a custom
handler is simpler and more maintainable than a general-purpose proxy library.
## Architecture
```
Incoming HTTPS request
┌─────────────────┐
│ axum Router │
│ (Host-based) │─── /health → 200 OK
│ │
│ match Host │
│ header on │
│ incoming req │
└───────┬─────────┘
┌─────────────────┐
│ Rate Limiting │ ← tower middleware layer
│ Middleware │
└───────┬─────────┘
┌─────────────────┐
│ Proxy Header │ ← custom middleware / handler
│ Injection │
│ │
│ X-Real-IP │ ← connect_info remote_addr
│ X-Forwarded-For │ ← append to existing or set
│ X-Forwarded-Proto │ ← "https" (or "http" on port 80)
│ Host │ ← original host header (already set)
└───────┬─────────┘
┌─────────────────┐
│ Body Size Limit │ ← DefaultBodyLimit(100 MB)
│ Middleware │
└───────┬─────────┘
┌─────────────────┐
│ Reverse Proxy │ ← hyper Client request forwarding
│ Handler │
│ │
│ 1. Build upstream│
│ URI from │
│ original req │
│ 2. Forward req │
│ to upstream │
│ 3. Stream │
│ response back │
└─────────────────┘
```
## Request Flow
### 1. Host-Based Routing
The axum router uses a `Host` extractor to match incoming requests to site
definitions from `DynamicConfig`. Each site definition maps a hostname to an
upstream address.
The `host_based_proxy` handler reads the `Host` header, looks up the site in
`DynamicConfig.sites`, and either proxies to the upstream or returns 404.
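The lookup step might look like the sketch below. It is std-only and independent of axum: `Site` and `find_site` are hypothetical names, the map is assumed to be keyed by lowercase hostname, and stripping the port and lowercasing the header are assumptions about normalization that the design does not spell out.

```rust
use std::collections::HashMap;

/// Illustrative site definition; the real one carries more fields.
#[derive(Debug, PartialEq)]
pub struct Site {
    pub upstream: String,
}

/// Hypothetical lookup: normalize the Host header, then find the site.
/// `None` corresponds to the 404 case.
pub fn find_site<'a>(sites: &'a HashMap<String, Site>, host_header: &str) -> Option<&'a Site> {
    // Drop an optional ":port" suffix and compare case-insensitively.
    // (IPv6 literals such as "[::1]:443" are not handled in this sketch.)
    let host = host_header.rsplit_once(':').map_or(host_header, |(h, _)| h);
    sites.get(&host.to_ascii_lowercase())
}
```

Returning 404 for any unknown host also keeps requests that arrive with an arbitrary `Host` header from ever reaching an upstream.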
### 2. Proxy Header Injection
Headers are injected before forwarding. The handler reads connection metadata
from axum's `ConnectInfo` and the original request:
| Header | Value Source | Notes |
|--------|-------------|-------|
| `Host` | Original request `Host` header | Already present; preserved as-is |
| `X-Real-IP` | `ConnectInfo<SocketAddr>` remote IP | Set to client's IP address |
| `X-Forwarded-For` | Client IP, appended if header exists | Comma-separated list: original client first, then each proxy hop |
| `X-Forwarded-Proto` | Determined by listener | `https` on port 443, `http` on port 80 |
The `X-Forwarded-For` handling must append the client IP to any existing value
(rather than replacing it), to support chained proxies correctly.
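The append rule is small enough to show directly. `forwarded_for` is a hypothetical helper that takes the incoming header value (if any) and the connecting peer's IP; treating a blank header as absent is an assumption of the sketch.

```rust
use std::net::IpAddr;

/// Builds the outgoing X-Forwarded-For value.
/// An existing value is preserved and the connecting peer's IP is appended;
/// a missing or blank header yields just the peer's IP.
pub fn forwarded_for(existing: Option<&str>, client_ip: IpAddr) -> String {
    match existing.map(str::trim).filter(|v| !v.is_empty()) {
        Some(prev) => format!("{prev}, {client_ip}"),
        None => client_ip.to_string(),
    }
}
```

Since the proxy faces the public internet, any existing value is client-supplied and untrusted; upstreams that need a reliable address should prefer `X-Real-IP` or the last entry in the list.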
### 3. Request Forwarding
The proxy handler constructs a new request to the upstream:
1. Build the upstream URI using the site's `upstream_scheme` and `upstream`
address, preserving the original path and query string
2. Copy the request method, headers, and body from the original
3. Inject proxy headers (X-Real-IP, X-Forwarded-For, X-Forwarded-Proto)
4. Send the request via a shared hyper Client instance
5. Stream the response back to the client
The client is created once at startup and shared via axum's `State`. In hyper
1.x the pooled client is `hyper_util::client::legacy::Client` from the
`hyper-util` crate. It must be configured with:
- Connection pooling (the default behavior of the pooled client)
- Connect timeout: 5 seconds
- Request timeout: 60 seconds
- No redirect following (proxies should not follow redirects)
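Step 1 of the forwarding sequence (building the upstream URI) can be sketched as a pure string operation. `upstream_uri` is a hypothetical helper, and the sketch assumes `upstream_scheme` is stored without the `://` separator (for example `http`) and that `upstream` is a `host:port` pair.

```rust
/// Hypothetical helper: combine the site's scheme and upstream address with
/// the original request's path and query string.
pub fn upstream_uri(scheme: &str, upstream: &str, path_and_query: &str) -> String {
    // An empty request target is treated as "/".
    let pq = if path_and_query.is_empty() { "/" } else { path_and_query };
    format!("{scheme}://{upstream}{pq}")
}
```

For the initial deployment this maps a request for `/user/repo?tab=issues` to `http://127.0.0.1:3000/user/repo?tab=issues`; only the authority changes, never the path or query.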
### 4. Error Handling
| Upstream Condition | Response | Notes |
|-------------------|----------|-------|
| Upstream reachable | Stream response as-is | Headers, status, body all forwarded |
| Upstream unreachable | 502 Bad Gateway | Logged at `warn` level |
| Upstream timeout | 504 Gateway Timeout | Logged at `warn` level |
| Request body too large | 413 Payload Too Large | From `DefaultBodyLimit` middleware |
| Rate limit exceeded | 429 Too Many Requests | Logged at `warn` level |
| Unknown Host header | 404 Not Found | No matching site definition |
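One way to keep this table and the code in sync is a single enum with an exhaustive match. The sketch below is illustrative: `ProxyError` and `status_code` are hypothetical names, and status codes are plain `u16` values so the example needs no HTTP crate.

```rust
/// Hypothetical error type covering the failure rows of the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProxyError {
    UpstreamUnreachable,
    UpstreamTimeout,
    BodyTooLarge,
    RateLimited,
    UnknownHost,
}

/// Maps each failure to the HTTP status the proxy returns to the client.
pub fn status_code(err: ProxyError) -> u16 {
    match err {
        ProxyError::UpstreamUnreachable => 502, // Bad Gateway
        ProxyError::UpstreamTimeout => 504,     // Gateway Timeout
        ProxyError::BodyTooLarge => 413,        // Payload Too Large
        ProxyError::RateLimited => 429,         // Too Many Requests
        ProxyError::UnknownHost => 404,         // Not Found
    }
}
```

Because the match has no wildcard arm, adding a new failure mode forces a decision about its status code at compile time.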
### 5. HTTP → HTTPS Redirect
A separate HTTP listener on port 80 handles redirect. It reads the `Host`
header from the incoming request and returns a 301 Permanent Redirect to the
HTTPS equivalent URL (preserving the path and query string).
This listener runs on the same bind address as the TLS listener but on port 80.
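Constructing the redirect target might look like this. `https_location` is a hypothetical helper, and stripping an explicit `:80` from the `Host` header is an assumption of the sketch rather than a stated requirement.

```rust
/// Hypothetical helper: build the `Location` value for the 301 response.
pub fn https_location(host_header: &str, path_and_query: &str) -> String {
    // Strip an explicit ":80" so the redirect targets the default HTTPS port.
    let host = host_header.strip_suffix(":80").unwrap_or(host_header);
    // An empty request target is treated as "/".
    let pq = if path_and_query.is_empty() { "/" } else { path_and_query };
    format!("https://{host}{pq}")
}
```

The `Host` header is client-controlled, so an implementation would probably want to redirect only for hostnames that match a configured site and reject everything else, to avoid acting as an open redirector.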
## Upstream Connection
The upstream connection scheme defaults to `http://` since the proxy and backend
services typically run on the same host (e.g., `127.0.0.1:3000`). The
`upstream_scheme` field in each site's configuration allows specifying `https://`
for upstreams that require TLS (e.g., separate hosts or secure internal services).
For the initial deployment (`git.alk.dev` → `127.0.0.1:3000`), the upstream
connection uses plain HTTP, as TLS between the proxy and Gitea on loopback is
unnecessary.
## Body Size Limit
axum's `DefaultBodyLimit` layer sets the maximum request body size. For
compatibility with Gitea's push operations (large pack files), this defaults
to 100 MB. In Phase 1, the body limit is a global setting; Phase 2 may add
per-site body limits.
## Design Decisions
All design decisions are documented as ADRs in [decisions/](decisions/).
| ADR | Decision | Summary |
|-----|----------|---------|
| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — simpler than a general proxy library |
| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-06**: Should upstream timeouts be configurable per-site? (open — Phase 1
uses global defaults of 5s connect, 60s request)

docs/architecture/tls.md

@@ -0,0 +1,220 @@
---
status: draft
last_updated: 2026-06-11
---
# TLS Termination
## What It Is
The TLS termination component handles all aspects of encrypted connections:
certificate provisioning (ACME and manual), TLS handshake, SNI-based certificate
selection, and connection wrapping for the axum router.
## Why It Exists
TLS termination is the security boundary between the public internet and our
upstream services. It replaces nginx's `ssl_certificate`, `ssl_protocols`, and
`ssl_ciphers` configuration with a memory-safe Rust implementation using rustls.
## Architecture
```
┌──────────────────────────────────────────┐
│ TLS Termination │
│ │
bind_addr:443 ──► │ TcpListener::bind(bind_addr) │
│ │ │
│ ▼ │
│ tokio-rustls::TlsAcceptor │
│ │ │
│ ├─ ACME mode: │
│ │ rustls-acme::ResolvesServerCertAcme │
│ │ (auto-provisions & renews certs) │
│ │ │
│ └─ Manual mode: │
│ rustls::ServerConfig │
│ .with_single_cert(cert_chain, key) │
│ │
│ │ │
│ ▼ │
│ TlsStream<TcpStream> │
│ │ │
│ ▼ │
│ hyper::service_fn → axum router │
└──────────────────────────────────────────┘
bind_addr:80 ──► HTTP listener (redirect to HTTPS, no TLS)
```
## Certificate Provisioning
### ACME Mode (Primary)
Uses `rustls-acme` for automatic certificate provisioning and renewal through
Let's Encrypt. This is the primary mode — no certbot dependency, no cron jobs,
no deploy hooks.
**How it works:**
1. `AcmeCertProvider` configures the ACME client with the domain, cache
directory, and Let's Encrypt directory (staging or production).
2. `AcmeConfig::new(vec![domain])` creates an ACME configuration for the
domain.
3. The ACME state machine runs as a background tokio task, handling:
- Account registration with Let's Encrypt
- Certificate ordering
- TLS-ALPN-01 challenge (or HTTP-01 challenge)
- Certificate issuance
- Certificate renewal (automatic, ~30 days before expiry)
4. `ResolvesServerCertAcme` is a rustls `ResolvesServerCert` implementation
that automatically serves the ACME-provisioned certificate.
5. When a new certificate is issued, the resolver updates atomically — no
restart or signal handling needed.
**Configuration:**
```toml
[tls]
mode = "acme"
acme_domain = "git.alk.dev"
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
acme_directory = "production" # or "staging" for testing
```
**Cache directory:** The `DirCache` from rustls-acme persists ACME account data,
private keys, and certificates between restarts. This avoids re-provisioning on
every restart.
### Manual Mode (Fallback)
For environments where ACME is not desired (testing, self-signed certs,
corporate CAs, or BYO certificates), the proxy loads certificates from file
paths at startup.
```toml
[tls]
mode = "manual"
cert_path = "/etc/letsencrypt/live/git.alk.dev/fullchain.pem"
key_path = "/etc/letsencrypt/live/git.alk.dev/privkey.pem"
```
Certificate files are loaded once at startup using `rustls_pemfile`. Manual
mode requires a restart to pick up new certificates.
**Why not hot-reload manual certs?** ACME mode handles renewal automatically.
Manual mode is for cases where you control cert rotation externally (certbot,
manual renewal). In that case, restarting the service after rotation (for
example, from a renewal deploy hook) is simpler and more reliable than file
watching. SIGHUP does not help here: it only swaps DynamicConfig and does not
restart the process. If zero-downtime cert rotation is needed, use ACME mode.
## TLS Configuration
### Protocol Versions
The proxy supports TLS 1.2 and TLS 1.3 only, matching the minimum security
level of the current nginx configuration. The `aws_lc_rs` crypto provider
defaults to these protocol versions; explicit configuration ensures no
regression if defaults change in future rustls releases.
### Cipher Suites
rustls 0.23 with the `aws_lc_rs` crypto provider defaults to a conservative
cipher suite selection that excludes all weak ciphers (no SHA-1, no 3DES, no
RC4, no CBC-mode suites, no RSA key exchange).
The current nginx config explicitly restricts to:
```
ECDHE-ECDSA-AES128-GCM-SHA256
ECDHE-RSA-AES128-GCM-SHA256
ECDHE-ECDSA-AES256-GCM-SHA384
ECDHE-RSA-AES256-GCM-SHA384
```
rustls's defaults include these four suites, the ECDHE ChaCha20-Poly1305 suites
for TLS 1.2, and the TLS 1.3 suites (which nginx's config also allows via
`TLSv1.3`). The default rustls cipher list is a strict subset of what browsers
accept.
See [open-questions.md](open-questions.md) OQ-01 for whether to further
restrict cipher suites beyond rustls defaults.
### ServerConfig Construction
For manual mode, the `ServerConfig` is built with `with_no_client_auth()` and
`with_single_cert()`, loading the certificate chain and private key from disk.
For ACME mode, the `ServerConfig` is built with `with_cert_resolver()`, passing
the `ResolvesServerCertAcme` resolver. The ACME TLS-ALPN-01 protocol identifier
(`acme-tls/1`) must be registered in the `alpn_protocols` list so the server
can respond to TLS-ALPN-01 challenges.
Both modes use the `aws_lc_rs` crypto provider with safe default protocol
versions (TLS 1.2 and TLS 1.3).
## SNI-Based Certificate Selection
### Current (Single Domain)
For single-domain setups, SNI selection is trivial: there's only one
certificate, so `with_single_cert()` or `ResolvesServerCertAcme` (which
handles the domain) is sufficient.
### Future (Multi-Domain)
When multiple domains are served, SNI selection works as follows:
1. **TLS handshake**: The client sends the SNI extension indicating which
hostname it's connecting to.
2. **Certificate resolution**: In ACME mode, `ResolvesServerCertAcme` handles
this automatically — it stores certificates keyed by domain. In manual mode,
a custom `ResolvesServerCert` implementation maps SNI hostname to the
correct `CertifiedKey`.
3. **HTTP routing**: After the TLS handshake, axum's `Host` extractor routes
the request to the correct site handler based on the `Host` header.
This is the same pattern nginx uses: SNI selects the cert during the TLS
handshake, then the `Host` header selects the server block.
## HTTP Listener (Port 80)
The HTTP listener on port 80 is a plain TCP listener with no TLS. It has one
job: redirect all requests to the HTTPS equivalent.
The listener binds to the same IP address as the TLS listener, but on port 80.
### ACME Challenge Type
The default ACME challenge type is **TLS-ALPN-01**, since the proxy already
listens on port 443. This avoids requiring a separate HTTP-01 challenge server.
HTTP-01 is available as a fallback for environments where TLS-ALPN-01 is not
suitable (e.g., behind a CDN that terminates TLS). When using HTTP-01, the
port 80 listener serves `/.well-known/acme-challenge/{token}` paths for
challenge verification.
## Key Files and Crates
| Component | Crate | Purpose |
|-----------|-------|---------|
| TLS acceptor | `tokio-rustls` 0.26 | Async TLS handshake over TCP streams |
| TLS config | `rustls` 0.23 | ServerConfig, CryptoProvider, cipher suites |
| ACME client | `rustls-acme` 0.12 | Automatic cert provisioning and renewal |
| PEM parsing | `rustls-pemfile` 2 | Load cert/key from PEM files (manual mode) |
| PKI types | `rustls-pki-types` 1 | CertificateDer, PrivateKeyDer |
## Design Decisions
All design decisions are documented as ADRs in [decisions/](decisions/).
| ADR | Decision | Summary |
|-----|----------|---------|
| [004](decisions/004-rustls-acme.md) | ACME-primary cert management | Eliminates certbot; automatic provisioning and renewal |
| [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly | Full control over TLS config and ACME resolver integration |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)