Add architecture specification for Rust/axum reverse proxy

Phase 1 architecture docs covering proxy handler, TLS termination (ACME + manual), TOML config with static/dynamic split (ArcSwap), and operations (rate limiting, logging, health check, systemd, graceful shutdown). Nine ADRs documenting key decisions: Rust/axum, custom proxy handler, TOML config, rustls-acme for cert management, tokio-rustls direct, token bucket rate limiting, custom log format for fail2ban, static/dynamic config split, and signal handling strategy. Includes threat landscape research documenting the nginx CVEs motivating this project.
2026-06-11 07:25:50 +00:00
parent 5c54a28822
commit 8ee6284b62
17 changed files with 1819 additions and 0 deletions
--- a/docs/architecture/README.md
+++ b/docs/architecture/README.md
@@ -0,0 +1,61 @@
+---
+status: draft
+last_updated: 2026-06-11
+---
+
+# Reverse Proxy — Architecture
+
+## Current State
+
+**Phase 0 (Exploration) — Complete.** Phase 1 (Architecture) — In progress.
+
+This project replaces our vulnerable nginx 1.24.0 installation with a
+memory-safe Rust/axum reverse proxy. The primary motivation is CVE-2026-42945
+(unauthenticated RCE in nginx's rewrite module) and the broader pattern of
+memory corruption bugs in nginx's C codebase.
+
+## Architecture Documents
+
+| Document | Status | Description |
+|----------|--------|-------------|
+| [overview.md](overview.md) | Draft | Vision, scope, crate dependencies, exports |
+| [proxy.md](proxy.md) | Draft | Reverse proxy handler, request flow, header injection |
+| [tls.md](tls.md) | Draft | TLS termination, ACME, manual certs, SNI |
+| [config.md](config.md) | Draft | TOML config format, static/dynamic split, ArcSwap reload |
+| [operations.md](operations.md) | Draft | Rate limiting, logging, health check, systemd, shutdown |
+
+## ADR Table
+
+| ADR | Title | Status |
+|-----|-------|--------|
+| [001](decisions/001-rust-axum.md) | Rust with Axum | Accepted |
+| [002](decisions/002-custom-proxy-handler.md) | Custom Proxy Handler | Accepted |
+| [003](decisions/003-toml-config.md) | TOML Configuration Format | Accepted |
+| [004](decisions/004-rustls-acme.md) | ACME-Primary Certificate Management | Accepted |
+| [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls Directly, Not axum-server | Accepted |
+| [006](decisions/006-rate-limiting-approach.md) | Token Bucket Rate Limiting with In-Memory State | Accepted |
+| [007](decisions/007-custom-log-format.md) | Custom Structured Log Format for Fail2ban | Accepted |
+| [008](decisions/008-static-dynamic-config-split.md) | Static/Dynamic Configuration Split with ArcSwap | Accepted |
+| [009](decisions/009-signal-handling.md) | Signal Handling Strategy | Accepted |
+
+## Open Questions
+
+See [open-questions.md](open-questions.md) for the full tracker.
+
+| OQ | Question | Priority | Status |
+|----|----------|----------|--------|
+| OQ-01 | Should cipher suites be restricted beyond rustls defaults? | medium | open |
+| ~~OQ-02~~ | ~~What log format should fail2ban consume?~~ | ~~high~~ | **resolved** (ADR-007) |
+| OQ-03 | Should the health check endpoint be on a separate port? | low | open |
+| OQ-04 | Config reload: SIGHUP only or also Unix socket API? | low | open |
+| OQ-05 | Should the proxy bind to multiple addresses? | low | open |
+| OQ-06 | Should upstream timeouts be configurable per-site? | low | open |
+
+## Document Lifecycle
+
+| Status | Meaning | Transitions |
+|--------|---------|-------------|
+| `draft` | Under active development. May change significantly. | → `reviewed` when open questions are resolved |
+| `reviewed` | Architecture is final. Implementation may begin. | → `stable` when implementation is complete |
+| `stable` | Locked. Changes require review and may warrant an ADR. | → `deprecated` when superseded |
+| `deprecated` | Superseded. Kept for reference. | Removed when no longer referenced |
--- a/docs/architecture/config.md
+++ b/docs/architecture/config.md
@@ -0,0 +1,206 @@
+---
+status: draft
+last_updated: 2026-06-11
+---
+
+# Configuration
+
+## What It Is
+
+The configuration system defines how the proxy is configured, how configuration
+is loaded, and how dynamic configuration can be reloaded without restarting the
+process.
+
+## Why It Exists
+
+The proxy needs to be configurable without hard-coding domains, upstream
+addresses, or TLS settings. The configuration system separates immutable
+startup parameters (bind addresses, TLS mode) from runtime-adjustable
+parameters (site definitions, rate limits) using the `ArcSwap` pattern proven
+in the alknet project.
+
+## Architecture
+
+```
+config.toml
+    │
+    ▼
+┌──────────────────────┐
+│  serde::Deserialize  │
+│  (TOML → Config)     │
+└──────────┬───────────┘
+           │
+           ▼
+┌──────────────────────┐     ┌───────────────────────┐
+│  StaticConfig        │     │  DynamicConfig        │
+│  (immutable)         │     │  (hot-reloadable)     │
+│                      │     │                       │
+│  bind_addr           │     │  sites[]              │
+│  http_port           │     │  rate_limit           │
+│  https_port          │     │  body_limit           │
+│  tls.mode            │     │  proxy_headers        │
+│  tls.acme_domain     │     │                       │
+│  tls.cert_path       │     │  ← ArcSwap →          │
+│  tls.key_path        │     │  ConfigReloadHandle   │
+│  tls.acme_cache_dir  │     │  .reload(new_config)  │
+│  log_level           │     │                       │
+│  log_format          │     └───────────────────────┘
+└──────────────────────┘
+```
+
+## Static vs Dynamic Configuration
+
+This split follows the pattern established in alknet (ADR-030) and adapted
+for our simpler use case.
+
+### StaticConfig
+
+Immutable after startup. Changes require a process restart.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `bind_addr` | `String` | IP address to bind to (e.g., `"15.235.125.95"`) |
+| `http_port` | `u16` | Port for HTTP→HTTPS redirect (default: `80`; set to `0` to disable) |
+| `https_port` | `u16` | Port for TLS listener (default: `443`) |
+| `tls.mode` | `"acme"` or `"manual"` | Certificate provisioning mode |
+| `tls.acme_domain` | `String` | Domain for ACME (ACME mode only) |
+| `tls.acme_cache_dir` | `String` | ACME state cache directory |
+| `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory |
+| `tls.cert_path` | `String` | Certificate file path (manual mode only) |
+| `tls.key_path` | `String` | Private key file path (manual mode only) |
+| `log_level` | `"trace"`, `"debug"`, `"info"`, `"warn"`, `"error"` | Logging verbosity |
+| `log_format` | `"text"` or `"json"` | Log output format |
+
+**Why these are static:** Changing bind addresses, ports, or TLS mode requires
+creating new listeners and TLS configurations — operations that fundamentally
+require a restart. There's no safe way to change these at runtime.
+
+### DynamicConfig
+
+Hot-reloadable at runtime via `ArcSwap`. Changes take effect immediately for
+new requests.
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `sites` | `Vec<SiteConfig>` | Site definitions (hostname → upstream mapping) |
+| `rate_limit.requests_per_second` | `u32` | Rate limit per IP (global in Phase 1) |
+| `rate_limit.burst` | `u32` | Burst capacity (global in Phase 1) |
+| `body_limit_bytes` | `u64` | Max request body size in bytes (global in Phase 1) |
+
+**SiteConfig:**
+
+| Field | Type | Description |
+|-------|------|-------------|
+| `host` | `String` | Hostname to match (e.g., `"git.alk.dev"`) |
+| `upstream` | `String` | Upstream address (e.g., `"127.0.0.1:3000"`) |
+| `upstream_scheme` | `"http"` or `"https"` | Protocol for upstream connection (default: `"http"`) |
+
+**Why these are dynamic:** Site definitions and rate limits are per-request
+concerns. Adding a site or changing a rate limit should not require restarting
+the proxy and dropping active connections. Rate limits and body limits are
+global settings in Phase 1; per-site configuration for these may be added in
+Phase 2.
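+
+As a rough illustration of the hostname → upstream mapping, the sketch below
+resolves a `Host` header against the site list. The `find_site` and
+`upstream_url` helpers are hypothetical names, `upstream_scheme` is kept as a
+plain string for brevity, and IPv6 literal hosts are not handled.
+
+```rust
+// Mirrors the SiteConfig fields from the table above.
+struct SiteConfig {
+    host: String,
+    upstream: String,
+    upstream_scheme: String,
+}
+
+/// Finds the site whose `host` matches the Host header, ignoring ASCII case
+/// and an optional `:port` suffix. Hypothetical helper, hostnames only.
+fn find_site<'a>(sites: &'a [SiteConfig], host_header: &str) -> Option<&'a SiteConfig> {
+    let name = host_header.rsplit_once(':').map_or(host_header, |(h, _)| h);
+    sites.iter().find(|s| s.host.eq_ignore_ascii_case(name))
+}
+
+/// Builds the URL the request would be forwarded to.
+fn upstream_url(site: &SiteConfig, path_and_query: &str) -> String {
+    format!("{}://{}{}", site.upstream_scheme, site.upstream, path_and_query)
+}
+
+fn main() {
+    let sites = vec![SiteConfig {
+        host: "git.alk.dev".to_string(),
+        upstream: "127.0.0.1:3000".to_string(),
+        upstream_scheme: "http".to_string(),
+    }];
+
+    let site = find_site(&sites, "git.alk.dev:443").expect("site is configured");
+    println!("{}", upstream_url(site, "/api/v1/version"));
+
+    // Unknown hosts have no upstream and would be rejected by the handler.
+    assert!(find_site(&sites, "example.org").is_none());
+}
+```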
+
+## Config Reload
+
+### ArcSwap Pattern
+
+`DynamicConfig` is wrapped in `Arc<ArcSwap<DynamicConfig>>`. This provides:
+
+- **Lock-free reads**: Every handler reads the current config via
+  `ArcSwap::load()`, so there is no lock contention on the request hot path.
+- **Atomic writes**: `ConfigReloadHandle::reload(new_config)` swaps the entire
+  config atomically. All new requests see the new config immediately.
+- **No partial updates**: The entire config is swapped at once. There's no risk
+  of reading a half-updated config.
+
+See [ADR-008](decisions/008-static-dynamic-config-split.md) for the rationale
+behind this split.
+
+### Reload Trigger
+
+The initial implementation uses SIGHUP as the reload trigger. When the process
+receives SIGHUP:
+
+1. Re-read the config file from disk
+2. Deserialize into `DynamicConfig`
+3. Validate (checking upstream reachability is optional)
+4. Call `ConfigReloadHandle::reload(new_config)`
+
+Future implementations could add a Unix domain socket API or HTTP endpoint for
+config reload, but SIGHUP is sufficient for Phase 1.
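+
+The validate-then-swap behavior can be sketched with the standard library
+alone. Note the substitution: this example uses `RwLock<Arc<DynamicConfig>>`
+as a stand-in for `ArcSwap` so that it compiles without external crates, so
+reads here are not lock-free. The trimmed-down `DynamicConfig` and the `load`
+method are assumptions made for illustration.
+
+```rust
+use std::sync::{Arc, RwLock};
+
+#[derive(Debug)]
+struct DynamicConfig {
+    requests_per_second: u32,
+    burst: u32,
+}
+
+/// Hypothetical stand-in for the reload handle. The design above wraps
+/// `ArcSwap`; `RwLock<Arc<_>>` is used here only to stay std-only.
+struct ConfigReloadHandle {
+    current: RwLock<Arc<DynamicConfig>>,
+}
+
+impl ConfigReloadHandle {
+    fn new(initial: DynamicConfig) -> Self {
+        Self { current: RwLock::new(Arc::new(initial)) }
+    }
+
+    /// Returns a snapshot that stays valid even if a reload happens later.
+    fn load(&self) -> Arc<DynamicConfig> {
+        let guard = self.current.read().unwrap();
+        Arc::clone(&*guard)
+    }
+
+    /// Validates first, then replaces the whole config in one step.
+    fn reload(&self, new_config: DynamicConfig) -> Result<(), String> {
+        if new_config.requests_per_second == 0 {
+            return Err("rate_limit.requests_per_second must be > 0".to_string());
+        }
+        *self.current.write().unwrap() = Arc::new(new_config);
+        Ok(())
+    }
+}
+
+fn main() {
+    let handle = ConfigReloadHandle::new(DynamicConfig { requests_per_second: 10, burst: 20 });
+    let in_flight = handle.load(); // snapshot held by an in-flight request
+
+    handle.reload(DynamicConfig { requests_per_second: 50, burst: 100 }).unwrap();
+    println!("in-flight request still sees {:?}", in_flight);
+    println!("new requests see {:?}", handle.load());
+
+    // An invalid config is rejected and the active config stays in place.
+    let rejected = handle.reload(DynamicConfig { requests_per_second: 0, burst: 1 });
+    println!("reload rejected: {:?}, burst is still {}", rejected, handle.load().burst);
+}
+```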
+
+## TOML Config Format
+
+```toml
+# reverse-proxy config
+
+[server]
+bind_addr = "15.235.125.95"
+http_port = 80
+https_port = 443
+
+[server.tls]
+mode = "acme"                    # "acme" or "manual"
+acme_domain = "git.alk.dev"
+acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
+acme_directory = "production"    # "production" or "staging"
+
+# Manual mode (uncomment and comment out ACME settings)
+# mode = "manual"
+# cert_path = "/etc/letsencrypt/live/git.alk.dev/fullchain.pem"
+# key_path = "/etc/letsencrypt/live/git.alk.dev/privkey.pem"
+
+[server.logging]
+level = "info"
+format = "text"                  # "text" or "json"
+
+[rate_limit]
+requests_per_second = 10
+burst = 20
+
+[body]
+limit_bytes = 104857600          # 100 MB
+
+[[sites]]
+host = "git.alk.dev"
+upstream = "127.0.0.1:3000"
+upstream_scheme = "http"
+```
+
+### Validation
+
+On startup, the config is validated:
+
+1. `bind_addr` is not `0.0.0.0` (must be explicit)
+2. In ACME mode, `acme_domain` must be set
+3. In manual mode, `cert_path` and `key_path` must both be set and the files
+   must be readable
+4. Each site must have a `host` and `upstream`
+5. `rate_limit.requests_per_second` must be > 0
+6. `body.limit_bytes` must be > 0
+
+On SIGHUP reload, the same validation applies. If the new config fails
+validation, the reload is rejected and the old config remains active. An error
+is logged.
+
+**On startup**: If config validation fails, the process exits with a non-zero
+code and logs the validation errors. The proxy will not start with an invalid
+configuration.
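+
+A minimal sketch of the six checks, assuming a flattened `Config` struct
+rather than the real `StaticConfig`/`DynamicConfig` split. The struct, the
+`validate` and `readable` helpers, and the error strings are all hypothetical.
+The sketch also rejects `::` and unparseable addresses, which goes slightly
+beyond rule 1 as written.
+
+```rust
+use std::fs::File;
+use std::net::IpAddr;
+
+#[derive(PartialEq)]
+enum TlsMode {
+    Acme,
+    Manual,
+}
+
+struct Site {
+    host: String,
+    upstream: String,
+}
+
+// Flattened, hypothetical view of the config used only for this sketch.
+struct Config {
+    bind_addr: String,
+    tls_mode: TlsMode,
+    acme_domain: Option<String>,
+    cert_path: Option<String>,
+    key_path: Option<String>,
+    sites: Vec<Site>,
+    requests_per_second: u32,
+    body_limit_bytes: u64,
+}
+
+fn readable(path: &Option<String>) -> bool {
+    path.as_deref().map_or(false, |p| File::open(p).is_ok())
+}
+
+/// Runs checks 1-6 and returns every failure, not just the first.
+fn validate(cfg: &Config) -> Result<(), Vec<String>> {
+    let mut errors = Vec::new();
+    // Assumption: the IPv6 wildcard `::` is rejected like 0.0.0.0.
+    match cfg.bind_addr.parse::<IpAddr>() {
+        Ok(ip) if ip.is_unspecified() => errors.push("bind_addr must be explicit".to_string()),
+        Ok(_) => {}
+        Err(_) => errors.push("bind_addr is not an IP address".to_string()),
+    }
+    if cfg.tls_mode == TlsMode::Acme && cfg.acme_domain.as_deref().unwrap_or("").is_empty() {
+        errors.push("tls.acme_domain is required in ACME mode".to_string());
+    }
+    if cfg.tls_mode == TlsMode::Manual && !(readable(&cfg.cert_path) && readable(&cfg.key_path)) {
+        errors.push("tls.cert_path and tls.key_path must be set and readable".to_string());
+    }
+    if cfg.sites.iter().any(|s| s.host.is_empty() || s.upstream.is_empty()) {
+        errors.push("every site needs a host and an upstream".to_string());
+    }
+    if cfg.requests_per_second == 0 {
+        errors.push("rate_limit.requests_per_second must be > 0".to_string());
+    }
+    if cfg.body_limit_bytes == 0 {
+        errors.push("body.limit_bytes must be > 0".to_string());
+    }
+    if errors.is_empty() { Ok(()) } else { Err(errors) }
+}
+
+fn sample() -> Config {
+    Config {
+        bind_addr: "15.235.125.95".to_string(),
+        tls_mode: TlsMode::Acme,
+        acme_domain: Some("git.alk.dev".to_string()),
+        cert_path: None,
+        key_path: None,
+        sites: vec![Site { host: "git.alk.dev".to_string(), upstream: "127.0.0.1:3000".to_string() }],
+        requests_per_second: 10,
+        body_limit_bytes: 104_857_600,
+    }
+}
+
+fn main() {
+    assert!(validate(&sample()).is_ok());
+
+    let mut bad = sample();
+    bad.bind_addr = "0.0.0.0".to_string();
+    bad.requests_per_second = 0;
+    for err in validate(&bad).unwrap_err() {
+        eprintln!("config error: {err}");
+    }
+}
+```
+
+Collecting all failures before returning matches the stated behavior of
+logging the validation errors together, on startup and on reload alike.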
+
+## Design Decisions
+
+All design decisions are documented as ADRs in [decisions/](decisions/).
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
+| [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config split | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-04**: Should config reload support a Unix domain socket API in addition
+  to SIGHUP? (open)
--- a/docs/architecture/decisions/001-rust-axum.md
+++ b/docs/architecture/decisions/001-rust-axum.md
@@ -0,0 +1,61 @@
+# ADR-001: Rust with Axum
+
+## Status
+
+Accepted
+
+## Context
+
+Our current nginx 1.24.0 installation is vulnerable to multiple actively-exploited
+CVEs, most critically CVE-2026-42945 (CVSS 9.2, unauthenticated RCE via
+`ngx_http_rewrite_module`). Six of seven recent nginx CVEs are memory corruption
+bugs (buffer overflow, use-after-free, buffer overread), the exact class of
+vulnerabilities that safe Rust rules out by construction.
+
+The threat landscape is worsening: LLM-assisted fuzzing is accelerating bug
+discovery in nginx's C codebase, and security researchers report additional
+undisclosed vulnerabilities.
+
+We need to replace nginx with a memory-safe alternative that can handle:
+- TLS termination
+- HTTP reverse proxying to backend services
+- Rate limiting with fail2ban-compatible logging
+- Operational simplicity (single binary, systemd integration)
+
+## Decision
+
+Use Rust with the axum web framework for the reverse proxy implementation.
+
+**Rust** provides:
+- Memory safety by construction in safe Rust (no buffer overflows,
+  use-after-free, or double-free at runtime)
+- rustls (pure Rust TLS) avoids OpenSSL dependency and its CVE history
+- Single static binary deployment with no runtime dependencies
+- Excellent async I/O support via tokio
+
+**axum** provides:
+- Ergonomic handler definitions with extractors
+- Tower middleware ecosystem (Service trait, layers)
+- Type-safe routing and state management
+- Well-maintained, widely used, good documentation
+
+## Consequences
+
+**Positive:**
+- Eliminates the entire class of memory corruption vulnerabilities affecting
+  nginx
+- Single binary deployment simplifies operations
+- Rust's type system catches many errors at compile time
+- axum + tower provides composable middleware
+
+**Negative:**
+- Smaller ecosystem than nginx for HTTP proxy features (but our use case is
+  simple)
+- We maintain the code (vs. using a battle-tested C project)
+- Less granular control over HTTP/2 and connection pooling compared to nginx
+- Team needs Rust expertise (already available)
+
+## References
+
+- [threat-landscape.md](../../research/threat-landscape.md)
+- [overview.md](../overview.md)
--- a/docs/architecture/decisions/002-custom-proxy-handler.md
+++ b/docs/architecture/decisions/002-custom-proxy-handler.md
@@ -0,0 +1,56 @@
+# ADR-002: Custom Proxy Handler
+
+## Status
+
+Accepted
+
+## Context
+
+We need to implement HTTP reverse proxying — receiving requests and forwarding
+them to an upstream service (Gitea on localhost:3000). Two approaches are
+available:
+
+1. **`axum-reverse-proxy` crate**: Provides path-based routing, header
+   forwarding, round-robin load balancing, TLS support, retry mechanisms, and
+   RFC 9110 compliance.
+2. **Custom handler** (Felix Knorr pattern): Build a handler using hyper's
+   `Client` to forward requests. ~50-100 lines of Rust for our needs.
+
+Our use case is minimal: single upstream per domain, single domain, no load
+balancing, no retry, no HTTP/2 proxying.
+
+## Decision
+
+Implement a custom proxy handler using hyper's `Client` for request forwarding,
+following the pattern demonstrated by Felix Knorr and used in the alknet
+project's channel proxy.
+
+## Rationale
+
+- `axum-reverse-proxy` adds complexity we don't need (load balancing, retry,
+  path-based routing to multiple backends)
+- Our proxy case is the simplest possible: match a Host header, forward the
+  entire request to a single upstream, stream the response back
+- The Felix Knorr pattern is proven, idiomatic, and ~50-100 lines
+- We maintain full control over header injection, error handling, and upstream
+  connection behavior
+- If requirements grow, we can adopt `axum-reverse-proxy` later
+
+## Consequences
+
+**Positive:**
+- Minimal dependencies
+- Full control over proxy behavior
+- Easy to understand and audit (~100 lines of proxy code)
+- No unnecessary abstraction layers
+
+**Negative:**
+- We implement and maintain proxy logic ourselves (but it's trivial for our
+  use case)
+- If requirements grow to load balancing or retry, we'd need to add that
+  ourselves or switch to `axum-reverse-proxy`
+
+## References
+
+- [proxy.md](../proxy.md)
+- Felix Knorr, "Replacing nginx with axum" (felix-knorr.net/posts/2024-10-13-replacing-nginx-with-axum.html)
--- a/docs/architecture/decisions/003-toml-config.md
+++ b/docs/architecture/decisions/003-toml-config.md
@@ -0,0 +1,44 @@
+# ADR-003: TOML Configuration Format
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy needs a configuration file format for defining sites, TLS settings,
+bind addresses, and rate limits. Options include TOML, YAML, JSON, and custom
+binary formats.
+
+## Decision
+
+Use TOML as the configuration file format.
+
+## Rationale
+
+- **Rust-native**: TOML is the configuration format for Cargo (Rust's package
+  manager). The Rust ecosystem has first-class TOML support via `serde` +
+  `toml` crate.
+- **Unambiguous**: A TOML document maps unambiguously to a hash table with
+  explicitly typed values, unlike YAML, which has multiple equivalent
+  representations and surprising type coercion rules (e.g., `no` → boolean,
+  `1.0` → float).
+- **Human-friendly**: TOML is easy to read and write for simple configurations
+  like ours. It supports sections (tables), arrays, and inline tables.
+- **Good error messages**: The `toml` crate provides clear deserialization
+  error messages pointing to the exact field that failed.
+
+## Consequences
+
+**Positive:**
+- Familiar to Rust developers (Cargo.toml)
+- Clear, unambiguous syntax
+- Excellent serde integration with detailed error reporting
+- No type coercion surprises
+
+**Negative:**
+- Not as widely used for config outside Rust (but our audience is ourselves)
+- No `#include` or file composition (each config file is self-contained)
+
+## References
+
+- [config.md](../config.md)
--- a/docs/architecture/decisions/004-rustls-acme.md
+++ b/docs/architecture/decisions/004-rustls-acme.md
@@ -0,0 +1,67 @@
+# ADR-004: ACME-Primary Certificate Management
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy needs TLS certificates for HTTPS. Two approaches are available:
+
+1. **certbot (external ACME client)**: Run certbot as a cron job or systemd
+   timer to obtain and renew certificates. The proxy loads certificates from
+   files on disk. Renewal requires either SIGHUP/restart or inotify file
+   watching to pick up new certs.
+
+2. **rustls-acme (built-in ACME client)**: The proxy handles ACME
+   certificate provisioning and renewal internally as a background task. No
+   external certbot dependency. The `ResolvesServerCertAcme` cert resolver
+   automatically serves the correct certificate and updates when renewed.
+
+The alknet project has successfully implemented the rustls-acme approach, and
+its patterns are directly reusable.
+
+## Decision
+
+Use `rustls-acme` as the primary certificate management mode, with manual
+certificate paths as a fallback mode for testing, self-signed certs, and
+corporate CA environments.
+
+## Rationale
+
+- **Eliminates certbot dependency**: No external cron job, no deploy hooks, no
+  certbot package to install and maintain. The proxy is self-contained.
+- **Automatic renewal**: `rustls-acme` runs as a background tokio task that
+  handles certificate provisioning and renewal automatically (~30 days before
+  expiry).
+- **No restart needed**: When `rustls-acme` provisions a new certificate, the
+  `ResolvesServerCertAcme` resolver updates atomically. No SIGHUP, no restart,
+  no file watching.
+- **Proven pattern**: alknet uses the same approach successfully.
+- **Cache persistence**: `DirCache` persists ACME state between restarts,
+  avoiding re-provisioning.
+- **Fallback mode**: Manual cert paths are still supported for environments
+  where ACME is not possible.
+
+## Consequences
+
+**Positive:**
+- Single binary deployment (no certbot dependency)
+- Zero-downtime certificate renewal
+- Simpler operational model (no certbot cron, no deploy hooks)
+- Proven in alknet
+
+**Negative:**
+- `rustls-acme` is an additional dependency
+- ACME challenges require either port 80 (HTTP-01) or TLS-ALPN-01 on port 443,
+  which our proxy already listens on
+- Less control over certificate issuance compared to certbot (e.g., no DNS-01
+  challenge support, though rustls-acme supports TLS-ALPN-01 which is sufficient
+  for our use case)
+- Manual mode requires restart for cert changes (acceptable for fallback)
+
+## References
+
+- [tls.md](../tls.md)
+- alknet ADR-008: ACME/Let's Encrypt decision
+- `rustls-acme` crate: https://github.com/FlorianUekermann/rustls-acme
--- a/docs/architecture/decisions/005-tokio-rustls-direct.md
+++ b/docs/architecture/decisions/005-tokio-rustls-direct.md
@@ -0,0 +1,65 @@
+# ADR-005: tokio-rustls Directly, Not axum-server
+
+## Status
+
+Accepted
+
+## Context
+
+We need to serve HTTPS (TLS) traffic through axum. Two approaches exist for
+integrating TLS with axum:
+
+1. **`axum-server`**: A wrapper that provides TLS support for axum via
+   `tls_rustls` feature. Handles TCP binding, TLS accept, and passing TLS
+   streams to axum. Simple API but limited control over the TLS configuration.
+
+2. **`tokio-rustls` directly**: Bind TCP manually, perform TLS handshake with
+   `TlsAcceptor`, then serve the TLS stream to axum/hyper. More code but full
+   control over `ServerConfig`, cipher suites, ALPN protocols, and cert
+   resolvers.
+
+The alknet project uses tokio-rustls directly and has proven this pattern for
+both manual and ACME certificate management.
+
+## Decision
+
+Use `tokio-rustls` directly for TLS termination, with `hyper` serving the
+resulting TLS streams to axum. Do not use `axum-server`.
+
+## Rationale
+
+- **ACME integration**: The `rustls-acme` `ResolvesServerCertAcme` resolver
+  needs to be set as the certificate resolver on `ServerConfig` via
+  `with_cert_resolver()`. `axum-server` does not expose this level of control
+  over the `ServerConfig`.
+- **Cipher suite control**: We may need to configure cipher suites beyond the
+  defaults (see OQ-01). `axum-server` wraps the `ServerConfig` construction
+  and may not expose `CryptoProvider` configuration. Direct `tokio-rustls`
+  usage gives us full control.
+- **ALPN configuration**: ACME TLS-ALPN-01 challenge requires adding
+  `acme-tls/1` to the ALPN protocol list. This is only possible with direct
+  `ServerConfig` access.
+- **Proven pattern**: alknet uses exactly this approach (`TlsAcceptor` wrapping
+  `tokio-rustls`, with manual or ACME `ServerConfig` construction).
+- **No abstraction cost**: The code to bind TCP, accept TLS, and serve to
+  axum/hyper is ~50 lines. `axum-server` saves little for our simple case.
+
+## Consequences
+
+**Positive:**
+- Full control over TLS configuration
+- Direct `rustls-acme` integration
+- Ability to add ALPN protocols for ACME challenges
+- Proven pattern from alknet
+
+**Negative:**
+- Slightly more code than `axum-server` (~50 lines for the TLS acceptor loop)
+- Need to manage the TCP listener and TLS accept explicitly
+- Must handle the `TlsStream<TcpStream>` → `hyper::service_fn` → axum
+  integration manually (well-documented pattern from Felix Knorr's blog and
+  alknet)
+
+## References
+
+- [tls.md](../tls.md)
+- alknet transport layer (`alknet-core/src/transport/tls.rs`, `alknet-core/src/transport/acme.rs`)
--- a/docs/architecture/decisions/006-rate-limiting-approach.md
+++ b/docs/architecture/decisions/006-rate-limiting-approach.md
@@ -0,0 +1,77 @@
+# ADR-006: Token Bucket Rate Limiting with In-Memory State
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy must enforce request rate limits per client IP address, replacing
+nginx's `limit_req_zone` directive. Rate limiting is critical for preventing
+abuse and for fail2ban integration (rate-limited requests trigger fail2ban
+actions).
+
+Several rate limiting approaches exist:
+- **Token bucket**: Tokens accumulate at a fixed rate; each request consumes a
+  token. Allows short bursts up to the bucket capacity.
+- **Leaky bucket**: Requests are processed at a fixed rate; excess requests
+  queue or are rejected. No burst allowance.
+- **Fixed window**: Count requests in fixed time windows (e.g., per minute).
+  Allows burst at window boundaries.
+- **Sliding window**: Count requests in a rolling time window. More accurate
+  than fixed window but more complex.
+
+The current nginx config uses `limit_req zone=gitea_limit burst=20 nodelay`.
+nginx documents `limit_req` as a leaky bucket, but with `burst` and `nodelay`
+it behaves much like a token bucket with a burst allowance.
+
+For state storage:
+- **In-memory HashMap**: Fast, no external dependencies, lost on restart.
+- **External store (Redis, etc.)**: Shared across instances, persists across
+  restarts. Adds operational complexity.
+- **tower-governor crate**: Pre-built rate limiting middleware. Uses the
+  generic cell rate algorithm (GCRA). Adds dependency.
+
+## Decision
+
+Use a token bucket algorithm with in-memory `HashMap<IpAddr, TokenBucket>`
+state, protected by `tokio::sync::Mutex`. Rate limiting runs as axum middleware
+before the proxy handler.
+
+Rate limits are global per-IP (not per-site) in Phase 1. Per-site rate limits
+may be added in Phase 2 as the config model evolves.
+
+Stale entries in the HashMap are cleaned up periodically. A background task
+scans the HashMap at a configurable interval (default: 60 seconds) and removes
+entries that haven't been accessed within the cleanup interval.
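+
+The following is a minimal sketch of the token bucket and the stale-entry
+cleanup, not the actual middleware. The `RateLimiter` type and its method
+names are hypothetical, the clock is passed in explicitly to keep the example
+deterministic, and the `tokio::sync::Mutex` wrapper and axum integration are
+left out.
+
+```rust
+use std::collections::HashMap;
+use std::net::IpAddr;
+use std::time::{Duration, Instant};
+
+struct TokenBucket {
+    tokens: f64,
+    last_seen: Instant,
+}
+
+// Hypothetical limiter; in the proxy this state would sit behind the mutex
+// described above and be shared with the middleware.
+struct RateLimiter {
+    rate: f64,  // tokens added per second (rate_limit.requests_per_second)
+    burst: f64, // bucket capacity (rate_limit.burst)
+    buckets: HashMap<IpAddr, TokenBucket>,
+}
+
+impl RateLimiter {
+    fn new(requests_per_second: u32, burst: u32) -> Self {
+        Self {
+            rate: f64::from(requests_per_second),
+            burst: f64::from(burst),
+            buckets: HashMap::new(),
+        }
+    }
+
+    /// Refills the caller's bucket for the elapsed time, then tries to take
+    /// one token. Returns `false` when the request should get a 429.
+    fn check(&mut self, ip: IpAddr, now: Instant) -> bool {
+        let (rate, burst) = (self.rate, self.burst);
+        let bucket = self
+            .buckets
+            .entry(ip)
+            .or_insert(TokenBucket { tokens: burst, last_seen: now });
+        let elapsed = now.saturating_duration_since(bucket.last_seen).as_secs_f64();
+        bucket.tokens = (bucket.tokens + elapsed * rate).min(burst);
+        bucket.last_seen = now;
+        if bucket.tokens >= 1.0 {
+            bucket.tokens -= 1.0;
+            true
+        } else {
+            false
+        }
+    }
+
+    /// Drops buckets that have been idle for at least `max_idle`.
+    fn evict_stale(&mut self, now: Instant, max_idle: Duration) {
+        self.buckets
+            .retain(|_, b| now.saturating_duration_since(b.last_seen) < max_idle);
+    }
+}
+
+fn main() {
+    let mut limiter = RateLimiter::new(10, 20);
+    let ip: IpAddr = "203.0.113.7".parse().unwrap();
+    let t0 = Instant::now();
+
+    // 25 back-to-back requests: the first 20 drain the bucket.
+    let allowed = (0..25).filter(|_| limiter.check(ip, t0)).count();
+    println!("allowed {allowed} of 25 requests in the initial burst");
+
+    // The background task would call this on its cleanup interval.
+    limiter.evict_stale(t0 + Duration::from_secs(120), Duration::from_secs(60));
+    println!("tracked clients after cleanup: {}", limiter.buckets.len());
+}
+```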
+
+## Rationale
+
+- Token bucket closely matches the behavior of nginx's `limit_req` with
+  `burst` and `nodelay`, preserving behavioral compatibility during migration.
+- In-memory state is sufficient for a single-instance proxy (no shared state
+  needed).
+- `tokio::sync::Mutex` (not `std::sync::Mutex`) makes waiting tasks yield to
+  the async runtime instead of blocking a worker thread, and its guard can be
+  held across `.await` points if that is ever needed.
+- Custom implementation gives full control over logging output for fail2ban
+  integration (ADR-007).
+- State loss on restart is acceptable — rate limit state is inherently
+  ephemeral.
+
+## Consequences
+
+**Positive:**
+- Behavioral compatibility with nginx rate limiting
+- Full control over fail2ban log format
+- No external dependencies (Redis, etc.)
+- Simple implementation (~100 lines)
+
+**Negative:**
+- Rate limit state is lost on restart (acceptable for single-instance deploy)
+- Not suitable for multi-instance deployments without external state store
+  (Phase 1 is single-instance)
+- HashMap grows over time without eviction (mitigated by periodic cleanup)
+
+## References
+
+- [operations.md](../operations.md)
+- nginx `limit_req` documentation
--- a/docs/architecture/decisions/007-custom-log-format.md
+++ b/docs/architecture/decisions/007-custom-log-format.md
@@ -0,0 +1,67 @@
+# ADR-007: Custom Structured Log Format for Fail2ban
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy needs to produce log output that fail2ban can parse to detect and ban
+abusive IP addresses. The current nginx setup uses nginx's default log format
+with standard fail2ban filters.
+
+Options for fail2ban integration:
+- **nginx-compatible format**: Replicate nginx's log format so existing
+  fail2ban filters work unchanged. Couples us to nginx's format.
+- **Custom structured format**: Design a clean, parseable format with a
+  corresponding custom fail2ban filter. Gives us control and clarity.
+- **JSON format**: Machine-readable but harder for fail2ban regex matching.
+
+## Decision
+
+Use a custom structured log format with a corresponding custom fail2ban filter.
+
+The format for rate-limited requests:
+
+```
+RATE_LIMIT client_ip=<IP> host=<host> path=<path> status=429
+```
+
+The format for general access logs:
+
+```
+REQUEST client_ip=<IP> host=<host> method=<METHOD> path=<path> status=<code> upstream=<addr> duration_ms=<ms>
+```
+
+A corresponding fail2ban filter (`/etc/fail2ban/filter.d/reverse-proxy.conf`)
+uses regex matching on the `RATE_LIMIT` prefix and `client_ip=<HOST>` field.
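+
+To make the two formats concrete, here is a small sketch that renders both
+line types and extracts the client IP from a `RATE_LIMIT` line, roughly what
+the fail2ban filter's regex would do. The function names are hypothetical and
+real log lines would also carry whatever timestamp prefix the logger adds.
+
+```rust
+use std::net::IpAddr;
+
+/// Renders a rate-limit event in the format shown above.
+fn rate_limit_line(client_ip: IpAddr, host: &str, path: &str) -> String {
+    format!("RATE_LIMIT client_ip={} host={} path={} status=429", client_ip, host, path)
+}
+
+/// Renders a general access-log entry in the format shown above.
+fn request_line(
+    client_ip: IpAddr,
+    host: &str,
+    method: &str,
+    path: &str,
+    status: u16,
+    upstream: &str,
+    duration_ms: u64,
+) -> String {
+    format!(
+        "REQUEST client_ip={} host={} method={} path={} status={} upstream={} duration_ms={}",
+        client_ip, host, method, path, status, upstream, duration_ms
+    )
+}
+
+/// Mimics the filter: only `RATE_LIMIT` lines yield an IP to ban.
+fn banned_ip(line: &str) -> Option<IpAddr> {
+    let rest = line.strip_prefix("RATE_LIMIT client_ip=")?;
+    rest.split(' ').next()?.parse().ok()
+}
+
+fn main() {
+    let ip: IpAddr = "198.51.100.4".parse().unwrap();
+    let limited = rate_limit_line(ip, "git.alk.dev", "/user/login");
+    let served = request_line(ip, "git.alk.dev", "GET", "/", 200, "127.0.0.1:3000", 12);
+
+    println!("{limited}");
+    println!("{served}");
+    assert_eq!(banned_ip(&limited), Some(ip));
+    assert_eq!(banned_ip(&served), None);
+}
+```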
+
+## Rationale
+
+- Custom format is clear, unambiguous, and self-documenting
+- No coupling to nginx's format, which may change or include fields we don't
+  produce
+- `key=value` pairs are easy to parse with regex and easy to extend
+- The `RATE_LIMIT` prefix makes it trivial to distinguish rate-limit events
+  from other logs
+- Writing a custom fail2ban filter is straightforward (5 lines of config)
+- We control both sides (the proxy and the filter), so compatibility is
+  guaranteed
+
+## Consequences
+
+**Positive:**
+- Clean, purpose-built format
+- Easy to extend with new fields
+- No dependency on nginx log format
+- Custom fail2ban filter is simple to maintain
+
+**Negative:**
+- Cannot reuse existing nginx fail2ban filters (trivial to write our own)
+- Existing fail2ban configurations need updating (acceptable since we're
+  replacing nginx entirely)
+
+## References
+
+- [operations.md](../operations.md)
+- [open-questions.md](../open-questions.md) OQ-02 (now resolved)
--- a/docs/architecture/decisions/008-static-dynamic-config-split.md
+++ b/docs/architecture/decisions/008-static-dynamic-config-split.md
@@ -0,0 +1,76 @@
+# ADR-008: Static/Dynamic Configuration Split with ArcSwap
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy needs configuration that can be partially reloaded at runtime (site
+definitions, rate limits) without restarting the process and dropping active
+connections. However, some configuration (bind addresses, TLS mode) fundamentally
+requires creating new listeners and cannot be changed at runtime.
+
+Two approaches:
+- **Full restart for all config changes**: Simple, but requires dropping
+  active connections for every change, including trivial rate limit adjustments.
+- **Static/dynamic split**: Immutable parameters (bind address, TLS mode) in a
+  `StaticConfig` that requires restart; runtime-adjustable parameters (sites,
+  rate limits) in a `DynamicConfig` that can be atomically swapped via
+  `Arc<ArcSwap<DynamicConfig>>` without dropping connections.
+
+This pattern is proven in the alknet project, which uses the same
+`ArcSwap<DynamicConfig>` approach for auth policy, forwarding rules, and rate
+limits.
+
+## Decision
+
+Split configuration into `StaticConfig` (immutable after startup) and
+`DynamicConfig` (hot-reloadable via `ArcSwap`). The split is:
+
+**StaticConfig** (restart required):
+- Bind address, HTTP port, HTTPS port
+- TLS mode (ACME vs. manual), cert paths, ACME settings
+- Log level and format
+
+**DynamicConfig** (hot-reloadable via SIGHUP):
+- Site definitions (hostname → upstream mappings)
+- Rate limits (requests per second, burst)
+- Body size limits
+
+`ConfigReloadHandle` provides a `reload(DynamicConfig)` method that atomically
+swaps the entire config. All request handlers read `DynamicConfig` via
+`ArcSwap::load()` — a lock-free operation.
+
+## Rationale
+
+- Rate limits and site definitions change more frequently than bind addresses
+  and TLS settings. Hot-reload avoids unnecessary downtime.
+- `ArcSwap` provides lock-free reads and atomic writes — no partial updates,
+  no lock contention on the hot path.
+- Proven pattern from alknet, where it's used for auth policy, forwarding
+  rules, and rate limits.
+- SIGHUP trigger is simple, well-understood, and compatible with systemd and
+  process supervisors.
+- The entire config is swapped at once, preventing inconsistent states where
+  some sites use the old config and others use the new one.
+
+## Consequences
+
+**Positive:**
+- Zero-downtime config reload for sites and rate limits
+- Lock-free reads on the request hot path
+- Atomic config updates — no partial states
+- Proven pattern from alknet
+
+**Negative:**
+- Two config types add conceptual complexity
+- SIGHUP reload requires reading the config file from disk (need to handle
+  file read errors gracefully)
+- Must validate DynamicConfig before swapping (invalid config must not replace
+  valid config)
+
+## References
+
+- [config.md](../config.md)
+- alknet ADR-030 (static/dynamic config split)
--- a/docs/architecture/decisions/009-signal-handling.md
+++ b/docs/architecture/decisions/009-signal-handling.md
@@ -0,0 +1,62 @@
+# ADR-009: Signal Handling Strategy
+
+## Status
+
+Accepted
+
+## Context
+
+The proxy needs to handle Unix signals for:
+- **Graceful shutdown**: SIGTERM and SIGINT should stop accepting new
+  connections, drain in-flight requests, then exit.
+- **Config reload**: SIGHUP should trigger a DynamicConfig reload from disk.
+
+Two approaches for signal handling:
+- **`tokio::signal`**: Built into tokio. The portable `ctrl_c()` function
+  covers only SIGINT; SIGTERM and SIGHUP need the Unix-specific
+  `tokio::signal::unix::signal` API.
+- **`signal-hook`**: External crate. Handles all Unix signals including SIGHUP.
+  More flexible but adds a dependency.
+
+## Decision
+
+Use `signal-hook` for all signal handling. Specifically:
+- `signal-hook::flag` to set termination flags on SIGTERM/SIGINT
+- `signal-hook` to register a SIGHUP handler that triggers config reload
+
+`tokio::signal::ctrl_c()` is registered as a secondary shutdown trigger; both
+mechanisms converge on the same shutdown path. This is a belt-and-suspenders
+approach: `signal-hook` handles all signals including SIGHUP, while
+`ctrl_c()` provides a fallback for environments where signal handling may not
+be fully wired (e.g., container runtimes).
+
+Signal behavior:
+1. On SIGTERM or SIGINT: stop accepting new connections, wait up to 30 seconds
+   for in-flight requests to complete, then exit with code 0.
+2. On SIGHUP: re-read config file, validate, and swap DynamicConfig if valid.
+   Log the result.
+
+## Rationale
+
+- SIGHUP handling is required for config reload, and `tokio::signal::ctrl_c()`
+  covers only SIGINT.
+- `signal-hook` is well-maintained, widely used, and handles all Unix signals.
+- Routing every signal through one primary mechanism (`signal-hook`), with
+  `ctrl_c()` only as a fallback, is simpler and avoids edge cases.
+- `signal-hook::flag` is a minimal, safe API for signal-triggered flags.
+
+## Consequences
+
+**Positive:**
+- SIGHUP for config reload is simple and well-understood
+- Single primary signal handling mechanism for all signals
+- Compatible with systemd (SIGTERM for shutdown) and standard Unix conventions
+
+**Negative:**
+- `signal-hook` is an additional dependency (but a well-established one)
+- Signal handling requires careful coordination with the tokio runtime (async
+  signal receivers must be properly integrated)
+
+## References
+
+- [operations.md](../operations.md)
+- [config.md](../config.md)
--- a/docs/architecture/open-questions.md
+++ b/docs/architecture/open-questions.md
@@ -0,0 +1,86 @@
+---
+status: draft
+last_updated: 2026-06-11
+---
+
+# Open Questions
+
+## TLS
+
+### OQ-01: Should cipher suites be restricted beyond rustls defaults?
+
+- **Origin**: [tls.md](tls.md)
+- **Status**: open
+- **Priority**: medium
+- **Context**: Our current nginx config explicitly restricts cipher suites to
+  four ECDHE-AES-GCM suites. rustls 0.23 with `aws_lc_rs` defaults to a
+  conservative set that excludes all weak ciphers (no SHA-1, no 3DES, no RC4,
+  no CBC-mode suites, no RSA key exchange). The defaults include TLS 1.3 suites
+  which nginx also allows. Restricting further would reduce compatibility with
+  older clients; not restricting means accepting a wider (but still safe) set
+  than the current nginx config.
+- **Cross-references**: ADR-005
+
+## Logging and Monitoring
+
+### ~~OQ-02: What log format should fail2ban consume?~~
+
+- **Origin**: [operations.md](operations.md), [proxy.md](proxy.md)
+- **Status**: resolved
+- **Priority**: high
+- **Resolution**: Custom structured log format with `key=value` pairs and
+  `RATE_LIMIT` prefix. A corresponding custom fail2ban filter will be provided.
+  See ADR-007.
+- **Cross-references**: ADR-007
+
+### OQ-03: Should the health check endpoint be on a separate port?
+
+- **Origin**: [operations.md](operations.md)
+- **Status**: open
+- **Priority**: low
+- **Context**: Currently the health check is on the main HTTPS listener at
+  `/health`. Alternatives: (a) separate unencrypted port for health checks
+  (simpler for load balancers but less secure), (b) admin port with its own
+  listener (more complex but isolates operational traffic), (c) on the main
+  listener (simplest, proposed approach). For a single-server deployment behind
+  no external load balancer, the main listener is fine.
+- **Cross-references**: None
+
+## Configuration
+
+### OQ-04: Should config reload support a Unix domain socket API in addition to SIGHUP?
+
+- **Origin**: [config.md](config.md)
+- **Status**: open
+- **Priority**: low
+- **Context**: Phase 1 uses SIGHUP for config reload, which is simple and proven.
+  A Unix domain socket API would allow programmatic reload (e.g., from an admin
+  tool or CI/CD pipeline) and could return success/failure status. This adds
+  complexity and is not needed for Phase 1.
+- **Cross-references**: None
+
+## Deployment
+
+### OQ-05: Should the proxy bind to multiple addresses or just one?
+
+- **Origin**: [overview.md](overview.md)
+- **Status**: open
+- **Priority**: low
+- **Context**: Current nginx config binds to a specific IP (`15.235.125.95`).
+  The proposed config uses `bind_addr` which could be any IP. For Phase 1, the
+  config will specify a single IP address. Multi-address binding (listening on
+  multiple IPs) is not needed but could be added as an array of addresses.
+- **Cross-references**: None
+
+## Proxy
+
+### OQ-06: Should upstream timeouts be configurable per-site?
+
+- **Origin**: [proxy.md](proxy.md)
+- **Status**: open
+- **Priority**: low
+- **Context**: Phase 1 uses global defaults (5s connect timeout, 60s request
+  timeout) for all upstream connections. Per-site timeout configuration would
+  allow tuning for different upstream services (e.g., a slow database-backed
+  API vs. a fast static site). Not needed for Phase 1 with a single upstream.
+- **Cross-references**: None
--- a/docs/architecture/operations.md
+++ b/docs/architecture/operations.md
@@ -0,0 +1,250 @@
+---
+status: draft
+last_updated: 2026-06-11
+---
+
+# Operations
+
+## What It Is
+
+The operations component covers everything related to running the proxy in
+production: rate limiting, logging (fail2ban integration), health checks,
+systemd integration, and graceful shutdown.
+
+## Why It Exists
+
+A reverse proxy that can't be monitored, rate-limited, or gracefully restarted
+is not production-ready. These concerns are cross-cutting — they affect the
+proxy handler, the TLS layer, and the config system.
+
+## Rate Limiting
+
+### Requirements
+
+- Limit requests per IP address (replacing nginx's `limit_req_zone`)
+- Default: 10 requests/second with burst of 20 (matching current nginx config)
+- Configurable via DynamicConfig (no restart needed)
+- Must produce logs that fail2ban can consume
+
+### Design
+
+The rate limiter runs as axum middleware before the proxy handler. It uses a
+token bucket algorithm per client IP, approximating nginx's `limit_req`
+`burst` semantics (nginx itself implements a leaky bucket).
+
+Rate limits are global per-IP in Phase 1 (not per-site). A request from IP
+address X counts against the same bucket regardless of which site it targets.
+Per-site rate limits may be added in Phase 2.
+
+When a request exceeds the rate limit, the middleware returns `429 Too Many
+Requests` and logs the event with structured fields.
+
+### State Eviction
+
+The per-IP token bucket state grows over time as new IPs are seen. A
+background task runs at a configurable interval (default: 60 seconds) and
+removes entries that haven't been accessed within the cleanup interval. This
+prevents unbounded memory growth.
+
+### Fail2ban Integration
+
+Rate limit events are logged in a structured format that a custom fail2ban
+filter can parse. See [ADR-007](decisions/007-custom-log-format.md) for the
+format decision.
+
+The log format uses `key=value` pairs with a `RATE_LIMIT` prefix:
+
+```
+RATE_LIMIT client_ip=X.X.X.X host=Y.Z path=/W status=429
+```
+
+A corresponding fail2ban filter and jail configuration are provided as part
+of the deployment documentation.
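+
+As a rough sketch of what such a filter and jail could look like (the file
+names, jail name, and thresholds below are illustrative assumptions, not part
+of this specification; the regex assumes each log line contains the format
+shown above plus a timestamp that fail2ban can parse):
+
+```ini
+# /etc/fail2ban/filter.d/reverse-proxy-ratelimit.conf (hypothetical file name)
+[Definition]
+# <HOST> is fail2ban's built-in placeholder for the offending address
+failregex = RATE_LIMIT client_ip=<HOST> host=\S+ path=\S+ status=429
+ignoreregex =
+```
+
+```ini
+# /etc/fail2ban/jail.d/reverse-proxy.conf (hypothetical file name)
+[reverse-proxy-ratelimit]
+enabled  = true
+port     = http,https
+filter   = reverse-proxy-ratelimit
+logpath  = /var/log/reverse-proxy/access.log
+# Illustrative thresholds: ban for 10 minutes after 20 hits within 60 seconds
+findtime = 60
+maxretry = 20
+bantime  = 600
+```
+
+The regex is deliberately unanchored so that any timestamp or level prefix
+emitted by the logging layer does not prevent a match.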
+
+## Logging
+
+### Structure
+
+All logs use `tracing` with structured fields. The proxy outputs two types of
+log entries:
+
+1. **Access logs**: Every proxied request is logged at `info` level with
+   structured fields.
+
+   ```
+   REQUEST client_ip=1.2.3.4 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
+   ```
+
+2. **Event logs**: Rate limits, TLS errors, upstream failures, config reloads,
+   etc.
+
+   ```
+   RATE_LIMIT client_ip=1.2.3.4 host=git.alk.dev path=/login status=429
+   UPSTREAM_ERROR host=git.alk.dev upstream=127.0.0.1:3000 error="connection refused"
+   CONFIG_RELOAD status=success sites=1
+   ```
+
+### Output
+
+Logs are written to:
+- **stdout/stderr**: For systemd/journald integration
+- **File** (optional): For fail2ban consumption at
+  `/var/log/reverse-proxy/access.log`
+
+The `tracing-subscriber` layer configuration supports both simultaneously via
+`Layer` composition.
+
+### Log Levels
+
+| Level | Use |
+|-------|-----|
+| `error` | Unrecoverable failures (TLS handshake failure, config validation) |
+| `warn` | Rate limit exceeded, upstream unreachable, upstream timeout |
+| `info` | Access logs, config reloads, ACME events, startup/shutdown |
+| `debug` | Request/response headers, connection details |
+| `trace` | Detailed protocol-level information |
+
+Configurable via `log_level` in StaticConfig.
+
+## Health Check
+
+### Endpoint
+
+```
+GET /health → 200 OK (empty body)
+```
+
+The health check endpoint is accessible on the main HTTPS listener. It returns
+200 if the process is alive and serving requests.
+
+**Limitation**: Since `/health` is served over TLS, a TLS configuration error
+makes the endpoint unreachable instead of producing an unhealthy response, so a
+failed check cannot distinguish a TLS fault from a dead process. External
+monitoring should also check TCP connectivity to port 443 independently.
+
+### What It Checks
+
+- Process is running and the tokio runtime is responsive
+- TLS listener is accepting connections
+- Config is loaded (StaticConfig and DynamicConfig are initialized)
+
+It does **not** check upstream reachability. The health check answers "is the
+proxy process healthy?", not "is the upstream reachable?" — upstream health is
+a separate concern that would produce 502/504 responses in the proxy handler.
+
+### Future Extensions
+
+- `/health/ready` — readiness check that includes upstream reachability
+- Prometheus metrics at `/metrics`
+
+## Systemd Integration
+
+### Unit File
+
+```ini
+[Unit]
+Description=Reverse Proxy
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=notify
+NotifyAccess=all
+ExecStart=/usr/local/bin/reverse-proxy --config /etc/reverse-proxy/config.toml
+Restart=on-failure
+RestartSec=5
+
+# Security hardening
+NoNewPrivileges=yes
+ProtectSystem=strict
+ProtectHome=yes
+PrivateTmp=yes
+ReadWritePaths=/var/lib/reverse-proxy /var/log/reverse-proxy
+
+# ACME challenge cache directory
+StateDirectory=reverse-proxy
+
+[Install]
+WantedBy=multi-user.target
+```
+
+The proxy signals readiness to systemd via `sd_notify` after binding listeners
+and completing the initial configuration load.
+
+## Graceful Shutdown
+
+### Signal Handling
+
+The proxy handles three signals via `signal-hook` (see [ADR-009](decisions/009-signal-handling.md)):
+
+- **SIGTERM / SIGINT**: Graceful shutdown. Stop accepting new connections, wait
+  for in-flight requests to complete (up to a configurable timeout), then exit.
+- **SIGHUP**: Config reload. Re-read the config file, validate, and swap
+  DynamicConfig if valid.
+
+### SIGHUP for Config Reload
+
+SIGHUP triggers config reload (see [config.md](config.md) for details). The
+process does not exit on SIGHUP.
+
+### Timeout
+
+In-flight requests have a configurable shutdown timeout (default: 30 seconds).
+After the timeout, remaining connections are forcefully closed and the process
+exits.
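+
+On the systemd side, reload and shutdown could be wired up with a drop-in along
+the following lines. This is a sketch under assumptions: the drop-in path and
+the 45-second value are illustrative, chosen only so that systemd's stop
+timeout exceeds the default 30-second drain window.
+
+```ini
+# /etc/systemd/system/reverse-proxy.service.d/reload.conf (hypothetical path)
+[Service]
+# `systemctl reload reverse-proxy` sends SIGHUP, triggering a config reload
+ExecReload=/bin/kill -HUP $MAINPID
+# Leave room for the full drain window before systemd escalates to SIGKILL
+TimeoutStopSec=45
+```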
+
+## Deployment
+
+### Binary
+
+Single self-contained binary with no runtime dependencies beyond libc:
+
+```bash
+cargo build --release
+# Produces: target/release/reverse-proxy
+```
+
+The binary is self-contained — no system libraries beyond libc for DNS
+resolution. The `aws_lc_rs` crypto provider is statically linked.
+
+### Configuration
+
+```bash
+# Config file
+/etc/reverse-proxy/config.toml
+
+# ACME cache directory
+/var/lib/reverse-proxy/acme-cache/
+
+# Log directory (optional, for fail2ban)
+/var/log/reverse-proxy/
+```
+
+### CLI
+
+```bash
+reverse-proxy [OPTIONS]
+
+Options:
+  --config <PATH>     Path to config file [default: /etc/reverse-proxy/config.toml]
+  --validate          Validate config and exit
+  --help              Show help
+  --version           Show version
+```
+
+## Design Decisions
+
+All design decisions are documented as ADRs in [decisions/](decisions/).
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [001](decisions/001-rust-axum.md) | Rust with axum | Memory safety; single binary deployment |
+| [006](decisions/006-rate-limiting-approach.md) | Token bucket rate limiting | In-memory per-IP token bucket matching nginx burst semantics |
+| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
+| [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-03**: Should the health check endpoint be on a separate port? (open)
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -0,0 +1,166 @@
+---
+status: draft
+last_updated: 2026-06-11
+---
+
+# Overview
+
+## Vision
+
+A memory-safe, minimal reverse proxy that replaces our vulnerable nginx instance
+for proxying requests to backend services. The proxy terminates TLS, injects
+standard proxy headers, enforces rate limits, and forwards requests to upstream
+services — with operational feature parity for our current single-domain Gitea
+setup.
+
+## Why This Exists
+
+Our nginx 1.24.0 installation is vulnerable to multiple actively exploited
+CVEs, including CVE-2026-42945 (unauthenticated RCE via `rewrite`/`set`
+directives). The broader threat landscape is worsening: LLM-assisted fuzzing
+is accelerating bug discovery in nginx's C codebase, and security researchers
+report additional undisclosed vulnerabilities. Upgrading nginx patches known
+CVEs but does not address the structural problem — memory corruption bugs are
+endemic to C, and the discovery rate is accelerating.
+
+Rust's memory safety eliminates the entire class of buffer overflow,
+use-after-free, and double-free bugs that constitute 6 of 7 recent nginx CVEs.
+Combined with rustls (pure Rust TLS, no OpenSSL dependency), this provides a
+fundamentally safer baseline.
+
+See [threat-landscape.md](../research/threat-landscape.md) for full vulnerability
+details.
+
+## Scope
+
+### In Scope
+
+- **Phase 1**: Replace nginx for `git.alk.dev` with feature parity
+  - TLS termination with ACME (Let's Encrypt) certificate management
+  - Manual certificate paths as fallback mode
+  - HTTP → HTTPS redirect
+  - Reverse proxy to Gitea at `127.0.0.1:3000`
+  - Proxy header injection (Host, X-Real-IP, X-Forwarded-For, X-Forwarded-Proto)
+  - Request rate limiting with fail2ban-compatible logging (global per-IP; per-site in Phase 2)
+  - 100 MB body size limit (global; per-site in Phase 2)
+  - Configurable bind address (no `0.0.0.0` default)
+  - Health check endpoint
+  - Graceful shutdown (SIGTERM handling)
+  - Systemd unit file
+
+- **Phase 2**: Multi-site support
+  - SNI-based TLS routing for multiple domains
+  - Config file for site definitions
+  - Dynamic config reload (ArcSwap pattern)
+
+- **Phase 3**: Operational hardening
+  - Metrics endpoint (Prometheus-compatible)
+  - Connection limits and timeouts
+  - Log rotation
+
+### Out of Scope
+
+- HTTP/2 or HTTP/3 proxying (services that need these run their own native
+  Rust servers — e.g., `api.alk.dev`)
+- Load balancing or round-robin upstream selection
+- WebSocket proxying (can be added later if needed)
+- Static file serving
+- Access control beyond rate limiting (no auth, no IP allowlists in Phase 1)
+- CGI, SCGI, uWSGI, FastCGI
+
+## Architecture
+
+```
+                    ┌────────────────────────────────────┐
+                    │     reverse-proxy (Rust/axum)       │
+config.toml ──────► │  StaticConfig + DynamicConfig       │
+                    │  (ArcSwap for hot-reload)            │
+                    │                                      │
+bind_addr:80   ──►  │  HTTP listener → 301 redirect        │
+                    │     to HTTPS                         │
+                    │                                      │
+bind_addr:443  ──►  │  TLS listener (tokio-rustls)         │
+                    │  ├─ ACME mode: rustls-acme resolver  │
+                    │  │  (auto cert provisioning/renewal) │
+                    │  └─ Manual mode: cert/key file paths  │
+                    │                                      │
+                    │  axum router                         │
+                    │  ├─ Host-based routing                │
+                    │  ├─ Rate limiting middleware          │
+                    │  ├─ Proxy header injection            │
+                    │  ├─ Body size limit (100MB)           │
+                    │  └─ Reverse proxy handler             │
+                    │     └─ hyper Client → upstream        │
+                    │                                      │
+                    │  /health → 200 OK                    │
+                    └────────────────────────────────────┘
+```
+
+## Crate Dependencies
+
+### Core
+
+| Crate | Version | Purpose | Notes |
+|-------|---------|---------|-------|
+| `axum` | 0.8 | HTTP framework | Routing, middleware, extractors |
+| `tokio` | 1 (full) | Async runtime | Multi-threaded runtime |
+| `hyper` | 1 | HTTP protocol | Used via axum; the pooled proxy `Client` comes from the companion `hyper-util` crate |
+| `tower` | 0.5 | Middleware ecosystem | Service trait, layers |
+| `rustls` | 0.23 | TLS implementation | `aws_lc_rs` crypto provider |
+| `tokio-rustls` | 0.26 | Async TLS I/O | Wraps TCP with TLS |
+| `rustls-acme` | 0.12 | ACME client | Let's Encrypt auto-provisioning and renewal |
+
+### Supporting
+
+| Crate | Version | Purpose | Notes |
+|-------|---------|---------|-------|
+| `serde` | 1 | Serialization | TOML config deserialization |
+| `toml` | 0.8 | Config format | Declarative site definitions |
+| `arc-swap` | 1 | Atomic config swap | Lock-free DynamicConfig reload |
+| `tracing` | 0.1 | Structured logging | fail2ban-compatible output |
+| `tracing-subscriber` | 0.3 | Log output | File + journald support |
+| `rustls-pemfile` | 2 | PEM parsing | Manual cert loading |
+| `rustls-pki-types` | 1 | TLS types | CertificateDer, PrivateKeyDer |
+| `clap` | 4 | CLI arguments | Server startup options |
+| `signal-hook` | 0.3 | Signal handling | SIGTERM/SIGINT for shutdown, SIGHUP for config reload |
+
+Versions listed are minimum semver-compatible versions. The implementation
+should pin exact versions by committing `Cargo.lock`, per standard practice for
+Rust binaries.
+
+## Exports
+
+This is a single-binary deployment. There are no library exports. The product
+is the `reverse-proxy` binary plus a systemd unit file and a config file.
+
+## Dependencies on Other Projects
+
+- **alknet**: The `ArcSwap<DynamicConfig>` pattern, `tokio-rustls` TLS acceptor
+  construction, `rustls-acme` integration, and `ServerConfig` builder patterns
+  are adapted from alknet's transport and config layers. These patterns are
+  referenced as validation that the approaches work in production; all code
+  in this project is written from scratch.
+
+## Design Decisions
+
+All design decisions are documented as ADRs in [decisions/](decisions/).
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [001](decisions/001-rust-axum.md) | Rust with axum | Memory safety eliminates the bug class causing nginx CVEs; axum provides ergonomic tower integration |
+| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — axum-reverse-proxy adds unnecessary complexity |
+| [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
+| [004](decisions/004-rustls-acme.md) | ACME-primary certificate management | Eliminates certbot dependency; automatic provisioning and renewal |
+| [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly, not axum-server | Full control over TLS config, ACME resolver integration, cipher suite configuration |
+| [006](decisions/006-rate-limiting-approach.md) | Token bucket rate limiting | In-memory per-IP token bucket matching nginx burst semantics |
+| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
+| [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config with ArcSwap | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
+| [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
+- **OQ-03**: Should the health check endpoint be on a separate port? (open)
+- **OQ-05**: Should the proxy bind to multiple addresses or just one? (open)
--- a/docs/architecture/proxy.md
+++ b/docs/architecture/proxy.md
@@ -0,0 +1,169 @@
+---
+status: draft
+last_updated: 2026-06-11
+---
+
+# Proxy Handler
+
+## What It Is
+
+The proxy handler is the core component that receives an incoming HTTP request
+on the TLS-terminated connection, applies middleware (rate limiting, header
+injection, body size limits), and forwards it to the upstream service.
+
+## Why It Exists
+
+This component replaces nginx's `proxy_pass` directive. For our use case —
+single upstream per domain, no load balancing, no HTTP/2 proxying — a custom
+handler is simpler and more maintainable than a general-purpose proxy library.
+
+## Architecture
+
+```
+Incoming HTTPS request
+        │
+        ▼
+┌─────────────────┐
+│  axum Router     │
+│  (Host-based)    │─── /health → 200 OK
+│                  │
+│  match Host      │
+│  header on       │
+│  incoming req    │
+└───────┬─────────┘
+        │
+        ▼
+┌─────────────────┐
+│ Rate Limiting    │  ← tower middleware layer
+│ Middleware        │
+└───────┬─────────┘
+        │
+        ▼
+┌─────────────────┐
+│ Proxy Header     │  ← custom middleware / handler
+│ Injection        │
+│                  │
+│ X-Real-IP        │  ← connect_info remote_addr
+│ X-Forwarded-For  │  ← append to existing or set
+│ X-Forwarded-Proto │  ← "https" (or "http" on port 80)
+│ Host             │  ← original host header (already set)
+└───────┬─────────┘
+        │
+        ▼
+┌─────────────────┐
+│ Body Size Limit  │  ← DefaultBodyLimit(100 MB)
+│ Middleware        │
+└───────┬─────────┘
+        │
+        ▼
+┌─────────────────┐
+│ Reverse Proxy    │  ← hyper Client request forwarding
+│ Handler          │
+│                  │
+│ 1. Build upstream│
+│    URI from      │
+│    original req   │
+│ 2. Forward req   │
+│    to upstream    │
+│ 3. Stream        │
+│    response back  │
+└─────────────────┘
+```
+
+## Request Flow
+
+### 1. Host-Based Routing
+
+The axum router uses a `Host` extractor (provided by the `axum-extra` crate as
+of axum 0.8) to match incoming requests to site definitions from
+`DynamicConfig`. Each site definition maps a hostname to an upstream address.
+
+The handler, `host_based_proxy`, reads the `Host` header, looks up the site in
+`DynamicConfig.sites`, and either proxies to the upstream or returns 404.
+
+### 2. Proxy Header Injection
+
+Headers are injected before forwarding. The handler reads connection metadata
+from axum's `ConnectInfo` and the original request:
+
+| Header | Value Source | Notes |
+|--------|-------------|-------|
+| `Host` | Original request `Host` header | Already present; preserved as-is |
+| `X-Real-IP` | `ConnectInfo<SocketAddr>` remote IP | Set to client's IP address |
+| `X-Forwarded-For` | Client IP, appended if header exists | Comma-separated list of client and proxy addresses |
+| `X-Forwarded-Proto` | Determined by listener | `https` on port 443, `http` on port 80 |
+
+The `X-Forwarded-For` handling must append the client IP to any existing value
+(rather than replacing it), to support chained proxies correctly.
+
+### 3. Request Forwarding
+
+The proxy handler constructs a new request to the upstream:
+
+1. Build the upstream URI using the site's `upstream_scheme` and `upstream`
+   address, preserving the original path and query string
+2. Copy the request method, headers, and body from the original
+3. Inject proxy headers (X-Real-IP, X-Forwarded-For, X-Forwarded-Proto)
+4. Send the request via a shared hyper Client instance
+5. Stream the response back to the client
+
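+The hyper client is created once at startup and shared via axum's `State`.
+hyper 1.x itself exposes only connection-level APIs; the pooled `Client` lives
+in the `hyper-util` crate (`hyper_util::client::legacy::Client`). It must be
+configured with:
+- Connection pooling (the `hyper-util` client's default behavior)
+- Connect timeout: 5 seconds
+- Request timeout: 60 seconds
+- No redirect following (hyper never follows redirects on its own, and the
+  proxy must not add that behavior)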
+
+### 4. Error Handling
+
+| Upstream Condition | Response | Notes |
+|-------------------|----------|-------|
+| Upstream reachable | Stream response as-is | Headers, status, body all forwarded |
+| Upstream unreachable | 502 Bad Gateway | Logged at `warn` level |
+| Upstream timeout | 504 Gateway Timeout | Logged at `warn` level |
+| Request body too large | 413 Payload Too Large | From `DefaultBodyLimit` middleware |
+| Rate limit exceeded | 429 Too Many Requests | Logged at `warn` level |
+| Unknown Host header | 404 Not Found | No matching site definition |
+
+### 5. HTTP → HTTPS Redirect
+
+A separate HTTP listener on port 80 handles redirect. It reads the `Host`
+header from the incoming request and returns a 301 Permanent Redirect to the
+HTTPS equivalent URL (preserving the path and query string).
+
+This listener runs on the same bind address as the TLS listener but on port 80.
+
+## Upstream Connection
+
+The upstream connection scheme defaults to `http://` since the proxy and backend
+services typically run on the same host (e.g., `127.0.0.1:3000`). The
+`upstream_scheme` field in each site's configuration allows specifying `https://`
+for upstreams that require TLS (e.g., separate hosts or secure internal services).
+
+For the initial deployment (`git.alk.dev` → `127.0.0.1:3000`), the upstream
+connection uses plain HTTP, as TLS between the proxy and Gitea on loopback is
+unnecessary.
+
+## Body Size Limit
+
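+axum's `DefaultBodyLimit` layer sets the maximum request body size. For
+compatibility with Gitea's push operations (large pack files), this defaults
+to 100 MB. `DefaultBodyLimit` is enforced only by extractors that buffer the
+body (such as `Bytes`), so a handler that streams the body upstream needs a
+limit applied to the body itself (for example `tower-http`'s
+`RequestBodyLimitLayer`). In Phase 1, the body limit is a global setting;
+Phase 2 may add per-site body limits.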
+
+## Design Decisions
+
+All design decisions are documented as ADRs in [decisions/](decisions/).
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — simpler than a general proxy library |
+| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-06**: Should upstream timeouts be configurable per-site? (open — Phase 1
+  uses global defaults of 5s connect, 60s request)
--- a/docs/architecture/tls.md
+++ b/docs/architecture/tls.md
@@ -0,0 +1,220 @@
+---
+status: draft
+last_updated: 2026-06-11
+---
+
+# TLS Termination
+
+## What It Is
+
+The TLS termination component handles all aspects of encrypted connections:
+certificate provisioning (ACME and manual), TLS handshake, SNI-based certificate
+selection, and connection wrapping for the axum router.
+
+## Why It Exists
+
+TLS termination is the security boundary between the public internet and our
+upstream services. It replaces nginx's `ssl_certificate`, `ssl_protocols`, and
+`ssl_ciphers` configuration with a memory-safe Rust implementation using rustls.
+
+## Architecture
+
+```
+                    ┌──────────────────────────────────────────┐
+                    │          TLS Termination                   │
+                    │                                            │
+  bind_addr:443 ──► │  TcpListener::bind(bind_addr)             │
+                    │       │                                    │
+                    │       ▼                                    │
+                    │  tokio-rustls::TlsAcceptor                 │
+                    │       │                                    │
+                    │       ├─ ACME mode:                        │
+                    │       │  rustls-acme::ResolvesServerCertAcme │
+                    │       │  (auto-provisions & renews certs)   │
+                    │       │                                    │
+                    │       └─ Manual mode:                        │
+                    │          rustls::ServerConfig               │
+                    │          .with_single_cert(cert_chain, key) │
+                    │                                            │
+                    │       │                                    │
+                    │       ▼                                    │
+                    │  TlsStream<TcpStream>                      │
+                    │       │                                    │
+                    │       ▼                                    │
+                    │  hyper::service_fn → axum router            │
+                    └──────────────────────────────────────────┘
+
+  bind_addr:80  ──►  HTTP listener (redirect to HTTPS, no TLS)
+```
+
+## Certificate Provisioning
+
+### ACME Mode (Primary)
+
+Uses `rustls-acme` for automatic certificate provisioning and renewal through
+Let's Encrypt. This is the primary mode — no certbot dependency, no cron jobs,
+no deploy hooks.
+
+**How it works:**
+
+1. `AcmeCertProvider` configures the ACME client with the domain, cache
+   directory, and Let's Encrypt directory (staging or production).
+2. `AcmeConfig::new(vec![domain])` creates an ACME configuration for the
+   domain.
+3. The ACME state machine runs as a background tokio task, handling:
+   - Account registration with Let's Encrypt
+   - Certificate ordering
+   - TLS-ALPN-01 challenge (or HTTP-01 challenge)
+   - Certificate issuance
+   - Certificate renewal (automatic, ~30 days before expiry)
+4. `ResolvesServerCertAcme` is a rustls `ResolvesServerCert` implementation
+   that automatically serves the ACME-provisioned certificate.
+5. When a new certificate is issued, the resolver updates atomically — no
+   restart or signal handling needed.
+
+**Configuration:**
+
+```toml
+[tls]
+mode = "acme"
+acme_domain = "git.alk.dev"
+acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
+acme_directory = "production"  # or "staging" for testing
+```
+
+**Cache directory:** The `DirCache` from rustls-acme persists ACME account data,
+private keys, and certificates between restarts. This avoids re-provisioning on
+every restart.
+
+### Manual Mode (Fallback)
+
+For environments where ACME is not desired (testing, self-signed certs,
+corporate CAs, or BYO certificates), the proxy loads certificates from file
+paths at startup.
+
+```toml
+[tls]
+mode = "manual"
+cert_path = "/etc/letsencrypt/live/git.alk.dev/fullchain.pem"
+key_path = "/etc/letsencrypt/live/git.alk.dev/privkey.pem"
+```
+
+Certificate files are loaded once at startup using `rustls_pemfile`. Manual
+mode requires a restart to pick up new certificates.
+
+**Why not hot-reload manual certs?** ACME mode handles renewal automatically.
+Manual mode is for cases where you control cert rotation externally (certbot,
+manual renewal). In that case, a service restart is simpler and more reliable
+than file watching (SIGHUP reloads only `DynamicConfig`, not certificates). If
+zero-downtime cert rotation is needed, use ACME mode.
+
+## TLS Configuration
+
+### Protocol Versions
+
+The proxy supports TLS 1.2 and TLS 1.3 only, matching the minimum security
+level of the current nginx configuration. rustls implements only these two
+protocol versions and enables both by default; explicit configuration ensures
+no regression if defaults change in future rustls releases.
+
+### Cipher Suites
+
+rustls 0.23 with the `aws_lc_rs` crypto provider defaults to a conservative
+cipher suite selection that excludes all weak ciphers (no SHA-1, no 3DES, no
+RC4, no CBC-mode suites, no RSA key exchange).
+
+The current nginx config explicitly restricts to:
+
+```
+ECDHE-ECDSA-AES128-GCM-SHA256
+ECDHE-RSA-AES128-GCM-SHA256
+ECDHE-ECDSA-AES256-GCM-SHA384
+ECDHE-RSA-AES256-GCM-SHA384
+```
+
+rustls's defaults include these four suites, the two TLS 1.2 ECDHE
+ChaCha20-Poly1305 suites, and the TLS 1.3 suites (which nginx's config also
+allows via `TLSv1.3`). Every suite in the rustls default list is also supported
+by current browsers.
+
+See [open-questions.md](open-questions.md) OQ-01 for whether to further
+restrict cipher suites beyond rustls defaults.
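+
+The comparison between the nginx list and the rustls defaults can be checked
+mechanically. The sketch below hard-codes the default TLS 1.2 suite names for
+rustls 0.23 as strings; that list is transcribed by hand and should be treated
+as an assumption to verify against the pinned rustls version. The helper names
+are hypothetical.
+
+```rust
+/// Default TLS 1.2 suites in rustls 0.23 (IANA names), transcribed by hand.
+/// Assumption: verify against the pinned rustls version before relying on it.
+const RUSTLS_DEFAULT_TLS12: [&str; 6] = [
+    "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384",
+    "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256",
+    "TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256",
+    "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384",
+    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
+    "TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256",
+];
+
+/// Translate an OpenSSL-style suite name (as used in nginx's `ssl_ciphers`)
+/// into an IANA-style name. Only the `ECDHE-<auth>-<cipher>` AEAD shape seen
+/// in the nginx list is handled; anything else returns `None`.
+fn openssl_to_iana(name: &str) -> Option<String> {
+    let rest = name.strip_prefix("ECDHE-")?;
+    let (auth, cipher) = rest.split_once('-')?;
+    let cipher = cipher
+        .replace("AES128", "AES_128")
+        .replace("AES256", "AES_256")
+        .replace('-', "_");
+    Some(format!("TLS_ECDHE_{auth}_WITH_{cipher}"))
+}
+
+/// True if the nginx suite is also offered by the assumed rustls defaults.
+fn covered_by_rustls(openssl_name: &str) -> bool {
+    openssl_to_iana(openssl_name)
+        .map_or(false, |iana| RUSTLS_DEFAULT_TLS12.iter().any(|s| *s == iana))
+}
+```
+
+Under these assumptions all four nginx suites are covered, so migrating does
+not drop any TLS 1.2 suite that existing clients currently negotiate.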
+
+### ServerConfig Construction
+
+For manual mode, the `ServerConfig` is built with `with_no_client_auth()` and
+`with_single_cert()`, loading the certificate chain and private key from disk.
+
+For ACME mode, the `ServerConfig` is built with `with_cert_resolver()`, passing
+the `ResolvesServerCertAcme` resolver. The ACME TLS-ALPN-01 protocol identifier
+(`acme-tls/1`) must be registered in the `alpn_protocols` list so the server
+can respond to TLS-ALPN-01 challenges.
+
+Both modes use the `aws_lc_rs` crypto provider with safe default protocol
+versions (TLS 1.2 and TLS 1.3).
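+
+The ALPN requirement for ACME mode can be sketched without pulling in rustls.
+`ServerConfig::alpn_protocols` is a `Vec<Vec<u8>>`, so the list can be built as
+plain bytes. The `alpn_protocols` helper is hypothetical, and advertising `h2`
+and `http/1.1` to regular clients is an assumption; this document does not
+specify the application protocols.
+
+```rust
+/// ALPN identifier for ACME TLS-ALPN-01 validation (RFC 8737).
+const ACME_TLS_ALPN: &[u8] = b"acme-tls/1";
+
+/// Build the ALPN list in the shape rustls expects for
+/// `ServerConfig::alpn_protocols`.
+/// Assumption: the proxy advertises `h2` and `http/1.1` to regular clients.
+fn alpn_protocols(acme_mode: bool) -> Vec<Vec<u8>> {
+    let mut protocols = vec![b"h2".to_vec(), b"http/1.1".to_vec()];
+    if acme_mode {
+        // Only needed when the resolver must answer TLS-ALPN-01 challenges.
+        protocols.push(ACME_TLS_ALPN.to_vec());
+    }
+    protocols
+}
+```
+
+Placing `acme-tls/1` last keeps it out of the way of normal negotiation: only
+the ACME validation server offers it, and it offers nothing else.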
+
+## SNI-Based Certificate Selection
+
+### Current (Single Domain)
+
+For single-domain setups, SNI selection is trivial: there's only one
+certificate, so `with_single_cert()` or `ResolvesServerCertAcme` (which
+handles the domain) is sufficient.
+
+### Future (Multi-Domain)
+
+When multiple domains are served, SNI selection works as follows:
+
+1. **TLS handshake**: The client sends the SNI extension indicating which
+   hostname it's connecting to.
+2. **Certificate resolution**: In ACME mode, `ResolvesServerCertAcme` handles
+   this automatically — it stores certificates keyed by domain. In manual mode,
+   a custom `ResolvesServerCert` implementation maps SNI hostname to the
+   correct `CertifiedKey`.
+3. **HTTP routing**: After the TLS handshake, axum's `Host` extractor routes
+   the request to the correct site handler based on the `Host` header.
+
+This is the same pattern nginx uses: SNI selects the certificate during the TLS
+handshake, then the `Host` header selects the server block.
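+
+The manual-mode lookup in step 2 could look roughly like the sketch below. It
+models only the map behind a `ResolvesServerCert` implementation: the
+`CertifiedKey` struct here is a stand-in for `rustls::sign::CertifiedKey` so
+the example compiles on its own, and `SniMap` is a hypothetical name.
+
+```rust
+use std::collections::HashMap;
+use std::sync::Arc;
+
+/// Stand-in for `rustls::sign::CertifiedKey` so the sketch compiles without rustls.
+#[derive(Debug, PartialEq)]
+struct CertifiedKey {
+    label: &'static str,
+}
+
+/// Hypothetical manual-mode resolver state: SNI hostname -> certificate.
+struct SniMap {
+    by_host: HashMap<String, Arc<CertifiedKey>>,
+}
+
+impl SniMap {
+    fn new() -> Self {
+        Self { by_host: HashMap::new() }
+    }
+
+    fn insert(&mut self, host: &str, key: CertifiedKey) {
+        // DNS names are case-insensitive, so normalise before storing.
+        self.by_host.insert(host.to_ascii_lowercase(), Arc::new(key));
+    }
+
+    /// Mirrors what a `ResolvesServerCert::resolve` body would do with the
+    /// client's SNI value. A missing SNI or an unknown host yields `None`,
+    /// which in rustls aborts the handshake instead of serving the wrong
+    /// certificate.
+    fn resolve(&self, sni: Option<&str>) -> Option<Arc<CertifiedKey>> {
+        self.by_host.get(&sni?.to_ascii_lowercase()).cloned()
+    }
+}
+```
+
+Returning `Arc<CertifiedKey>` matches the shape rustls expects and makes each
+lookup a cheap reference-count bump.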
+
+## HTTP Listener (Port 80)
+
+The HTTP listener on port 80 is a plain TCP listener with no TLS. Its job is to
+redirect requests to their HTTPS equivalent; the only exception is ACME HTTP-01
+challenge traffic, described below.
+
+The listener binds to the same IP address as the TLS listener, but on port 80.
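+
+Building the redirect target is mostly string handling. The sketch below shows
+one way to derive the `Location` value from the `Host` header and request
+target; `https_redirect_target` is a hypothetical helper, and the handling of
+ports and IPv6 literals reflects assumptions about what the listener should
+accept.
+
+```rust
+/// Hypothetical helper: build the `Location` value for the HTTP -> HTTPS redirect.
+/// `host` is the raw `Host` header; `path_and_query` is the request target.
+fn https_redirect_target(host: &str, path_and_query: &str) -> Option<String> {
+    // Drop an explicit port such as `:80`; bracketed IPv6 literals are kept intact.
+    let hostname = match host.rfind(':') {
+        Some(idx) if !host.ends_with(']') => &host[..idx],
+        _ => host,
+    };
+    if hostname.is_empty() || !path_and_query.starts_with('/') {
+        return None; // malformed request: answer 400 rather than redirect
+    }
+    Some(format!("https://{hostname}{path_and_query}"))
+}
+```
+
+A stricter variant would ignore the client-supplied `Host` header and always
+redirect to the configured domain, which rules out open redirects.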
+
+### ACME Challenge Type
+
+The default ACME challenge type is **TLS-ALPN-01**, since the proxy already
+listens on port 443. This avoids requiring a separate HTTP-01 challenge server.
+HTTP-01 is available as a fallback for environments where TLS-ALPN-01 is not
+suitable (e.g., behind a CDN that terminates TLS). When using HTTP-01, the
+port 80 listener serves `/.well-known/acme-challenge/{token}` paths for
+challenge verification.
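+
+In HTTP-01 mode the port 80 listener has to tell challenge requests apart from
+everything else. A minimal sketch of that routing decision follows; the
+`Port80Action` enum and `route_port80` function are hypothetical, and looking
+up the key authorization for a token is left out.
+
+```rust
+const CHALLENGE_PREFIX: &str = "/.well-known/acme-challenge/";
+
+/// What the port 80 listener should do with a request path.
+#[derive(Debug, PartialEq)]
+enum Port80Action<'a> {
+    /// Serve the key authorization for this HTTP-01 token.
+    AcmeChallenge(&'a str),
+    /// Everything else is redirected to HTTPS.
+    Redirect,
+}
+
+/// Hypothetical router for the port 80 listener in HTTP-01 mode.
+fn route_port80(path: &str) -> Port80Action<'_> {
+    match path.strip_prefix(CHALLENGE_PREFIX) {
+        // ACME tokens use the base64url alphabet, so anything else is not a
+        // challenge request (this also rejects nested or empty paths).
+        Some(token)
+            if !token.is_empty()
+                && token
+                    .bytes()
+                    .all(|b| b.is_ascii_alphanumeric() || b == b'-' || b == b'_') =>
+        {
+            Port80Action::AcmeChallenge(token)
+        }
+        _ => Port80Action::Redirect,
+    }
+}
+```
+
+Validating the token's alphabet before any lookup means a path such as
+`/.well-known/acme-challenge/../etc` falls through to the redirect branch.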
+
+## Key Files and Crates
+
+| Component | Crate | Purpose |
+|-----------|-------|---------|
+| TLS acceptor | `tokio-rustls` 0.26 | Async TLS handshake over TCP streams |
+| TLS config | `rustls` 0.23 | ServerConfig, CryptoProvider, cipher suites |
+| ACME client | `rustls-acme` 0.12 | Automatic cert provisioning and renewal |
+| PEM parsing | `rustls-pemfile` 2 | Load cert/key from PEM files (manual mode) |
+| PKI types | `rustls-pki-types` 1 | CertificateDer, PrivateKeyDer |
+
+## Design Decisions
+
+All design decisions are documented as ADRs in [decisions/](decisions/).
+
+| ADR | Decision | Summary |
+|-----|----------|---------|
+| [004](decisions/004-rustls-acme.md) | ACME-primary cert management | Eliminates certbot; automatic provisioning and renewal |
+| [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly | Full control over TLS config and ACME resolver integration |
+
+## Open Questions
+
+Open questions are tracked in [open-questions.md](open-questions.md). Key
+questions affecting this document:
+
+- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)