Add architecture specification for Rust/axum reverse proxy
Phase 1 architecture docs covering proxy handler, TLS termination (ACME + manual), TOML config with static/dynamic split (ArcSwap), and operations (rate limiting, logging, health check, systemd, graceful shutdown). Nine ADRs documenting key decisions: Rust/axum, custom proxy handler, TOML config, rustls-acme for cert management, tokio-rustls direct, token bucket rate limiting, custom log format for fail2ban, static/dynamic config split, and signal handling strategy. Includes threat landscape research documenting the nginx CVEs motivating this project.
61
docs/architecture/README.md
Normal file
@@ -0,0 +1,61 @@
---
status: draft
last_updated: 2026-06-11
---

# Reverse Proxy: Architecture

## Current State

**Phase 0 (Exploration): Complete.** Phase 1 (Architecture): In progress.

This project replaces our vulnerable nginx 1.24.0 installation with a
memory-safe Rust/axum reverse proxy. The primary motivation is CVE-2026-42945
(unauthenticated RCE in nginx's rewrite module) and the broader pattern of
memory corruption bugs in nginx's C codebase.

## Architecture Documents

| Document | Status | Description |
|----------|--------|-------------|
| [overview.md](overview.md) | Draft | Vision, scope, crate dependencies, exports |
| [proxy.md](proxy.md) | Draft | Reverse proxy handler, request flow, header injection |
| [tls.md](tls.md) | Draft | TLS termination, ACME, manual certs, SNI |
| [config.md](config.md) | Draft | TOML config format, static/dynamic split, ArcSwap reload |
| [operations.md](operations.md) | Draft | Rate limiting, logging, health check, systemd, shutdown |

## ADR Table

| ADR | Title | Status |
|-----|-------|--------|
| [001](decisions/001-rust-axum.md) | Rust with Axum | Accepted |
| [002](decisions/002-custom-proxy-handler.md) | Custom Proxy Handler | Accepted |
| [003](decisions/003-toml-config.md) | TOML Configuration Format | Accepted |
| [004](decisions/004-rustls-acme.md) | ACME-Primary Certificate Management | Accepted |
| [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls Directly, Not axum-server | Accepted |
| [006](decisions/006-rate-limiting-approach.md) | Token Bucket Rate Limiting | Accepted |
| [007](decisions/007-custom-log-format.md) | Custom Structured Log Format | Accepted |
| [008](decisions/008-static-dynamic-config-split.md) | Static/Dynamic Config Split with ArcSwap | Accepted |
| [009](decisions/009-signal-handling.md) | Signal Handling Strategy | Accepted |

## Open Questions

See [open-questions.md](open-questions.md) for the full tracker.

| OQ | Question | Priority | Status |
|----|----------|----------|--------|
| OQ-01 | Should cipher suites be restricted beyond rustls defaults? | medium | open |
| ~~OQ-02~~ | ~~What log format should fail2ban consume?~~ | ~~high~~ | **resolved** (ADR-007) |
| OQ-03 | Should the health check endpoint be on a separate port? | low | open |
| OQ-04 | Config reload: SIGHUP only or also Unix socket API? | low | open |
| OQ-05 | Should the proxy bind to multiple addresses? | low | open |
| OQ-06 | Should upstream timeouts be configurable per-site? | low | open |

## Document Lifecycle

| Status | Meaning | Transitions |
|--------|---------|-------------|
| `draft` | Under active development. May change significantly. | → `reviewed` when open questions are resolved |
| `reviewed` | Architecture is final. Implementation may begin. | → `stable` when implementation is complete |
| `stable` | Locked. Changes require review and may warrant an ADR. | → `deprecated` when superseded |
| `deprecated` | Superseded. Kept for reference. | Removed when no longer referenced |
206
docs/architecture/config.md
Normal file
@@ -0,0 +1,206 @@
---
status: draft
last_updated: 2026-06-11
---

# Configuration

## What It Is

The configuration system defines the proxy's settings, how they are loaded
from a TOML file at startup, and how the dynamic subset can be reloaded
without restarting the process.

## Why It Exists

The proxy needs to be configurable without hard-coding domains, upstream
addresses, or TLS settings. The configuration system separates immutable
startup parameters (bind addresses, TLS mode) from runtime-adjustable
parameters (site definitions, rate limits) using the `ArcSwap` pattern proven
in the alknet project.

## Architecture

```
                    config.toml
                         │
                         ▼
             ┌──────────────────────┐
             │  serde::Deserialize  │
             │   (TOML → Config)    │
             └───────────┬──────────┘
           ┌─────────────┴────────────┐
           ▼                          ▼
┌──────────────────────┐   ┌──────────────────────┐
│ StaticConfig         │   │ DynamicConfig        │
│ (immutable)          │   │ (hot-reloadable)     │
│                      │   │                      │
│ bind_addr            │   │ sites[]              │
│ http_port            │   │ rate_limit           │
│ https_port           │   │ body.limit_bytes     │
│ tls.mode             │   │ proxy_headers        │
│ tls.acme_domain      │   │                      │
│ tls.cert_path        │   │ ← ArcSwap →          │
│ tls.key_path         │   │ ConfigReloadHandle   │
│ tls.acme_cache_dir   │   │ .reload(new_config)  │
│ log_level            │   │                      │
│ log_format           │   └──────────────────────┘
└──────────────────────┘
```

## Static vs Dynamic Configuration

This split follows the pattern established in alknet (ADR-030), adapted for
our simpler use case.

### StaticConfig

Immutable after startup. Changes require a process restart.

| Field | Type | Description |
|-------|------|-------------|
| `bind_addr` | `String` | IP address to bind to (e.g., `"15.235.125.95"`) |
| `http_port` | `u16` | Port for HTTP→HTTPS redirect (default: `80`; set to `0` to disable) |
| `https_port` | `u16` | Port for TLS listener (default: `443`) |
| `tls.mode` | `"acme"` or `"manual"` | Certificate provisioning mode |
| `tls.acme_domain` | `String` | Domain for ACME (ACME mode only) |
| `tls.acme_cache_dir` | `String` | ACME state cache directory |
| `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory |
| `tls.cert_path` | `String` | Certificate file path (manual mode only) |
| `tls.key_path` | `String` | Private key file path (manual mode only) |
| `log_level` | `"trace"`, `"debug"`, `"info"`, `"warn"`, `"error"` | Logging verbosity |
| `log_format` | `"text"` or `"json"` | Log output format |

**Why these are static:** Changing bind addresses, ports, or TLS mode requires
creating new listeners and TLS configurations, which this design does only at
startup. Phase 1 does not attempt to change these at runtime; a restart is the
supported path.

### DynamicConfig

Hot-reloadable at runtime via `ArcSwap`. Changes take effect immediately for
new requests.

| Field | Type | Description |
|-------|------|-------------|
| `sites` | `Vec<SiteConfig>` | Site definitions (hostname → upstream mapping) |
| `rate_limit.requests_per_second` | `u32` | Rate limit per IP (global in Phase 1) |
| `rate_limit.burst` | `u32` | Burst capacity (global in Phase 1) |
| `body.limit_bytes` | `u64` | Max request body size in bytes (global in Phase 1) |

**SiteConfig:**

| Field | Type | Description |
|-------|------|-------------|
| `host` | `String` | Hostname to match (e.g., `"git.alk.dev"`) |
| `upstream` | `String` | Upstream address (e.g., `"127.0.0.1:3000"`) |
| `upstream_scheme` | `"http"` or `"https"` | Protocol for upstream connection (default: `"http"`) |

**Why these are dynamic:** Site definitions and rate limits are per-request
concerns. Adding a site or changing a rate limit should not require restarting
the proxy and dropping active connections. Rate limits and body limits are
global settings in Phase 1; per-site configuration for these may be added in
Phase 2.

## Config Reload

### ArcSwap Pattern

`DynamicConfig` is wrapped in `Arc<ArcSwap<DynamicConfig>>`. This provides:

- **Lock-free reads**: Every handler reads the current config via
  `ArcSwap::load()`, so there is no lock contention on the request hot path.
- **Atomic writes**: `ConfigReloadHandle::reload(new_config)` swaps the entire
  config atomically. Requests that load the config after the swap see the new
  config; requests already in flight keep the snapshot they loaded.
- **No partial updates**: The entire config is swapped at once. There's no risk
  of reading a half-updated config.

See [ADR-008](decisions/008-static-dynamic-config-split.md) for the rationale
behind this split.

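As a rough illustration of the swap-the-whole-config idea, the sketch below uses only the standard library. An `RwLock<Arc<DynamicConfig>>` stands in for `ArcSwap`, so reads here take a brief read lock rather than being lock-free. The trimmed field set and the `new`/`load` method names are assumptions for illustration, not the project's actual API.

```rust
use std::sync::{Arc, RwLock};

/// Trimmed stand-in for the real `DynamicConfig` (fields are illustrative).
#[derive(Debug, Clone, PartialEq)]
pub struct DynamicConfig {
    pub requests_per_second: u32,
    pub burst: u32,
}

/// Hypothetical handle. `RwLock<Arc<_>>` substitutes for `ArcSwap` so the
/// sketch needs no external crates; unlike `ArcSwap`, reads take a lock.
#[derive(Clone)]
pub struct ConfigReloadHandle {
    current: Arc<RwLock<Arc<DynamicConfig>>>,
}

impl ConfigReloadHandle {
    pub fn new(initial: DynamicConfig) -> Self {
        Self {
            current: Arc::new(RwLock::new(Arc::new(initial))),
        }
    }

    /// Returns a snapshot. The caller keeps this exact config for the whole
    /// request, even if a reload happens in the meantime.
    pub fn load(&self) -> Arc<DynamicConfig> {
        let guard = self.current.read().expect("config lock poisoned");
        Arc::clone(&*guard)
    }

    /// Replaces the whole config in one step; nothing is updated field by field.
    pub fn reload(&self, new_config: DynamicConfig) {
        let mut guard = self.current.write().expect("config lock poisoned");
        *guard = Arc::new(new_config);
    }
}
```

A handler would call `load()` once at the start of a request and use that snapshot throughout, which is what keeps a request from ever seeing a mix of old and new values.
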
### Reload Trigger

The initial implementation uses SIGHUP as the reload trigger. When the process
receives SIGHUP, it will:

1. Re-read the config file from disk
2. Deserialize it into `DynamicConfig`
3. Validate it (checking upstream reachability is optional)
4. Call `ConfigReloadHandle::reload(new_config)`

Future implementations could add a Unix domain socket API or HTTP endpoint for
config reload, but SIGHUP is sufficient for Phase 1.

## TOML Config Format

```toml
# reverse-proxy config

[server]
bind_addr = "15.235.125.95"
http_port = 80
https_port = 443

[server.tls]
mode = "acme"  # "acme" or "manual"
acme_domain = "git.alk.dev"
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
acme_directory = "production"  # "production" or "staging"

# Manual mode (uncomment these and comment out the ACME settings above)
# mode = "manual"
# cert_path = "/etc/letsencrypt/live/git.alk.dev/fullchain.pem"
# key_path = "/etc/letsencrypt/live/git.alk.dev/privkey.pem"

[server.logging]
level = "info"
format = "text"  # "text" or "json"

[rate_limit]
requests_per_second = 10
burst = 20

[body]
limit_bytes = 104857600  # 100 MiB

[[sites]]
host = "git.alk.dev"
upstream = "127.0.0.1:3000"
upstream_scheme = "http"
```

### Validation

On startup, the config is validated against these rules:

1. `bind_addr` must not be `0.0.0.0` (the bind address must be explicit)
2. In ACME mode, `acme_domain` must be set
3. In manual mode, `cert_path` and `key_path` must both be set and the files
   must be readable
4. Each site must have a `host` and `upstream`
5. `rate_limit.requests_per_second` must be > 0
6. `body.limit_bytes` must be > 0

**On SIGHUP reload**: The same validation applies. If the new config fails
validation, the reload is rejected, the old config remains active, and an
error is logged.

**On startup**: If config validation fails, the process exits with a non-zero
code and logs the validation errors. The proxy will not start with an invalid
configuration.

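The six rules and the reject-on-failure reload behavior could be sketched as follows. This is a minimal illustration, not the project's implementation: the `Config`, `Tls`, and `Site` shapes and the `validate`/`apply_reload` names are hypothetical, the static and dynamic parts are merged into one struct for brevity, and rejecting the IPv6 wildcard `::` alongside `0.0.0.0` is an added assumption.

```rust
use std::fs::File;
use std::net::IpAddr;

/// TLS settings as they might look after deserialization (illustrative shape).
#[derive(Debug, Clone, PartialEq)]
pub enum Tls {
    Acme { acme_domain: Option<String> },
    Manual { cert_path: Option<String>, key_path: Option<String> },
}

#[derive(Debug, Clone, PartialEq)]
pub struct Site {
    pub host: String,
    pub upstream: String,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub bind_addr: String,
    pub tls: Tls,
    pub sites: Vec<Site>,
    pub requests_per_second: u32,
    pub body_limit_bytes: u64,
}

/// Checks rules 1-6 and returns every violation, so one run reports them all.
pub fn validate(cfg: &Config) -> Vec<String> {
    let mut errors = Vec::new();

    // Rule 1. Assumption: the IPv6 wildcard `::` is rejected as well.
    match cfg.bind_addr.parse::<IpAddr>() {
        Ok(ip) if ip.is_unspecified() => {
            errors.push("bind_addr must be an explicit address".to_string())
        }
        Ok(_) => {}
        Err(_) => errors.push(format!("bind_addr {:?} is not an IP address", cfg.bind_addr)),
    }

    match &cfg.tls {
        // Rule 2.
        Tls::Acme { acme_domain } => {
            if acme_domain.as_deref().map_or(true, str::is_empty) {
                errors.push("tls.acme_domain must be set in ACME mode".to_string());
            }
        }
        // Rule 3: both paths set, and both files openable for reading.
        Tls::Manual { cert_path, key_path } => {
            for (name, path) in [("tls.cert_path", cert_path), ("tls.key_path", key_path)] {
                match path.as_deref() {
                    None | Some("") => errors.push(format!("{name} must be set in manual mode")),
                    Some(p) if File::open(p).is_err() => {
                        errors.push(format!("{name} is not readable: {p}"))
                    }
                    Some(_) => {}
                }
            }
        }
    }

    // Rule 4.
    for (i, site) in cfg.sites.iter().enumerate() {
        if site.host.is_empty() || site.upstream.is_empty() {
            errors.push(format!("sites[{i}] needs both host and upstream"));
        }
    }
    // Rules 5 and 6.
    if cfg.requests_per_second == 0 {
        errors.push("rate_limit.requests_per_second must be > 0".to_string());
    }
    if cfg.body_limit_bytes == 0 {
        errors.push("body.limit_bytes must be > 0".to_string());
    }
    errors
}

/// Reload rule: a candidate that fails validation never replaces `active`.
pub fn apply_reload(active: &mut Config, candidate: Config) -> Result<(), Vec<String>> {
    let errors = validate(&candidate);
    if errors.is_empty() {
        *active = candidate;
        Ok(())
    } else {
        Err(errors)
    }
}
```

Returning all violations at once, rather than failing on the first, is a design choice of this sketch: it lets a single startup attempt or SIGHUP log everything an operator needs to fix.
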
## Design Decisions

All design decisions are documented as ADRs in [decisions/](decisions/).

| ADR | Decision | Summary |
|-----|----------|---------|
| [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
| [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config split | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-04**: Should config reload support a Unix domain socket API in addition
  to SIGHUP? (open)
61
docs/architecture/decisions/001-rust-axum.md
Normal file
@@ -0,0 +1,61 @@
# ADR-001: Rust with Axum

## Status

Accepted

## Context

Our current nginx 1.24.0 installation is vulnerable to multiple actively
exploited CVEs, most critically CVE-2026-42945 (CVSS 9.2, unauthenticated RCE
via `ngx_http_rewrite_module`). Six of seven recent nginx CVEs are memory
corruption bugs (buffer overflow, use-after-free, buffer overread), the exact
class of vulnerabilities that safe Rust rules out by construction.

The threat landscape is worsening: LLM-assisted fuzzing is accelerating bug
discovery in nginx's C codebase, and security researchers report additional
undisclosed vulnerabilities.

We need to replace nginx with a memory-safe alternative that can handle:
- TLS termination
- HTTP reverse proxying to backend services
- Rate limiting with fail2ban-compatible logging
- Operational simplicity (single binary, systemd integration)

## Decision

Use Rust with the axum web framework for the reverse proxy implementation.

**Rust** provides:
- Memory safety by construction in safe code (no buffer overflows,
  use-after-free, or double-free)
- rustls, a TLS implementation written in Rust, which avoids the OpenSSL
  dependency and its CVE history
- Single-binary deployment with no runtime dependencies beyond libc (or none
  at all when built against musl)
- Excellent async I/O support via tokio

**axum** provides:
- Ergonomic handler definitions with extractors
- Tower middleware ecosystem (Service trait, layers)
- Type-safe routing and state management
- Well-maintained, widely used, good documentation

## Consequences

**Positive:**
- Eliminates, in our own safe Rust code, the class of memory corruption
  vulnerabilities affecting nginx
- Single binary deployment simplifies operations
- Rust's type system catches many errors at compile time
- axum + tower provides composable middleware

**Negative:**
- Smaller ecosystem than nginx for HTTP proxy features (but our use case is
  simple)
- We maintain the code (vs. using a battle-tested C project)
- Less granular control over HTTP/2 and connection pooling compared to nginx
- Team needs Rust expertise (already available)

## References

- [threat-landscape.md](../../research/threat-landscape.md)
- [overview.md](../overview.md)
56
docs/architecture/decisions/002-custom-proxy-handler.md
Normal file
@@ -0,0 +1,56 @@
# ADR-002: Custom Proxy Handler

## Status

Accepted

## Context

We need to implement HTTP reverse proxying: receiving requests and forwarding
them to an upstream service (Gitea on localhost:3000). Two approaches are
available:

1. **`axum-reverse-proxy` crate**: Provides path-based routing, header
   forwarding, round-robin load balancing, TLS support, retry mechanisms, and
   RFC 9110 compliance.
2. **Custom handler** (Felix Knorr pattern): Build a handler using hyper's
   `Client` (provided by the `hyper-util` crate in hyper 1.x) to forward
   requests. ~50-100 lines of Rust for our needs.

Our use case is minimal: single upstream per domain, single domain, no load
balancing, no retry, no HTTP/2 proxying.

## Decision

Implement a custom proxy handler using hyper's `Client` for request forwarding,
following the pattern demonstrated by Felix Knorr and used in the alknet
project's channel proxy.

## Rationale

- `axum-reverse-proxy` adds complexity we don't need (load balancing, retry,
  path-based routing to multiple backends)
- Our proxy case is about as simple as it gets: match a Host header, forward
  the entire request to a single upstream, stream the response back
- The Felix Knorr pattern is proven, idiomatic, and ~50-100 lines
- We maintain full control over header injection, error handling, and upstream
  connection behavior
- If requirements grow, we can adopt `axum-reverse-proxy` later

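The "match a Host header, forward to a single upstream" step can be sketched with the standard library alone. This covers only the routing decision, not the hyper client call or header injection. The `find_site` and `upstream_uri` helpers are hypothetical names, and stripping a `:port` suffix and matching case-insensitively are assumptions about how host matching should behave.

```rust
/// Mirrors the `SiteConfig` fields described in config.md.
#[derive(Debug, Clone, PartialEq)]
pub struct SiteConfig {
    pub host: String,
    pub upstream: String,
    pub upstream_scheme: String,
}

/// Strips an optional numeric `:port` suffix from a Host header value.
fn host_without_port(host_header: &str) -> &str {
    match host_header.rsplit_once(':') {
        Some((name, port)) if !port.is_empty() && port.bytes().all(|b| b.is_ascii_digit()) => name,
        _ => host_header,
    }
}

/// Finds the site whose `host` matches the request's Host header.
/// `None` means the proxy has no site for this host and should refuse it.
pub fn find_site<'a>(sites: &'a [SiteConfig], host_header: &str) -> Option<&'a SiteConfig> {
    let wanted = host_without_port(host_header);
    sites.iter().find(|site| site.host.eq_ignore_ascii_case(wanted))
}

/// Builds the URI the forwarded request would be sent to. The original
/// path and query string are passed through unchanged.
pub fn upstream_uri(site: &SiteConfig, path_and_query: &str) -> String {
    format!("{}://{}{}", site.upstream_scheme, site.upstream, path_and_query)
}
```

In a real handler the resulting URI would replace the request's URI before the request is handed to the HTTP client; unknown hosts are rejected rather than forwarded to a default upstream.
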
## Consequences

**Positive:**
- Minimal dependencies
- Full control over proxy behavior
- Easy to understand and audit (~100 lines of proxy code)
- No unnecessary abstraction layers

**Negative:**
- We implement and maintain proxy logic ourselves (but it's trivial for our
  use case)
- If requirements grow to load balancing or retry, we'd need to add that
  ourselves or switch to `axum-reverse-proxy`

## References

- [proxy.md](../proxy.md)
- Felix Knorr, "Replacing nginx with axum" (felix-knorr.net/posts/2024-10-13-replacing-nginx-with-axum.html)
44
docs/architecture/decisions/003-toml-config.md
Normal file
@@ -0,0 +1,44 @@
# ADR-003: TOML Configuration Format

## Status

Accepted

## Context

The proxy needs a configuration file format for defining sites, TLS settings,
bind addresses, and rate limits. Options include TOML, YAML, JSON, and custom
binary formats.

## Decision

Use TOML as the configuration file format.

## Rationale

- **Rust-native**: TOML is the configuration format for Cargo (Rust's package
  manager). The Rust ecosystem has first-class TOML support via `serde` +
  `toml` crate.
- **Unambiguous**: TOML's typing is explicit (strings are always quoted, for
  example), unlike YAML, which has surprising type coercion rules (e.g.,
  `no` → boolean, `1.0` → float).
- **Human-friendly**: TOML is easy to read and write for simple configurations
  like ours. It supports sections (tables), arrays, and inline tables.
- **Good error messages**: The `toml` crate provides clear deserialization
  error messages pointing to the exact field that failed.

## Consequences

**Positive:**
- Familiar to Rust developers (Cargo.toml)
- Clear, unambiguous syntax
- Excellent serde integration with detailed error reporting
- No type coercion surprises

**Negative:**
- Not as widely used for config outside Rust (but our audience is ourselves)
- No `#include` or file composition (each config file is self-contained)

## References

- [config.md](../config.md)
67
docs/architecture/decisions/004-rustls-acme.md
Normal file
@@ -0,0 +1,67 @@
# ADR-004: ACME-Primary Certificate Management

## Status

Accepted

## Context

The proxy needs TLS certificates for HTTPS. Two approaches are available:

1. **certbot (external ACME client)**: Run certbot as a cron job or systemd
   timer to obtain and renew certificates. The proxy loads certificates from
   files on disk. Renewal requires either SIGHUP/restart or inotify file
   watching to pick up new certs.

2. **rustls-acme (built-in ACME client)**: The proxy handles ACME
   certificate provisioning and renewal internally as a background task. No
   external certbot dependency. The `ResolvesServerCertAcme` cert resolver
   automatically serves the correct certificate and updates when renewed.

The alknet project has successfully implemented the rustls-acme approach, and
its patterns are directly reusable.

## Decision

Use `rustls-acme` as the primary certificate management mode, with manual
certificate paths as a fallback mode for testing, self-signed certs, and
corporate CA environments.

## Rationale

- **Eliminates certbot dependency**: No external cron job, no deploy hooks, no
  certbot package to install and maintain. The proxy is self-contained.
- **Automatic renewal**: `rustls-acme` runs as a background tokio task that
  handles certificate provisioning and renewal automatically (~30 days before
  expiry).
- **No restart needed**: When `rustls-acme` provisions a new certificate, the
  `ResolvesServerCertAcme` resolver updates atomically. No SIGHUP, no restart,
  no file watching.
- **Proven pattern**: alknet uses the same approach successfully.
- **Cache persistence**: `DirCache` persists ACME state between restarts,
  avoiding re-provisioning.
- **Fallback mode**: Manual cert paths are still supported for environments
  where ACME is not possible.

## Consequences

**Positive:**
- Single binary deployment (no certbot dependency)
- Zero-downtime certificate renewal
- Simpler operational model (no certbot cron, no deploy hooks)
- Proven in alknet

**Negative:**
- `rustls-acme` is an additional dependency
- ACME challenges require either port 80 (HTTP-01) or TLS-ALPN-01 on port 443,
  which our proxy already listens on
- Less control over certificate issuance compared to certbot (e.g., no DNS-01
  challenge support, though rustls-acme supports TLS-ALPN-01, which is
  sufficient for our use case)
- Manual mode requires restart for cert changes (acceptable for fallback)

## References

- [tls.md](../tls.md)
- alknet ADR-008: ACME/Let's Encrypt decision
- `rustls-acme` crate: https://github.com/FlorianUekermann/rustls-acme
65
docs/architecture/decisions/005-tokio-rustls-direct.md
Normal file
@@ -0,0 +1,65 @@
# ADR-005: tokio-rustls Directly, Not axum-server

## Status

Accepted

## Context

We need to serve HTTPS (TLS) traffic through axum. Two approaches exist for
integrating TLS with axum:

1. **`axum-server`**: A wrapper that provides TLS support for axum via its
   `tls-rustls` feature. Handles TCP binding, TLS accept, and passing TLS
   streams to axum. Simple API but limited control over the TLS configuration.

2. **`tokio-rustls` directly**: Bind TCP manually, perform TLS handshake with
   `TlsAcceptor`, then serve the TLS stream to axum/hyper. More code but full
   control over `ServerConfig`, cipher suites, ALPN protocols, and cert
   resolvers.

The alknet project uses tokio-rustls directly and has proven this pattern for
both manual and ACME certificate management.

## Decision

Use `tokio-rustls` directly for TLS termination, with `hyper` serving the
resulting TLS streams to axum. Do not use `axum-server`.

## Rationale

- **ACME integration**: The `rustls-acme` `ResolvesServerCertAcme` resolver
  needs to be set as the certificate resolver on `ServerConfig` via
  `with_cert_resolver()`. `axum-server` does not expose this level of control
  over the `ServerConfig`.
- **Cipher suite control**: We may need to configure cipher suites beyond the
  defaults (see OQ-01). `axum-server` wraps the `ServerConfig` construction
  and may not expose `CryptoProvider` configuration. Direct `tokio-rustls`
  usage gives us full control.
- **ALPN configuration**: ACME TLS-ALPN-01 challenge requires adding
  `acme-tls/1` to the ALPN protocol list. This is only possible with direct
  `ServerConfig` access.
- **Proven pattern**: alknet uses exactly this approach (`TlsAcceptor` wrapping
  `tokio-rustls`, with manual or ACME `ServerConfig` construction).
- **No abstraction cost**: The code to bind TCP, accept TLS, and serve to
  axum/hyper is ~50 lines. `axum-server` saves little for our simple case.

## Consequences

**Positive:**
- Full control over TLS configuration
- Direct `rustls-acme` integration
- Ability to add ALPN protocols for ACME challenges
- Proven pattern from alknet

**Negative:**
- Slightly more code than `axum-server` (~50 lines for the TLS acceptor loop)
- Need to manage the TCP listener and TLS accept explicitly
- Must handle the `TlsStream<TcpStream>` → `hyper::service_fn` → axum
  integration manually (well-documented pattern from Felix Knorr's blog and
  alknet)

## References

- [tls.md](../tls.md)
- alknet transport layer (`alknet-core/src/transport/tls.rs`, `alknet-core/src/transport/acme.rs`)
77
docs/architecture/decisions/006-rate-limiting-approach.md
Normal file
@@ -0,0 +1,77 @@
# ADR-006: Token Bucket Rate Limiting with In-Memory State

## Status

Accepted

## Context

The proxy must enforce request rate limits per client IP address, replacing
nginx's `limit_req_zone` directive. Rate limiting is critical for preventing
abuse and for fail2ban integration (rate-limited requests trigger fail2ban
actions).

Several rate limiting approaches exist:
- **Token bucket**: Tokens accumulate at a fixed rate; each request consumes a
  token. Allows short bursts up to the bucket capacity.
- **Leaky bucket**: Requests are processed at a fixed rate; excess requests
  queue or are rejected. No burst allowance.
- **Fixed window**: Count requests in fixed time windows (e.g., per minute).
  Allows burst at window boundaries.
- **Sliding window**: Count requests in a rolling time window. More accurate
  than fixed window but more complex.

The current nginx config uses `limit_req zone=gitea_limit burst=20 nodelay`.
nginx documents `limit_req` as a leaky bucket, but with `burst` and `nodelay`
excess requests up to the burst size are served immediately, so in practice it
behaves like a token bucket with burst allowance.

For state storage:
- **In-memory HashMap**: Fast, no external dependencies, lost on restart.
- **External store (Redis, etc.)**: Shared across instances, persists across
  restarts. Adds operational complexity.
- **tower-governor crate**: Pre-built rate limiting middleware. Uses the
  generic cell rate algorithm (GCRA). Adds dependency.

## Decision

Use a token bucket algorithm with in-memory `HashMap<IpAddr, TokenBucket>`
state, protected by `tokio::sync::Mutex`. Rate limiting runs as axum middleware
before the proxy handler.

Rate limits are global per-IP (not per-site) in Phase 1. Per-site rate limits
may be added in Phase 2 as the config model evolves.

Stale entries in the HashMap are cleaned up periodically. A background task
scans the HashMap at a configurable interval (default: 60 seconds) and removes
entries that haven't been accessed within the cleanup interval.

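A minimal sketch of the per-IP token bucket and the stale-entry cleanup, assuming a plain `&mut self` API in place of the `tokio::sync::Mutex` wrapper and axum middleware. The `RateLimiter`, `check`, and `cleanup` names are hypothetical, and the current time is passed in explicitly so the arithmetic is easy to follow and test.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Per-client state: remaining tokens and when they were last topped up.
#[derive(Debug)]
struct TokenBucket {
    tokens: f64,
    last_refill: Instant,
}

/// Hypothetical limiter; in the proxy this would sit behind a mutex.
pub struct RateLimiter {
    rate_per_sec: f64,
    burst: f64,
    buckets: HashMap<IpAddr, TokenBucket>,
}

impl RateLimiter {
    pub fn new(requests_per_second: u32, burst: u32) -> Self {
        Self {
            rate_per_sec: f64::from(requests_per_second),
            burst: f64::from(burst),
            buckets: HashMap::new(),
        }
    }

    /// Returns `true` if the request from `ip` at time `now` is allowed.
    pub fn check(&mut self, ip: IpAddr, now: Instant) -> bool {
        let (rate, burst) = (self.rate_per_sec, self.burst);
        // A client seen for the first time starts with a full bucket.
        let bucket = self.buckets.entry(ip).or_insert(TokenBucket {
            tokens: burst,
            last_refill: now,
        });
        // Refill for the elapsed time, capped at the burst capacity.
        let elapsed = now.saturating_duration_since(bucket.last_refill).as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * rate).min(burst);
        bucket.last_refill = now;
        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            true
        } else {
            false
        }
    }

    /// Drops buckets not touched within `max_idle`; returns how many were removed.
    pub fn cleanup(&mut self, now: Instant, max_idle: Duration) -> usize {
        let before = self.buckets.len();
        self.buckets
            .retain(|_, b| now.saturating_duration_since(b.last_refill) <= max_idle);
        before - self.buckets.len()
    }
}
```

With `requests_per_second = 10` and `burst = 20`, a client can send 20 requests back to back, after which it earns one new request every 100 ms. In this sketch `last_refill` doubles as the last-access time, so denied requests also keep an entry alive until the client goes quiet.
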
## Rationale

- Token bucket closely matches the behavior of nginx's `limit_req` with
  `burst` and `nodelay`, keeping behavior compatible during migration.
- In-memory state is sufficient for a single-instance proxy (no shared state
  needed).
- `tokio::sync::Mutex` (not `std::sync::Mutex`) can be held safely across
  `.await` points, and a task waiting for the lock yields to the async runtime
  instead of blocking a worker thread.
- Custom implementation gives full control over logging output for fail2ban
  integration (ADR-007).
- State loss on restart is acceptable because rate limit state is inherently
  ephemeral.

## Consequences

**Positive:**
- Behavioral compatibility with nginx rate limiting
- Full control over fail2ban log format
- No external dependencies (Redis, etc.)
- Simple implementation (~100 lines)

**Negative:**
- Rate limit state is lost on restart (acceptable for single-instance deploy)
- Not suitable for multi-instance deployments without external state store
  (Phase 1 is single-instance)
- The HashMap would grow without bound if entries were never evicted
  (mitigated by periodic cleanup)

## References

- [operations.md](../operations.md)
- nginx `limit_req` documentation
67
docs/architecture/decisions/007-custom-log-format.md
Normal file
@@ -0,0 +1,67 @@
# ADR-007: Custom Structured Log Format for Fail2ban

## Status

Accepted

## Context

The proxy needs to produce log output that fail2ban can parse to detect and ban
abusive IP addresses. The current nginx setup uses nginx's default log format
with standard fail2ban filters.

Options for fail2ban integration:
- **nginx-compatible format**: Replicate nginx's log format so existing
  fail2ban filters work unchanged. Couples us to nginx's format.
- **Custom structured format**: Design a clean, parseable format with a
  corresponding custom fail2ban filter. Gives us control and clarity.
- **JSON format**: Machine-readable but harder for fail2ban regex matching.

## Decision

Use a custom structured log format with a corresponding custom fail2ban filter.

The format for rate-limited requests:

```
RATE_LIMIT client_ip=<IP> host=<host> path=<path> status=429
```

The format for general access logs:

```
REQUEST client_ip=<IP> host=<host> method=<METHOD> path=<path> status=<code> upstream=<addr> duration_ms=<ms>
```

A corresponding fail2ban filter (`/etc/fail2ban/filter.d/reverse-proxy.conf`)
uses regex matching on the `RATE_LIMIT` prefix and `client_ip=<HOST>` field.

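To make the two formats concrete, here is a small sketch of functions that emit them, plus a helper that pulls the client IP back out of a `RATE_LIMIT` line the way the fail2ban regex would. The function and struct names are hypothetical, and the sketch assumes field values contain no spaces; any timestamp or level prefix added by the logging layer is outside its scope.

```rust
use std::net::IpAddr;

/// Formats a rate-limit event exactly as in the `RATE_LIMIT` template above.
pub fn rate_limit_line(client_ip: IpAddr, host: &str, path: &str) -> String {
    format!("RATE_LIMIT client_ip={client_ip} host={host} path={path} status=429")
}

/// Fields of one access-log entry (illustrative grouping of the template's keys).
pub struct AccessRecord<'a> {
    pub client_ip: IpAddr,
    pub host: &'a str,
    pub method: &'a str,
    pub path: &'a str,
    pub status: u16,
    pub upstream: &'a str,
    pub duration_ms: u64,
}

/// Formats a general access-log entry as in the `REQUEST` template above.
pub fn request_line(r: &AccessRecord<'_>) -> String {
    format!(
        "REQUEST client_ip={} host={} method={} path={} status={} upstream={} duration_ms={}",
        r.client_ip, r.host, r.method, r.path, r.status, r.upstream, r.duration_ms
    )
}

/// Extracts the offending IP from a line containing a `RATE_LIMIT` event.
/// Returns `None` for any other line, including `REQUEST` entries.
pub fn rate_limited_ip(line: &str) -> Option<IpAddr> {
    let (_, rest) = line.split_once("RATE_LIMIT client_ip=")?;
    rest.split(' ').next()?.parse().ok()
}
```

Because `client_ip` is the first field after a fixed prefix, a filter only needs to anchor on `RATE_LIMIT client_ip=` and can ignore the remaining fields, which leaves room to append new `key=value` pairs later.
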
## Rationale

- Custom format is clear, unambiguous, and self-documenting
- No coupling to nginx's format, which may change or include fields we don't
  produce
- `key=value` pairs are easy to parse with regex and easy to extend
- The `RATE_LIMIT` prefix makes it trivial to distinguish rate-limit events
  from other logs
- Writing a custom fail2ban filter is straightforward (5 lines of config)
- We control both sides (the proxy and the filter), so compatibility is
  guaranteed

## Consequences

**Positive:**
- Clean, purpose-built format
- Easy to extend with new fields
- No dependency on nginx log format
- Custom fail2ban filter is simple to maintain

**Negative:**
- Cannot reuse existing nginx fail2ban filters (trivial to write our own)
- Existing fail2ban configurations need updating (acceptable since we're
  replacing nginx entirely)

## References

- [operations.md](../operations.md)
- [open-questions.md](../open-questions.md) OQ-02 (now resolved)
@@ -0,0 +1,76 @@
|
||||
# ADR-008: Static/Dynamic Configuration Split with ArcSwap
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The proxy needs configuration that can be partially reloaded at runtime (site
|
||||
definitions, rate limits) without restarting the process and dropping active
|
||||
connections. However, some configuration (bind addresses, TLS mode) fundamentally
|
||||
requires creating new listeners and cannot be changed at runtime.
|
||||
|
||||
Two approaches:
|
||||
- **Full restart for all config changes**: Simple, but requires dropping
|
||||
active connections for every change, including trivial rate limit adjustments.
|
||||
- **Static/dynamic split**: Immutable parameters (bind address, TLS mode) in a
|
||||
`StaticConfig` that requires restart; runtime-adjustable parameters (sites,
|
||||
rate limits) in a `DynamicConfig` that can be atomically swapped via
|
||||
`Arc<ArcSwap<DynamicConfig>>` without dropping connections.
|
||||
|
||||
This pattern is proven in the alknet project, which uses the same
|
||||
`ArcSwap<DynamicConfig>` approach for auth policy, forwarding rules, and rate
|
||||
limits.
|
||||
|
||||
## Decision
|
||||
|
||||
Split configuration into `StaticConfig` (immutable after startup) and
|
||||
`DynamicConfig` (hot-reloadable via `ArcSwap`). The split is:
|
||||
|
||||
**StaticConfig** (restart required):
|
||||
- Bind address, HTTP port, HTTPS port
|
||||
- TLS mode (ACME vs. manual), cert paths, ACME settings
|
||||
- Log level and format
|
||||
|
||||
**DynamicConfig** (hot-reloadable via SIGHUP):
|
||||
- Site definitions (hostname → upstream mappings)
|
||||
- Rate limits (requests per second, burst)
|
||||
- Body size limits
|
||||
|
||||
`ConfigReloadHandle` provides a `reload(DynamicConfig)` method that atomically
|
||||
swaps the entire config. All request handlers read `DynamicConfig` via
|
||||
`ArcSwap::load()` — a lock-free operation.
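
The validate-then-swap behavior can be sketched with the standard library alone. This example deliberately substitutes `RwLock<Arc<DynamicConfig>>` for `ArcSwap` so that it has no external dependencies; it is therefore not lock-free, and only the shape of `load()` and `reload()` carries over. The `DynamicConfig` fields and the validation rules are assumptions made for illustration.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hot-reloadable settings (field names are illustrative assumptions).
#[derive(Debug, Clone, PartialEq)]
pub struct DynamicConfig {
    /// hostname -> upstream address
    pub sites: HashMap<String, String>,
    pub rate_limit_rps: u32,
    pub rate_limit_burst: u32,
}

impl DynamicConfig {
    /// Rejects configs that must never replace a working one.
    pub fn validate(&self) -> Result<(), String> {
        if self.sites.is_empty() {
            return Err("at least one site is required".into());
        }
        if self.rate_limit_rps == 0 || self.rate_limit_burst == 0 {
            return Err("rate limit values must be non-zero".into());
        }
        Ok(())
    }
}

/// std-only stand-in for `Arc<ArcSwap<DynamicConfig>>` (uses a lock, unlike ArcSwap).
#[derive(Clone)]
pub struct ConfigReloadHandle {
    current: Arc<RwLock<Arc<DynamicConfig>>>,
}

impl ConfigReloadHandle {
    pub fn new(initial: DynamicConfig) -> Self {
        Self { current: Arc::new(RwLock::new(Arc::new(initial))) }
    }

    /// Returns a snapshot; a request keeps using it even if a reload follows.
    pub fn load(&self) -> Arc<DynamicConfig> {
        let guard = self.current.read().unwrap();
        Arc::clone(&*guard)
    }

    /// Validates first, then replaces the whole config in one step.
    pub fn reload(&self, next: DynamicConfig) -> Result<(), String> {
        next.validate()?;
        *self.current.write().unwrap() = Arc::new(next);
        Ok(())
    }
}
```

Because each request holds its own `Arc` snapshot, a reload never changes the config underneath an in-flight request, and a rejected config leaves the previous one in place.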
|
||||
|
||||
## Rationale
|
||||
|
||||
- Rate limits and site definitions change more frequently than bind addresses
|
||||
and TLS settings. Hot-reload avoids unnecessary downtime.
|
||||
- `ArcSwap` provides lock-free reads and atomic writes — no partial updates,
|
||||
no lock contention on the hot path.
|
||||
- Proven pattern from alknet, where it's used for auth policy, forwarding
|
||||
rules, and rate limits.
|
||||
- SIGHUP trigger is simple, well-understood, and compatible with systemd and
|
||||
process supervisors.
|
||||
- The entire config is swapped at once, preventing inconsistent states where
|
||||
some sites use the old config and others use the new one.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- Zero-downtime config reload for sites and rate limits
|
||||
- Lock-free reads on the request hot path
|
||||
- Atomic config updates — no partial states
|
||||
- Proven pattern from alknet
|
||||
|
||||
**Negative:**
|
||||
- Two config types add conceptual complexity
|
||||
- SIGHUP reload requires reading the config file from disk (need to handle
|
||||
file read errors gracefully)
|
||||
- Must validate DynamicConfig before swapping (invalid config must not replace
|
||||
valid config)
|
||||
|
||||
## References
|
||||
|
||||
- [config.md](../config.md)
|
||||
- alknet ADR-030 (static/dynamic config split)
|
||||
62
docs/architecture/decisions/009-signal-handling.md
Normal file
@@ -0,0 +1,62 @@
|
||||
# ADR-009: Signal Handling Strategy
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The proxy needs to handle Unix signals for:
|
||||
- **Graceful shutdown**: SIGTERM and SIGINT should stop accepting new
|
||||
connections, drain in-flight requests, then exit.
|
||||
- **Config reload**: SIGHUP should trigger a DynamicConfig reload from disk.
|
||||
|
||||
Two approaches for signal handling:
|
||||
- **`tokio::signal`**: Built into tokio. `ctrl_c()` handles SIGINT only; other
  Unix signals (SIGTERM, SIGHUP) need the separate `tokio::signal::unix` API.
|
||||
- **`signal-hook`**: External crate. Handles all Unix signals including SIGHUP.
|
||||
More flexible but adds a dependency.
|
||||
|
||||
## Decision
|
||||
|
||||
Use `signal-hook` for all signal handling. Specifically:
|
||||
- `signal-hook::flag` to set termination flags on SIGTERM/SIGINT
|
||||
- `signal-hook` to register a SIGHUP handler that triggers config reload
|
||||
|
||||
`tokio::signal::ctrl_c()` is registered as a secondary shutdown trigger; both
|
||||
mechanisms converge on the same shutdown path. This is a belt-and-suspenders
|
||||
approach: `signal-hook` handles all signals including SIGHUP, while
|
||||
`ctrl_c()` provides a fallback for environments where signal handling may not
|
||||
be fully wired (e.g., container runtimes).
|
||||
|
||||
Behavior per signal:
|
||||
1. On SIGTERM or SIGINT: stop accepting new connections, wait up to 30 seconds
|
||||
for in-flight requests to complete, then exit with code 0.
|
||||
2. On SIGHUP: re-read config file, validate, and swap DynamicConfig if valid.
|
||||
Log the result.
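
The flag-based approach can be sketched without any external crate. In this example the flags are plain `Arc<AtomicBool>` values that a test sets by hand; in the real proxy each flag would presumably be handed to `signal_hook::flag::register`, which sets it when the signal arrives. `SignalFlags`, `Action`, and `poll` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// What the main loop should do next.
#[derive(Debug, PartialEq, Eq)]
pub enum Action {
    Shutdown,
    ReloadConfig,
}

/// Flags set by signal handlers. Assumption: with `signal-hook`, a clone of each
/// flag would be registered for its signal; that wiring is not shown here.
#[derive(Clone, Default)]
pub struct SignalFlags {
    /// Set on SIGTERM or SIGINT.
    pub terminate: Arc<AtomicBool>,
    /// Set on SIGHUP.
    pub reload: Arc<AtomicBool>,
}

impl SignalFlags {
    /// Shutdown takes priority over reload. The reload flag is cleared when
    /// consumed so that one SIGHUP triggers exactly one reload.
    pub fn poll(&self) -> Option<Action> {
        if self.terminate.load(Ordering::SeqCst) {
            return Some(Action::Shutdown);
        }
        if self.reload.swap(false, Ordering::SeqCst) {
            return Some(Action::ReloadConfig);
        }
        None
    }
}
```

The terminate flag is never cleared, so every later `poll()` keeps returning `Shutdown` until the process exits.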
|
||||
|
||||
## Rationale
|
||||
|
||||
- SIGHUP handling is required for config reload, and `tokio::signal::ctrl_c()`
  covers only SIGINT.
|
||||
- `signal-hook` is well-maintained, widely used, and handles all Unix signals.
|
||||
- Routing SIGTERM, SIGINT, and SIGHUP through one primary mechanism keeps the
  handling logic in one place; the `ctrl_c()` fallback only feeds the same
  shutdown path.
|
||||
- `signal-hook::flag` is a minimal, safe API for signal-triggered flags.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- SIGHUP for config reload is simple and well-understood
|
||||
- Single signal handling mechanism for all signals
|
||||
- Compatible with systemd (SIGTERM for shutdown) and standard Unix conventions
|
||||
|
||||
**Negative:**
|
||||
- `signal-hook` is an additional dependency (but a well-established one)
|
||||
- Signal handling requires careful coordination with the tokio runtime (async
|
||||
signal receivers must be properly integrated)
|
||||
|
||||
## References
|
||||
|
||||
- [operations.md](../operations.md)
|
||||
- [config.md](../config.md)
|
||||
86
docs/architecture/open-questions.md
Normal file
@@ -0,0 +1,86 @@
|
||||
---
|
||||
status: draft
|
||||
last_updated: 2026-06-11
|
||||
---
|
||||
|
||||
# Open Questions
|
||||
|
||||
## TLS
|
||||
|
||||
### OQ-01: Should cipher suites be restricted beyond rustls defaults?
|
||||
|
||||
- **Origin**: [tls.md](tls.md)
|
||||
- **Status**: open
|
||||
- **Priority**: medium
|
||||
- **Context**: Our current nginx config explicitly restricts cipher suites to
|
||||
four ECDHE-AES-GCM suites. rustls 0.23 with `aws_lc_rs` defaults to a
|
||||
conservative set that excludes all weak ciphers (no SHA-1, no 3DES, no RC4,
|
||||
no CBC-mode suites, no RSA key exchange). The defaults include TLS 1.3 suites
|
||||
which nginx also allows. Restricting further would reduce compatibility with
|
||||
older clients; not restricting means accepting a wider (but still safe) set
|
||||
than the current nginx config.
|
||||
- **Cross-references**: ADR-005
|
||||
|
||||
## Logging and Monitoring
|
||||
|
||||
### ~~OQ-02: What log format should fail2ban consume?~~
|
||||
|
||||
- **Origin**: [operations.md](operations.md), [proxy.md](proxy.md)
|
||||
- **Status**: resolved
|
||||
- **Priority**: high
|
||||
- **Resolution**: Custom structured log format with `key=value` pairs and
|
||||
`RATE_LIMIT` prefix. A corresponding custom fail2ban filter will be provided.
|
||||
See ADR-007.
|
||||
- **Cross-references**: ADR-007
|
||||
|
||||
### OQ-03: Should the health check endpoint be on a separate port?
|
||||
|
||||
- **Origin**: [operations.md](operations.md)
|
||||
- **Status**: open
|
||||
- **Priority**: low
|
||||
- **Context**: Currently the health check is on the main HTTPS listener at
|
||||
`/health`. Alternatives: (a) separate unencrypted port for health checks
|
||||
(simpler for load balancers but less secure), (b) admin port with its own
|
||||
listener (more complex but isolates operational traffic), (c) on the main
|
||||
listener (simplest, proposed approach). For a single-server deployment behind
|
||||
no external load balancer, the main listener is fine.
|
||||
- **Cross-references**: None
|
||||
|
||||
## Configuration
|
||||
|
||||
### OQ-04: Should config reload support a Unix domain socket API in addition to SIGHUP?
|
||||
|
||||
- **Origin**: [config.md](config.md)
|
||||
- **Status**: open
|
||||
- **Priority**: low
|
||||
- **Context**: Phase 1 uses SIGHUP for config reload, which is simple and proven.
|
||||
A Unix domain socket API would allow programmatic reload (e.g., from an admin
|
||||
tool or CI/CD pipeline) and could return success/failure status. This adds
|
||||
complexity and is not needed for Phase 1.
|
||||
- **Cross-references**: None
|
||||
|
||||
## Deployment
|
||||
|
||||
### OQ-05: Should the proxy bind to multiple addresses or just one?
|
||||
|
||||
- **Origin**: [overview.md](overview.md)
|
||||
- **Status**: open
|
||||
- **Priority**: low
|
||||
- **Context**: Current nginx config binds to a specific IP (`15.235.125.95`).
|
||||
The proposed config uses `bind_addr` which could be any IP. For Phase 1, the
|
||||
config will specify a single IP address. Multi-address binding (listening on
|
||||
multiple IPs) is not needed but could be added as an array of addresses.
|
||||
- **Cross-references**: None
|
||||
|
||||
## Proxy
|
||||
|
||||
### OQ-06: Should upstream timeouts be configurable per-site?
|
||||
|
||||
- **Origin**: [proxy.md](proxy.md)
|
||||
- **Status**: open
|
||||
- **Priority**: low
|
||||
- **Context**: Phase 1 uses global defaults (5s connect timeout, 60s request
|
||||
timeout) for all upstream connections. Per-site timeout configuration would
|
||||
allow tuning for different upstream services (e.g., a slow database-backed
|
||||
API vs. a fast static site). Not needed for Phase 1 with a single upstream.
|
||||
- **Cross-references**: None
|
||||
250
docs/architecture/operations.md
Normal file
@@ -0,0 +1,250 @@
|
||||
---
|
||||
status: draft
|
||||
last_updated: 2026-06-11
|
||||
---
|
||||
|
||||
# Operations
|
||||
|
||||
## What It Is
|
||||
|
||||
The operations component covers everything related to running the proxy in
|
||||
production: rate limiting, logging (fail2ban integration), health checks,
|
||||
systemd integration, and graceful shutdown.
|
||||
|
||||
## Why It Exists
|
||||
|
||||
A reverse proxy that can't be monitored, rate-limited, or gracefully restarted
|
||||
is not production-ready. These concerns are cross-cutting — they affect the
|
||||
proxy handler, the TLS layer, and the config system.
|
||||
|
||||
## Rate Limiting
|
||||
|
||||
### Requirements
|
||||
|
||||
- Limit requests per IP address (replacing nginx's `limit_req_zone`)
|
||||
- Default: 10 requests/second with burst of 20 (matching current nginx config)
|
||||
- Configurable via DynamicConfig (no restart needed)
|
||||
- Must produce logs that fail2ban can consume
|
||||
|
||||
### Design
|
||||
|
||||
The rate limiter runs as axum middleware before the proxy handler. It uses a
|
||||
token bucket algorithm per client IP, matching nginx's `limit_req burst`
|
||||
semantics.
|
||||
|
||||
Rate limits are global per-IP in Phase 1 (not per-site). A request from IP
|
||||
address X counts against the same bucket regardless of which site it targets.
|
||||
Per-site rate limits may be added in Phase 2.
|
||||
|
||||
When a request exceeds the rate limit, the middleware returns `429 Too Many
|
||||
Requests` and logs the event with structured fields.
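
A minimal token bucket for a single client might look like the sketch below. It takes the current time as a parameter so the behavior is easy to test; the struct and method names are assumptions, and the per-IP map and middleware wiring are left out.

```rust
use std::time::Instant;

/// One client's bucket. `rate` is tokens per second, `burst` is the capacity.
#[derive(Debug)]
pub struct TokenBucket {
    rate: f64,
    burst: f64,
    tokens: f64,
    last_refill: Instant,
}

impl TokenBucket {
    /// Starts with a full bucket so a new client can use the whole burst.
    pub fn new(rate: f64, burst: f64, now: Instant) -> Self {
        Self { rate, burst, tokens: burst, last_refill: now }
    }

    /// Refills for the elapsed time, then tries to take one token.
    /// Returns `false` when the request should be answered with 429.
    pub fn try_acquire(&mut self, now: Instant) -> bool {
        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.rate).min(self.burst);
        self.last_refill = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

With the defaults from the requirements (`rate = 10.0`, `burst = 20.0`), a client can send 20 requests back to back and is then held to 10 per second as the bucket refills.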
|
||||
|
||||
### State Eviction
|
||||
|
||||
The per-IP token bucket state grows over time as new IPs are seen. A
|
||||
background task runs at a configurable interval (default: 60 seconds) and
|
||||
removes entries that haven't been accessed within the cleanup interval. This
|
||||
prevents unbounded memory growth.
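
The eviction pass itself is small. The sketch below assumes the limiter state lives in a `HashMap` keyed by client IP and that each entry records when it was last accessed; `ClientState` and `evict_idle` are hypothetical names, and the background task that calls this periodically is not shown.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Per-IP limiter state; only the last access time matters for eviction.
pub struct ClientState {
    pub last_seen: Instant,
}

/// Drops every entry not accessed within `max_idle`.
/// Returns how many entries were removed.
pub fn evict_idle(
    clients: &mut HashMap<IpAddr, ClientState>,
    now: Instant,
    max_idle: Duration,
) -> usize {
    let before = clients.len();
    clients.retain(|_, state| now.saturating_duration_since(state.last_seen) <= max_idle);
    before - clients.len()
}
```

An evicted client simply gets a fresh, full bucket on its next request. That is harmless as long as the idle window is at least as long as the time a bucket needs to refill completely.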
|
||||
|
||||
### Fail2ban Integration
|
||||
|
||||
Rate limit events are logged in a structured format that a custom fail2ban
|
||||
filter can parse. See [ADR-007](decisions/007-custom-log-format.md) for the
|
||||
format decision.
|
||||
|
||||
The log format uses `key=value` pairs with a `RATE_LIMIT` prefix:
|
||||
|
||||
```
|
||||
RATE_LIMIT client_ip=X.X.X.X host=Y.Z path=/W status=429
|
||||
```
|
||||
|
||||
A corresponding fail2ban filter and jail configuration are provided as part
|
||||
of the deployment documentation.
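
To make the filter's job concrete, the sketch below extracts the client IP from a log line the way the fail2ban regex is meant to. It is a Rust stand-in for the filter, not the filter itself. It assumes the `RATE_LIMIT` marker may be preceded by a timestamp or level prefix added by the log layer; `rate_limited_ip` is a hypothetical helper.

```rust
use std::net::IpAddr;

/// Returns the client IP if `line` is a rate-limit event, otherwise `None`.
pub fn rate_limited_ip(line: &str) -> Option<IpAddr> {
    const MARKER: &str = "RATE_LIMIT ";
    // Tolerate a timestamp/level prefix in front of the marker (assumption).
    let start = line.find(MARKER)? + MARKER.len();
    line[start..]
        .split_whitespace()
        .find_map(|field| field.strip_prefix("client_ip="))
        .and_then(|ip| ip.parse().ok())
}
```

A helper like this could also serve as a regression test for the log format: if a format change breaks the parse here, it would likely break the fail2ban filter too.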
|
||||
|
||||
## Logging
|
||||
|
||||
### Structure
|
||||
|
||||
All logs use `tracing` with structured fields. The proxy outputs two types of
|
||||
log entries:
|
||||
|
||||
1. **Access logs**: Every proxied request is logged at `info` level with
|
||||
structured fields.
|
||||
|
||||
```
|
||||
REQUEST client_ip=1.2.3.4 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
|
||||
```
|
||||
|
||||
2. **Event logs**: Rate limits, TLS errors, upstream failures, config reloads,
|
||||
etc.
|
||||
|
||||
```
|
||||
RATE_LIMIT client_ip=1.2.3.4 host=git.alk.dev path=/login status=429
|
||||
UPSTREAM_ERROR host=git.alk.dev upstream=127.0.0.1:3000 error="connection refused"
|
||||
CONFIG_RELOAD status=success sites=1
|
||||
```
|
||||
|
||||
### Output
|
||||
|
||||
Logs are written to:
|
||||
- **stdout/stderr**: For systemd/journald integration
|
||||
- **File** (optional): For fail2ban consumption at
|
||||
`/var/log/reverse-proxy/access.log`
|
||||
|
||||
The `tracing-subscriber` layer configuration supports both simultaneously via
|
||||
`Layer` composition.
|
||||
|
||||
### Log Levels
|
||||
|
||||
| Level | Use |
|-------|-----|
| `error` | Unrecoverable failures (TLS handshake failure, config validation) |
| `warn` | Rate limit exceeded, upstream unreachable, upstream timeout |
| `info` | Access logs, config reloads, ACME events, startup/shutdown |
| `debug` | Request/response headers, connection details |
| `trace` | Detailed protocol-level information |
|
||||
|
||||
Configurable via `log_level` in StaticConfig.
|
||||
|
||||
## Health Check
|
||||
|
||||
### Endpoint
|
||||
|
||||
```
|
||||
GET /health → 200 OK (empty body)
|
||||
```
|
||||
|
||||
The health check endpoint is accessible on the main HTTPS listener. It returns
|
||||
200 if the process is alive and serving requests.
|
||||
|
||||
**Limitation**: Since `/health` is served over TLS, it cannot detect TLS
|
||||
configuration errors that prevent the TLS handshake from completing. External
|
||||
monitoring should also check TCP connectivity to port 443 independently.
|
||||
|
||||
### What It Checks
|
||||
|
||||
- Process is running and the tokio runtime is responsive
|
||||
- TLS listener is accepting connections
|
||||
- Config is loaded (StaticConfig and DynamicConfig are initialized)
|
||||
|
||||
It does **not** check upstream reachability. The health check answers "is the
|
||||
proxy process healthy?", not "is the upstream reachable?" — upstream health is
|
||||
a separate concern that would produce 502/504 responses in the proxy handler.
|
||||
|
||||
### Future Extensions
|
||||
|
||||
- `/health/ready` — readiness check that includes upstream reachability
|
||||
- Prometheus metrics at `/metrics`
|
||||
|
||||
## Systemd Integration
|
||||
|
||||
### Unit File
|
||||
|
||||
```ini
[Unit]
Description=Reverse Proxy
After=network-online.target
Wants=network-online.target

[Service]
Type=notify
NotifyAccess=all
ExecStart=/usr/local/bin/reverse-proxy --config /etc/reverse-proxy/config.toml
Restart=on-failure
RestartSec=5

# Security hardening
NoNewPrivileges=yes
ProtectSystem=strict
ProtectHome=yes
PrivateTmp=yes
ReadWritePaths=/var/lib/reverse-proxy /var/log/reverse-proxy

# State directory (/var/lib/reverse-proxy); holds the ACME cache
StateDirectory=reverse-proxy

[Install]
WantedBy=multi-user.target
```
|
||||
|
||||
The proxy signals readiness to systemd via `sd_notify` after binding listeners
|
||||
and completing the initial configuration load.
|
||||
|
||||
## Graceful Shutdown
|
||||
|
||||
### Signal Handling
|
||||
|
||||
The proxy handles three signals via `signal-hook` (see [ADR-009](decisions/009-signal-handling.md)):
|
||||
|
||||
- **SIGTERM / SIGINT**: Graceful shutdown. Stop accepting new connections, wait
|
||||
for in-flight requests to complete (up to a configurable timeout), then exit.
|
||||
- **SIGHUP**: Config reload. Re-read the config file, validate, and swap
|
||||
DynamicConfig if valid.
|
||||
|
||||
### SIGHUP for Config Reload
|
||||
|
||||
SIGHUP triggers config reload (see [config.md](config.md) for details). The
|
||||
process does not exit on SIGHUP.
|
||||
|
||||
### Timeout
|
||||
|
||||
In-flight requests have a configurable shutdown timeout (default: 30 seconds).
|
||||
After the timeout, remaining connections are forcefully closed and the process
|
||||
exits.
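
The timeout logic amounts to waiting on an in-flight counter with a deadline. The sketch below is a blocking, std-only version for illustration; the real proxy would do the equivalent asynchronously on the tokio runtime. `wait_for_drain` and the polling interval are assumptions.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;
use std::time::{Duration, Instant};

/// Blocks until `in_flight` reaches zero or `timeout` elapses.
/// Returns `true` if all requests drained in time.
pub fn wait_for_drain(in_flight: &AtomicUsize, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if in_flight.load(Ordering::SeqCst) == 0 {
            return true;
        }
        if Instant::now() >= deadline {
            // The caller force-closes the remaining connections.
            return false;
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

The return value lets the caller log whether shutdown was clean or whether connections had to be cut off at the deadline.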
|
||||
|
||||
## Deployment
|
||||
|
||||
### Binary
|
||||
|
||||
Single static binary, no runtime dependencies:
|
||||
|
||||
```bash
|
||||
cargo build --release
|
||||
# Produces: target/release/reverse-proxy
|
||||
```
|
||||
|
||||
The binary is self-contained — no system libraries beyond libc for DNS
|
||||
resolution. The `aws_lc_rs` crypto provider is statically linked.
|
||||
|
||||
### Configuration
|
||||
|
||||
```bash
|
||||
# Config file
|
||||
/etc/reverse-proxy/config.toml
|
||||
|
||||
# ACME cache directory
|
||||
/var/lib/reverse-proxy/acme-cache/
|
||||
|
||||
# Log directory (optional, for fail2ban)
|
||||
/var/log/reverse-proxy/
|
||||
```
|
||||
|
||||
### CLI
|
||||
|
||||
```bash
reverse-proxy [OPTIONS]

Options:
  --config <PATH>  Path to config file [default: /etc/reverse-proxy/config.toml]
  --validate       Validate config and exit
  --help           Show help
  --version        Show version
```
|
||||
|
||||
## Design Decisions
|
||||
|
||||
All design decisions are documented as ADRs in [decisions/](decisions/).
|
||||
|
||||
| ADR | Decision | Summary |
|-----|----------|---------|
| [001](decisions/001-rust-axum.md) | Rust with axum | Memory safety; single binary deployment |
| [006](decisions/006-rate-limiting-approach.md) | Token bucket rate limiting | In-memory per-IP token bucket matching nginx burst semantics |
| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
| [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP |
|
||||
|
||||
## Open Questions
|
||||
|
||||
Open questions are tracked in [open-questions.md](open-questions.md). Key
|
||||
questions affecting this document:
|
||||
|
||||
- **OQ-03**: Should the health check endpoint be on a separate port? (open)
|
||||
166
docs/architecture/overview.md
Normal file
@@ -0,0 +1,166 @@
|
||||
---
|
||||
status: draft
|
||||
last_updated: 2026-06-11
|
||||
---
|
||||
|
||||
# Overview
|
||||
|
||||
## Vision
|
||||
|
||||
A memory-safe, minimal reverse proxy that replaces our vulnerable nginx instance
for proxying requests to backend services. The proxy terminates TLS, injects
|
||||
standard proxy headers, enforces rate limits, and forwards requests to upstream
|
||||
services — with operational feature parity for our current single-domain Gitea
|
||||
setup.
|
||||
|
||||
## Why This Exists
|
||||
|
||||
Our nginx 1.24.0 installation is vulnerable to multiple actively-exploited
|
||||
CVEs, including CVE-2026-42945 (unauthenticated RCE via `rewrite`/`set`
|
||||
directives). The broader threat landscape is worsening: LLM-assisted fuzzing
|
||||
is accelerating bug discovery in nginx's C codebase, and security researchers
|
||||
report additional undisclosed vulnerabilities. Upgrading nginx patches known
|
||||
CVEs but does not address the structural problem — memory corruption bugs are
|
||||
endemic to C, and the discovery rate is accelerating.
|
||||
|
||||
Rust's memory safety eliminates the entire class of buffer overflow,
|
||||
use-after-free, and double-free bugs that constitute 6 of 7 recent nginx CVEs.
|
||||
Combined with rustls (pure Rust TLS, no OpenSSL dependency), this provides a
|
||||
fundamentally safer baseline.
|
||||
|
||||
See [threat-landscape.md](../research/threat-landscape.md) for full vulnerability
|
||||
details.
|
||||
|
||||
## Scope
|
||||
|
||||
### In Scope
|
||||
|
||||
- **Phase 1**: Replace nginx for `git.alk.dev` with feature parity
|
||||
- TLS termination with ACME (Let's Encrypt) certificate management
|
||||
- Manual certificate paths as fallback mode
|
||||
- HTTP → HTTPS redirect
|
||||
- Reverse proxy to Gitea at `127.0.0.1:3000`
|
||||
- Proxy header injection (Host, X-Real-IP, X-Forwarded-For, X-Forwarded-Proto)
|
||||
- Request rate limiting with fail2ban-compatible logging (global per-IP; per-site in Phase 2)
|
||||
- 100 MB body size limit (global; per-site in Phase 2)
|
||||
- Configurable bind address (no `0.0.0.0` default)
|
||||
- Health check endpoint
|
||||
- Graceful shutdown (SIGTERM handling)
|
||||
- Systemd unit file
|
||||
|
||||
- **Phase 2**: Multi-site support
|
||||
- SNI-based TLS routing for multiple domains
|
||||
- Config file for site definitions
|
||||
- Dynamic config reload (ArcSwap pattern)
|
||||
|
||||
- **Phase 3**: Operational hardening
|
||||
- Metrics endpoint (Prometheus-compatible)
|
||||
- Connection limits and timeouts
|
||||
- Log rotation
|
||||
|
||||
### Out of Scope
|
||||
|
||||
- HTTP/2 or HTTP/3 proxying (services that need these run their own native
|
||||
Rust servers — e.g., `api.alk.dev`)
|
||||
- Load balancing or round-robin upstream selection
|
||||
- WebSocket proxying (can be added later if needed)
|
||||
- Static file serving
|
||||
- Access control beyond rate limiting (no auth, no IP allowlists in Phase 1)
|
||||
- CGI, SCGI, uWSGI, FastCGI
|
||||
|
||||
## Architecture
|
||||
|
||||
```
                    ┌────────────────────────────────────────┐
                    │ reverse-proxy (Rust/axum)              │
config.toml ──────► │ StaticConfig + DynamicConfig           │
                    │ (ArcSwap for hot-reload)               │
                    │                                        │
bind_addr:80 ─────► │ HTTP listener → 301 redirect           │
                    │   to HTTPS                             │
                    │                                        │
bind_addr:443 ────► │ TLS listener (tokio-rustls)            │
                    │ ├─ ACME mode: rustls-acme resolver     │
                    │ │  (auto cert provisioning/renewal)    │
                    │ └─ Manual mode: cert/key file paths    │
                    │                                        │
                    │ axum router                            │
                    │ ├─ Host-based routing                  │
                    │ ├─ Rate limiting middleware            │
                    │ ├─ Proxy header injection              │
                    │ ├─ Body size limit (100MB)             │
                    │ └─ Reverse proxy handler               │
                    │    └─ hyper Client → upstream          │
                    │                                        │
                    │ /health → 200 OK                       │
                    └────────────────────────────────────────┘
```
|
||||
|
||||
## Crate Dependencies
|
||||
|
||||
### Core
|
||||
|
||||
| Crate | Version | Purpose | Notes |
|-------|---------|---------|-------|
| `axum` | 0.8 | HTTP framework | Routing, middleware, extractors |
| `tokio` | 1 (full) | Async runtime | Multi-threaded runtime |
| `hyper` | 1 | HTTP protocol | Used via axum; in hyper 1.x the pooled proxy `Client` is provided by the companion `hyper-util` crate |
| `tower` | 0.5 | Middleware ecosystem | Service trait, layers |
| `rustls` | 0.23 | TLS implementation | `aws_lc_rs` crypto provider |
| `tokio-rustls` | 0.26 | Async TLS I/O | Wraps TCP with TLS |
| `rustls-acme` | 0.12 | ACME client | Let's Encrypt auto-provisioning and renewal |
|
||||
|
||||
### Supporting
|
||||
|
||||
| Crate | Version | Purpose | Notes |
|-------|---------|---------|-------|
| `serde` | 1 | Serialization | TOML config deserialization |
| `toml` | 0.8 | Config format | Declarative site definitions |
| `arc-swap` | 1 | Atomic config swap | Lock-free DynamicConfig reload |
| `tracing` | 0.1 | Structured logging | fail2ban-compatible output |
| `tracing-subscriber` | 0.3 | Log output | File + journald support |
| `rustls-pemfile` | 2 | PEM parsing | Manual cert loading |
| `rustls-pki-types` | 1 | TLS types | CertificateDer, PrivateKeyDer |
| `clap` | 4 | CLI arguments | Server startup options |
| `signal-hook` | 0.3 | Signal handling | SIGTERM/SIGINT for shutdown, SIGHUP for config reload |
|
||||
|
||||
Versions listed are minimum semver-compatible versions. Exact versions are
pinned by committing `Cargo.lock`, per standard practice for Rust binaries.
|
||||
|
||||
## Exports
|
||||
|
||||
This is a single-binary deployment. There are no library exports. The product
|
||||
is the `reverse-proxy` binary plus a systemd unit file and a config file.
|
||||
|
||||
## Dependencies on Other Projects
|
||||
|
||||
- **alknet**: The `ArcSwap<DynamicConfig>` pattern, `tokio-rustls` TLS acceptor
|
||||
construction, `rustls-acme` integration, and `ServerConfig` builder patterns
|
||||
are adapted from alknet's transport and config layers. These patterns are
|
||||
referenced as validation that the approaches work in production; all code
|
||||
in this project is written from scratch.
|
||||
|
||||
## Design Decisions
|
||||
|
||||
All design decisions are documented as ADRs in [decisions/](decisions/).
|
||||
|
||||
| ADR | Decision | Summary |
|-----|----------|---------|
| [001](decisions/001-rust-axum.md) | Rust with axum | Memory safety eliminates the bug class causing nginx CVEs; axum provides ergonomic tower integration |
| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — axum-reverse-proxy adds unnecessary complexity |
| [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
| [004](decisions/004-rustls-acme.md) | ACME-primary certificate management | Eliminates certbot dependency; automatic provisioning and renewal |
| [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly, not axum-server | Full control over TLS config, ACME resolver integration, cipher suite configuration |
| [006](decisions/006-rate-limiting-approach.md) | Token bucket rate limiting | In-memory per-IP token bucket matching nginx burst semantics |
| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
| [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config with ArcSwap | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
| [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP |
|
||||
|
||||
## Open Questions
|
||||
|
||||
Open questions are tracked in [open-questions.md](open-questions.md). Key
|
||||
questions affecting this document:
|
||||
|
||||
- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
|
||||
- **OQ-03**: Should the health check endpoint be on a separate port? (open)
|
||||
- **OQ-05**: Should the proxy bind to multiple addresses or just one? (open)
|
||||
169
docs/architecture/proxy.md
Normal file
@@ -0,0 +1,169 @@
|
||||
---
|
||||
status: draft
|
||||
last_updated: 2026-06-11
|
||||
---
|
||||
|
||||
# Proxy Handler
|
||||
|
||||
## What It Is
|
||||
|
||||
The proxy handler is the core component that receives an incoming HTTP request
|
||||
on the TLS-terminated connection, applies middleware (rate limiting, header
|
||||
injection, body size limits), and forwards it to the upstream service.
|
||||
|
||||
## Why It Exists
|
||||
|
||||
This component replaces nginx's `proxy_pass` directive. For our use case —
|
||||
single upstream per domain, no load balancing, no HTTP/2 proxying — a custom
|
||||
handler is simpler and more maintainable than a general-purpose proxy library.
|
||||
|
||||
## Architecture
|
||||
|
||||
```
Incoming HTTPS request
          │
          ▼
┌───────────────────┐
│ axum Router       │
│ (Host-based)      │─── /health → 200 OK
│                   │
│ match Host        │
│ header on         │
│ incoming req      │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Rate Limiting     │ ← tower middleware layer
│ Middleware        │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Proxy Header      │ ← custom middleware / handler
│ Injection         │
│                   │
│ X-Real-IP         │ ← connect_info remote_addr
│ X-Forwarded-For   │ ← append to existing or set
│ X-Forwarded-Proto │ ← "https" (or "http" on port 80)
│ Host              │ ← original host header (already set)
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Body Size Limit   │ ← DefaultBodyLimit(100 MB)
│ Middleware        │
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│ Reverse Proxy     │ ← hyper Client request forwarding
│ Handler           │
│                   │
│ 1. Build upstream │
│    URI from       │
│    original req   │
│ 2. Forward req    │
│    to upstream    │
│ 3. Stream         │
│    response back  │
└───────────────────┘
```
|
||||
|
||||
## Request Flow
|
||||
|
||||
### 1. Host-Based Routing
|
||||
|
||||
The axum router uses a `Host` extractor to match incoming requests to site
|
||||
definitions from `DynamicConfig`. Each site definition maps a hostname to an
|
||||
upstream address.
|
||||
|
||||
The `host_based_proxy` handler reads the `Host` header, looks up the site in
`DynamicConfig.sites`, and either proxies to the upstream or returns 404.
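
The lookup step can be sketched as a pure function over the site map. The example assumes `sites` maps lowercase hostnames to upstream addresses and that a port in the `Host` header is ignored for matching; `normalize_host` and `lookup_upstream` are hypothetical helpers, not part of axum.

```rust
use std::collections::HashMap;

/// Normalizes a `Host` header value: strips an optional `:port` and lowercases.
pub fn normalize_host(host_header: &str) -> String {
    let host = host_header.trim();
    // Only treat the suffix as a port if it is purely numeric, so a bracketed
    // IPv6 literal without a port is left untouched.
    let without_port = match host.rsplit_once(':') {
        Some((name, port)) if !port.is_empty() && port.bytes().all(|b| b.is_ascii_digit()) => name,
        _ => host,
    };
    without_port.to_ascii_lowercase()
}

/// Returns the upstream for the request's host, or `None` (answered with 404).
pub fn lookup_upstream<'a>(
    sites: &'a HashMap<String, String>,
    host_header: &str,
) -> Option<&'a str> {
    sites.get(&normalize_host(host_header)).map(String::as_str)
}
```

Hostnames are case-insensitive, so normalizing before the lookup avoids spurious 404s for requests such as `Host: Git.Alk.Dev:443`.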
|
||||
|
||||
### 2. Proxy Header Injection
|
||||
|
||||
Headers are injected before forwarding. The handler reads connection metadata
|
||||
from axum's `ConnectInfo` and the original request:
|
||||
|
||||
| Header | Value Source | Notes |
|--------|--------------|-------|
| `Host` | Original request `Host` header | Already present; preserved as-is |
| `X-Real-IP` | `ConnectInfo<SocketAddr>` remote IP | Set to client's IP address |
| `X-Forwarded-For` | Client IP, appended if header exists | Comma-separated list of proxies |
| `X-Forwarded-Proto` | Determined by listener | `https` on port 443, `http` on port 80 |
|
||||
|
||||
The `X-Forwarded-For` handling must append the client IP to any existing value
|
||||
(rather than replacing it), to support chained proxies correctly.
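
The append rule is easy to get wrong, so here is a small sketch of it. `forwarded_for` is a hypothetical helper that takes the incoming header value (if any) and the connection's client IP and returns the value to send upstream.

```rust
use std::net::IpAddr;

/// Builds the outgoing `X-Forwarded-For` value.
/// An existing non-empty value is kept and the client IP is appended to it.
pub fn forwarded_for(existing: Option<&str>, client_ip: IpAddr) -> String {
    match existing.map(str::trim).filter(|value| !value.is_empty()) {
        Some(previous) => format!("{previous}, {client_ip}"),
        None => client_ip.to_string(),
    }
}
```

Since this proxy faces the internet directly, any incoming `X-Forwarded-For` value is client-controlled. Upstream services should treat only the last entry, the one appended here, as trustworthy.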
|
||||
|
||||
### 3. Request Forwarding
|
||||
|
||||
The proxy handler constructs a new request to the upstream:
|
||||
|
||||
1. Build the upstream URI using the site's `upstream_scheme` and `upstream`
|
||||
address, preserving the original path and query string
|
||||
2. Copy the request method, headers, and body from the original
|
||||
3. Inject proxy headers (X-Real-IP, X-Forwarded-For, X-Forwarded-Proto)
|
||||
4. Send the request via a shared hyper Client instance
|
||||
5. Stream the response back to the client
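
Step 1 above can be sketched as a string-level helper. It assumes `upstream_scheme` may be written either as `http` or as `http://` in the config, and that the original path and query arrive as one string; `upstream_uri` is a hypothetical name, and real code would build a typed URI instead of a `String`.

```rust
/// Builds the upstream URI from the site's scheme and address plus the
/// original request's path and query string.
pub fn upstream_uri(scheme: &str, upstream: &str, path_and_query: &str) -> String {
    // Accept both "http" and "http://" (assumption about the config format).
    let scheme = scheme.trim_end_matches("://");
    // An empty path is forwarded as "/".
    let path_and_query = if path_and_query.is_empty() { "/" } else { path_and_query };
    format!("{scheme}://{upstream}{path_and_query}")
}
```

The path and query are passed through byte for byte. Re-encoding or normalizing them in the proxy could change how the upstream interprets the request.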

The hyper Client is created once at startup and shared via axum's `State`. It
must be configured with:

- Connection pooling (hyper default behavior)
- Connect timeout: 5 seconds
- Request timeout: 60 seconds
- No redirect following (proxies should not follow redirects)

### 4. Error Handling

| Upstream Condition | Response | Notes |
|-------------------|----------|-------|
| Upstream reachable | Stream response as-is | Headers, status, body all forwarded |
| Upstream unreachable | 502 Bad Gateway | Logged at `warn` level |
| Upstream timeout | 504 Gateway Timeout | Logged at `warn` level |
| Request body too large | 413 Payload Too Large | From `DefaultBodyLimit` middleware |
| Rate limit exceeded | 429 Too Many Requests | Logged at `info` level |
| Unknown Host header | 404 Not Found | No matching site definition |

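The failure rows of the table map naturally onto an enum. The sketch below is one possible encoding, not the project's actual type: `ProxyFailure` and its method names are hypothetical, and the log level is left unspecified where the table does not name one.

```rust
/// Failure cases from the error-handling table (hypothetical type name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ProxyFailure {
    UpstreamUnreachable,
    UpstreamTimeout,
    BodyTooLarge,
    RateLimited,
    UnknownHost,
}

impl ProxyFailure {
    /// HTTP status code and reason phrase returned to the client.
    fn status(self) -> (u16, &'static str) {
        match self {
            Self::UpstreamUnreachable => (502, "Bad Gateway"),
            Self::UpstreamTimeout => (504, "Gateway Timeout"),
            Self::BodyTooLarge => (413, "Payload Too Large"),
            Self::RateLimited => (429, "Too Many Requests"),
            Self::UnknownHost => (404, "Not Found"),
        }
    }

    /// Log level named in the table; `None` where the table gives none.
    fn log_level(self) -> Option<&'static str> {
        match self {
            Self::UpstreamUnreachable | Self::UpstreamTimeout => Some("warn"),
            Self::RateLimited => Some("info"),
            Self::BodyTooLarge | Self::UnknownHost => None,
        }
    }
}
```

Because the `match` arms are exhaustive, adding a new failure case later forces both the status and the log level to be decided at compile time.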

### 5. HTTP → HTTPS Redirect

A separate HTTP listener on port 80 handles the redirect. It reads the `Host`
header from the incoming request and returns a 301 Permanent Redirect to the
HTTPS equivalent URL (preserving the path and query string).

This listener runs on the same bind address as the TLS listener but on port 80.

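The `Location` value for the 301 can be derived from the `Host` header and the request target alone. The following is a sketch under assumptions: `https_redirect_location` is a hypothetical name, and dropping an explicit port from `Host` (so `git.alk.dev:80` redirects to the default HTTPS port) is an assumed behavior that this document does not spell out.

```rust
/// Computes the `Location` header for the HTTP to HTTPS redirect.
/// (Hypothetical helper; port stripping is an assumption.)
fn https_redirect_location(host: &str, path_and_query: &str) -> String {
    // Drop a trailing ":port" if present. The digit check keeps a bare
    // bracketed IPv6 literal such as "[::1]" intact.
    let hostname = match host.rsplit_once(':') {
        Some((name, port)) if port.bytes().all(|b| b.is_ascii_digit()) => name,
        _ => host,
    };
    // Preserve the original path and query; default to "/" if empty.
    let target = if path_and_query.is_empty() { "/" } else { path_and_query };
    format!("https://{hostname}{target}")
}
```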

## Upstream Connection

The upstream connection scheme defaults to `http://` since the proxy and backend
services typically run on the same host (e.g., `127.0.0.1:3000`). The
`upstream_scheme` field in each site's configuration allows specifying `https://`
for upstreams that require TLS (e.g., separate hosts or secure internal services).

For the initial deployment (`git.alk.dev` → `127.0.0.1:3000`), the upstream
connection uses plain HTTP, as TLS between the proxy and Gitea on loopback is
unnecessary.

## Body Size Limit

axum's `DefaultBodyLimit` layer sets the maximum request body size. For
compatibility with Gitea's push operations (large pack files), this defaults
to 100 MB. In Phase 1, the body limit is a global setting; Phase 2 may add
per-site body limits.

## Design Decisions

All design decisions are documented as ADRs in [decisions/](decisions/).

| ADR | Decision | Summary |
|-----|----------|---------|
| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain: simpler than a general proxy library |
| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-06**: Should upstream timeouts be configurable per-site? (open; Phase 1
  uses global defaults of 5s connect, 60s request)
220
docs/architecture/tls.md
Normal file
@@ -0,0 +1,220 @@
---
status: draft
last_updated: 2026-06-11
---

# TLS Termination

## What It Is

The TLS termination component handles all aspects of encrypted connections:
certificate provisioning (ACME and manual), TLS handshake, SNI-based certificate
selection, and connection wrapping for the axum router.

## Why It Exists

TLS termination is the security boundary between the public internet and our
upstream services. It replaces nginx's `ssl_certificate`, `ssl_protocols`, and
`ssl_ciphers` configuration with a memory-safe Rust implementation using rustls.

## Architecture

```
                  ┌──────────────────────────────────────────────┐
                  │  TLS Termination                             │
                  │                                              │
bind_addr:443 ──► │  TcpListener::bind(bind_addr)                │
                  │    │                                         │
                  │    ▼                                         │
                  │  tokio-rustls::TlsAcceptor                   │
                  │    │                                         │
                  │    ├─ ACME mode:                             │
                  │    │    rustls-acme::ResolvesServerCertAcme  │
                  │    │    (auto-provisions & renews certs)     │
                  │    │                                         │
                  │    └─ Manual mode:                           │
                  │         rustls::ServerConfig                 │
                  │           .with_single_cert(cert_chain, key) │
                  │                                              │
                  │    │                                         │
                  │    ▼                                         │
                  │  TlsStream<TcpStream>                        │
                  │    │                                         │
                  │    ▼                                         │
                  │  hyper::service_fn → axum router             │
                  └──────────────────────────────────────────────┘

bind_addr:80  ──► HTTP listener (redirect to HTTPS, no TLS)
```

## Certificate Provisioning

### ACME Mode (Primary)

Uses `rustls-acme` for automatic certificate provisioning and renewal through
Let's Encrypt. This is the primary mode: no certbot dependency, no cron jobs,
no deploy hooks.

**How it works:**

1. `AcmeCertProvider` configures the ACME client with the domain, cache
   directory, and Let's Encrypt directory (staging or production).
2. `AcmeConfig::new(vec![domain])` creates an ACME configuration for the
   domain.
3. The ACME state machine runs as a background tokio task, handling:
   - Account registration with Let's Encrypt
   - Certificate ordering
   - TLS-ALPN-01 challenge (or HTTP-01 challenge)
   - Certificate issuance
   - Certificate renewal (automatic, ~30 days before expiry)
4. `ResolvesServerCertAcme` is a rustls `ResolvesServerCert` implementation
   that automatically serves the ACME-provisioned certificate.
5. When a new certificate is issued, the resolver updates atomically; no
   restart or signal handling is needed.

**Configuration:**

```toml
[tls]
mode = "acme"
acme_domain = "git.alk.dev"
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
acme_directory = "production" # or "staging" for testing
```

**Cache directory:** The `DirCache` from rustls-acme persists ACME account data,
private keys, and certificates between restarts. This avoids re-provisioning on
every restart.

### Manual Mode (Fallback)

For environments where ACME is not desired (testing, self-signed certs,
corporate CAs, or BYO certificates), the proxy loads certificates from file
paths at startup.

```toml
[tls]
mode = "manual"
cert_path = "/etc/letsencrypt/live/git.alk.dev/fullchain.pem"
key_path = "/etc/letsencrypt/live/git.alk.dev/privkey.pem"
```

Certificate files are loaded once at startup using `rustls_pemfile`. Manual
mode requires a restart to pick up new certificates.

**Why not hot-reload manual certs?** ACME mode handles renewal automatically.
Manual mode is for cases where you control cert rotation externally (certbot,
manual renewal). In that case, a SIGHUP-triggered restart is simpler and more
reliable than file watching. If zero-downtime cert rotation is needed, use ACME
mode.

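The two `[tls]` shapes are mutually exclusive, which suggests modelling them as an enum so that a config can never carry a half-filled mix of ACME and manual fields. The sketch below is standard-library only and purely illustrative: `TlsMode`, `tls_mode_from_table`, and `required` are hypothetical names, and the `HashMap` stands in for an already-parsed `[tls]` table (a real implementation would more likely deserialize the TOML directly).

```rust
use std::collections::HashMap;
use std::path::PathBuf;

/// The two certificate modes from the `[tls]` section (hypothetical type).
#[derive(Debug, PartialEq)]
enum TlsMode {
    Acme { domain: String, cache_dir: PathBuf, directory: String },
    Manual { cert_path: PathBuf, key_path: PathBuf },
}

/// Looks up a required key, reporting which one is missing.
fn required<'a>(table: &HashMap<&str, &'a str>, key: &str) -> Result<&'a str, String> {
    table.get(key).copied().ok_or_else(|| format!("[tls] is missing `{key}`"))
}

/// Validates an already-parsed `[tls]` table (stand-in for real TOML parsing).
fn tls_mode_from_table(table: &HashMap<&str, &str>) -> Result<TlsMode, String> {
    match required(table, "mode")? {
        "acme" => Ok(TlsMode::Acme {
            domain: required(table, "acme_domain")?.to_owned(),
            cache_dir: PathBuf::from(required(table, "acme_cache_dir")?),
            directory: required(table, "acme_directory")?.to_owned(),
        }),
        "manual" => Ok(TlsMode::Manual {
            cert_path: PathBuf::from(required(table, "cert_path")?),
            key_path: PathBuf::from(required(table, "key_path")?),
        }),
        other => Err(format!("unknown tls mode `{other}`")),
    }
}
```

Failing at load time with the name of the missing key keeps misconfiguration out of the TLS setup path.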

## TLS Configuration

### Protocol Versions

The proxy supports TLS 1.2 and TLS 1.3 only, matching the minimum security
level of the current nginx configuration. rustls enables exactly these two
versions by default, independent of the crypto provider (`aws_lc_rs` here);
explicit configuration ensures no regression if defaults change in future
rustls releases.

### Cipher Suites

rustls 0.23 with the `aws_lc_rs` crypto provider defaults to a conservative
cipher suite selection that excludes all weak ciphers (no SHA-1, no 3DES, no
RC4, no CBC-mode suites, no RSA key exchange).

The current nginx config explicitly restricts to:

```
ECDHE-ECDSA-AES128-GCM-SHA256
ECDHE-RSA-AES128-GCM-SHA256
ECDHE-ECDSA-AES256-GCM-SHA384
ECDHE-RSA-AES256-GCM-SHA384
```

rustls's defaults include these, the equivalent ECDHE ChaCha20-Poly1305 suites
for TLS 1.2, and the TLS 1.3 suites (which nginx's config also allows via
`TLSv1.3`). The default rustls cipher list is a strict subset of what browsers
accept.

See [open-questions.md](open-questions.md) OQ-01 for whether to further
restrict cipher suites beyond rustls defaults.

### ServerConfig Construction

For manual mode, the `ServerConfig` is built with `with_no_client_auth()` and
`with_single_cert()`, loading the certificate chain and private key from disk.

For ACME mode, the `ServerConfig` is built with `with_cert_resolver()`, passing
the `ResolvesServerCertAcme` resolver. The ACME TLS-ALPN-01 protocol identifier
(`acme-tls/1`) must be registered in the `alpn_protocols` list so the server
can respond to TLS-ALPN-01 challenges.

Both modes use the `aws_lc_rs` crypto provider with safe default protocol
versions (TLS 1.2 and TLS 1.3).

## SNI-Based Certificate Selection

### Current (Single Domain)

For single-domain setups, SNI selection is trivial: there's only one
certificate, so `with_single_cert()` or `ResolvesServerCertAcme` (which
handles the domain) is sufficient.

### Future (Multi-Domain)

When multiple domains are served, SNI selection works as follows:

1. **TLS handshake**: The client sends the SNI extension indicating which
   hostname it's connecting to.
2. **Certificate resolution**: In ACME mode, `ResolvesServerCertAcme` handles
   this automatically, storing certificates keyed by domain. In manual mode,
   a custom `ResolvesServerCert` implementation maps SNI hostname to the
   correct `CertifiedKey`.
3. **HTTP routing**: After the TLS handshake, axum's `Host` extractor routes
   the request to the correct site handler based on the `Host` header.

This is the same pattern nginx uses: SNI selects the cert during TLS, then
the `Host` header selects the server block.

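The manual-mode lookup in step 2 amounts to a case-insensitive map from hostname to certificate. The sketch below shows only that lookup, using the standard library: `SniCertMap` and `normalize` are hypothetical names, and the type parameter `K` stands in for rustls's `CertifiedKey`. In a real resolver, `resolve` would presumably be called from `ResolvesServerCert::resolve` with the name returned by `ClientHello::server_name()`.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Maps SNI hostnames to certificates (hypothetical type; `K` stands in
/// for `rustls::sign::CertifiedKey`).
struct SniCertMap<K> {
    by_host: HashMap<String, Arc<K>>,
}

impl<K> SniCertMap<K> {
    fn new() -> Self {
        Self { by_host: HashMap::new() }
    }

    /// Registers the certificate to serve for `hostname`.
    fn insert(&mut self, hostname: &str, key: K) {
        self.by_host.insert(normalize(hostname), Arc::new(key));
    }

    /// Returns the certificate for the SNI name, or `None` if the client
    /// sent no SNI or the name is not configured.
    fn resolve(&self, sni: Option<&str>) -> Option<Arc<K>> {
        self.by_host.get(&normalize(sni?)).cloned()
    }
}

/// DNS names are case-insensitive; also tolerate a trailing root dot.
fn normalize(hostname: &str) -> String {
    hostname.trim_end_matches('.').to_ascii_lowercase()
}
```

Returning `None` for an unknown name leaves the policy decision (abort the handshake or serve a default certificate) to the caller.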

## HTTP Listener (Port 80)

The HTTP listener on port 80 is a plain TCP listener with no TLS. It has one
job: redirect all requests to the HTTPS equivalent.

The listener binds to the same IP address as the TLS listener, but on port 80.

### ACME Challenge Type

The default ACME challenge type is **TLS-ALPN-01**, since the proxy already
listens on port 443. This avoids requiring a separate HTTP-01 challenge server.
HTTP-01 is available as a fallback for environments where TLS-ALPN-01 is not
suitable (e.g., behind a CDN that terminates TLS). When using HTTP-01, the
port 80 listener serves `/.well-known/acme-challenge/{token}` paths for
challenge verification.

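When HTTP-01 is enabled, the port 80 listener has to tell challenge requests apart from requests it should redirect. A minimal sketch of that check follows; `acme_challenge_token` is a hypothetical helper, and rejecting anything outside the base64url alphabet follows the token format in RFC 8555, which also keeps path-traversal input from being treated as a token.

```rust
const CHALLENGE_PREFIX: &str = "/.well-known/acme-challenge/";

/// Extracts the HTTP-01 token from a request path, if the path is a
/// well-formed challenge URL. (Hypothetical helper name.)
fn acme_challenge_token(path: &str) -> Option<&str> {
    let token = path.strip_prefix(CHALLENGE_PREFIX)?;
    // ACME tokens use only the base64url alphabet (RFC 8555).
    let is_base64url = |b: u8| b.is_ascii_alphanumeric() || b == b'-' || b == b'_';
    (!token.is_empty() && token.bytes().all(is_base64url)).then_some(token)
}
```

Any path for which this returns `None` would fall through to the normal HTTPS redirect.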

## Key Files and Crates

| Component | Crate | Purpose |
|-----------|-------|---------|
| TLS acceptor | `tokio-rustls` 0.26 | Async TLS handshake over TCP streams |
| TLS config | `rustls` 0.23 | ServerConfig, CryptoProvider, cipher suites |
| ACME client | `rustls-acme` 0.12 | Automatic cert provisioning and renewal |
| PEM parsing | `rustls-pemfile` 2 | Load cert/key from PEM files (manual mode) |
| PKI types | `rustls-pki-types` 1 | CertificateDer, PrivateKeyDer |

## Design Decisions

All design decisions are documented as ADRs in [decisions/](decisions/).

| ADR | Decision | Summary |
|-----|----------|---------|
| [004](decisions/004-rustls-acme.md) | ACME-primary cert management | Eliminates certbot; automatic provisioning and renewal |
| [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly | Full control over TLS config and ACME resolver integration |

## Open Questions

Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:

- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)