Expand architecture: multi-site Phase 1, multi-domain TLS, fix review issues

Promote multi-site support from Phase 2 to Phase 1 (ADR-010): the proxy
must support git.alk.dev and alk.dev from initial release. Add multi-domain
TLS configuration (ADR-011): acme_domains array replaces acme_domain string,
single SAN certificate via rustls-acme.

Key changes:
- ADR-010: Multi-site in Phase 1 — avoids config format migration later
- ADR-011: Multi-domain TLS — single SAN cert, acme_domains Vec<String>
- ADR-002: Updated rationale for multi-site (one upstream per domain)
- overview.md: Phase 1 now includes multi-site, alk.dev pass-through,
  dual licensing (MIT OR Apache-2.0), real IP removed
- config.md: acme_domain → acme_domains, TOML example shows both sites,
  validation adds unique host check, real IP replaced with 203.0.113.10
- tls.md: Multi-domain SNI section moved from Future to current, manual
  mode uses ResolvesServerCert for SNI mapping, TOML header fixed
- proxy.md: Updated for multi-site, removed single-domain language
- operations.md: RFC 5737 documentation IPs, clarified rate limit eviction
  semantics (distinct scan interval vs eviction age)
- open-questions.md: OQ-05 resolved (single bind_addr sufficient), new
  OQ-07 (per-site TLS overrides)

Review fixes:
- acme_domains (plural) consistently used across all docs and diagram
- ADR-011 clearly scopes acme_domain as previous design
- Inline decision rationale extracted: tls.md hot-reload → ADR-004 ref,
  config.md static/dynamic → ADR-008 ref
- TOML section headers consistent (server.tls)
2026-06-11 08:50:03 +00:00
parent 8ee6284b62
commit 7efc142406
10 changed files with 356 additions and 108 deletions


@@ -14,6 +14,10 @@ memory-safe Rust/axum reverse proxy. The primary motivation is CVE-2026-42945
 (unauthenticated RCE in nginx's rewrite module) and the broader pattern of
 memory corruption bugs in nginx's C codebase.
+The proxy supports multiple domains from initial release (git.alk.dev and
+alk.dev), with per-domain host-based routing and a single multi-domain SAN
+certificate via ACME.
 ## Architecture Documents
 | Document | Status | Description |
@@ -37,6 +41,8 @@ memory corruption bugs in nginx's C codebase.
 | [007](decisions/007-custom-log-format.md) | Custom Structured Log Format | Accepted |
 | [008](decisions/008-static-dynamic-config-split.md) | Static/Dynamic Config Split with ArcSwap | Accepted |
 | [009](decisions/009-signal-handling.md) | Signal Handling Strategy | Accepted |
+| [010](decisions/010-multi-site-phase1.md) | Multi-Site Support in Phase 1 | Accepted |
+| [011](decisions/011-multi-domain-tls.md) | Multi-Domain TLS Configuration | Accepted |
 ## Open Questions
@@ -48,8 +54,9 @@ See [open-questions.md](open-questions.md) for the full tracker.
 | ~~OQ-02~~ | ~~What log format should fail2ban consume?~~ | ~~high~~ | **resolved** (ADR-007) |
 | OQ-03 | Should the health check endpoint be on a separate port? | low | open |
 | OQ-04 | Config reload: SIGHUP only or also Unix socket API? | low | open |
-| OQ-05 | Should the proxy bind to multiple addresses? | low | open |
+| ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) |
 | OQ-06 | Should upstream timeouts be configurable per-site? | low | open |
+| OQ-07 | Should per-site TLS overrides be supported for mixed ACME/manual domains? | low | open |
 ## Document Lifecycle


@@ -39,7 +39,7 @@ config.toml
 │ http_port │ │ rate_limit │
 │ https_port │ │ body_limit │
 │ tls.mode │ │ proxy_headers │
-│ tls.acme_domain │ │ │
+│ tls.acme_domains │ │ │
 │ tls.cert_path │ │ ← ArcSwap → │
 │ tls.key_path │ │ ConfigReloadHandle │
 │ tls.cache_dir │ │ .reload(new_config) │
@@ -59,11 +59,11 @@ Immutable after startup. Changes require a process restart.
 | Field | Type | Description |
 |-------|------|-------------|
-| `bind_addr` | `String` | IP address to bind to (e.g., `"15.235.125.95"`) |
+| `bind_addr` | `String` | IP address to bind to (must be explicit, no `0.0.0.0`) |
 | `http_port` | `u16` | Port for HTTP→HTTPS redirect (default: `80`; set to `0` to disable) |
 | `https_port` | `u16` | Port for TLS listener (default: `443`) |
 | `tls.mode` | `"acme"` or `"manual"` | Certificate provisioning mode |
-| `tls.acme_domain` | `String` | Domain for ACME (ACME mode only) |
+| `tls.acme_domains` | `Vec<String>` | Domains for ACME SAN certificate (ACME mode only) |
 | `tls.acme_cache_dir` | `String` | ACME state cache directory |
 | `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory |
 | `tls.cert_path` | `String` | Certificate file path (manual mode only) |
@@ -71,9 +71,10 @@ Immutable after startup. Changes require a process restart.
 | `log_level` | `"trace"`, `"debug"`, `"info"`, `"warn"`, `"error"` | Logging verbosity |
 | `log_format` | `"text"` or `"json"` | Log output format |
-**Why these are static:** Changing bind addresses, ports, or TLS mode requires
-creating new listeners and TLS configurations — operations that fundamentally
-require a restart. There's no safe way to change these at runtime.
+**Why these are static:** See ADR-008 for the rationale behind the
+static/dynamic split. In summary: changing bind addresses, ports, or TLS mode
+requires creating new listeners and TLS configurations — operations that
+fundamentally require a restart.
 ### DynamicConfig
@@ -95,10 +96,10 @@ connections immediately.
 | `upstream` | `String` | Upstream address (e.g., `"127.0.0.1:3000"`) |
 | `upstream_scheme` | `"http"` or `"https"` | Protocol for upstream connection (default: `"http"`) |
-**Why these are dynamic:** Site definitions and rate limits are per-request
-concerns. Adding a site or changing a rate limit should not require restarting
-the proxy and dropping active connections. Rate limits and body limits are
-global settings in Phase 1; per-site configuration for these may be added in
+**Why these are dynamic:** See ADR-008 for the rationale. Site definitions
+and rate limits are per-request concerns that should not require restarting
+the proxy or dropping active connections. Rate limits and body limits are
+global settings in Phase 1; per-site configuration for these is deferred to
 Phase 2.
 ## Config Reload
@@ -136,13 +137,13 @@ config reload, but SIGHUP is sufficient for Phase 1.
 # reverse-proxy config
 [server]
-bind_addr = "15.235.125.95"
+bind_addr = "203.0.113.10" # Replace with actual bind address
 http_port = 80
 https_port = 443
 [server.tls]
 mode = "acme" # "acme" or "manual"
-acme_domain = "git.alk.dev"
+acme_domains = ["git.alk.dev", "alk.dev"]
 acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
 acme_directory = "production" # "production" or "staging"
@@ -166,6 +167,11 @@ limit_bytes = 104857600 # 100 MB
 host = "git.alk.dev"
 upstream = "127.0.0.1:3000"
 upstream_scheme = "http"
+[[sites]]
+host = "alk.dev"
+upstream = "127.0.0.1:8080"
+upstream_scheme = "http"
 ```
 ### Validation
@@ -173,12 +179,13 @@ upstream_scheme = "http"
 On startup, the config is validated:
 1. `bind_addr` is not `0.0.0.0` (must be explicit)
-2. In ACME mode, `acme_domain` must be set
+2. In ACME mode, `acme_domains` must be non-empty
 3. In manual mode, `cert_path` and `key_path` must both be set and the files
 must be readable
 4. Each site must have a `host` and `upstream`
-5. `rate_limit.requests_per_second` must be > 0
-6. `body.limit_bytes` must be > 0
+5. Site `host` values must be unique (no duplicate hostnames)
+6. `rate_limit.requests_per_second` must be > 0
+7. `body.limit_bytes` must be > 0
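
The seven startup checks in this hunk could be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's implementation: the `Config`, `TlsMode`, and `Site` types, the `validate` function, and the flattened field names are hypothetical, and treating hostnames as case-insensitive for the uniqueness check is an assumption the source does not state.

```rust
use std::collections::HashSet;
use std::fs::File;

// Hypothetical, trimmed-down config types; the real structs are not shown in this commit.
pub enum TlsMode {
    Acme { acme_domains: Vec<String> },
    Manual { cert_path: String, key_path: String },
}

pub struct Site {
    pub host: String,
    pub upstream: String,
}

pub struct Config {
    pub bind_addr: String,
    pub tls: TlsMode,
    pub sites: Vec<Site>,
    pub requests_per_second: u32,
    pub body_limit_bytes: u64,
}

/// Applies the validation rules in order and returns the first failure.
pub fn validate(cfg: &Config) -> Result<(), String> {
    // Rule 1: the bind address must be explicit.
    if cfg.bind_addr == "0.0.0.0" {
        return Err("bind_addr must be an explicit address, not 0.0.0.0".to_string());
    }
    match &cfg.tls {
        // Rule 2: ACME mode needs at least one domain.
        TlsMode::Acme { acme_domains } => {
            if acme_domains.is_empty() {
                return Err("acme_domains must be non-empty in ACME mode".to_string());
            }
        }
        // Rule 3: manual mode needs both paths, and both files must open for reading.
        TlsMode::Manual { cert_path, key_path } => {
            for path in [cert_path, key_path] {
                if path.is_empty() {
                    return Err("cert_path and key_path must both be set".to_string());
                }
                File::open(path).map_err(|e| format!("cannot read {}: {}", path, e))?;
            }
        }
    }
    // Rules 4 and 5: every site is complete and hostnames are unique.
    // Lowercasing before comparison is an assumption of this sketch.
    let mut seen = HashSet::new();
    for site in &cfg.sites {
        if site.host.is_empty() || site.upstream.is_empty() {
            return Err("each site needs a host and an upstream".to_string());
        }
        if !seen.insert(site.host.to_ascii_lowercase()) {
            return Err(format!("duplicate site host: {}", site.host));
        }
    }
    // Rules 6 and 7: limits must be positive.
    if cfg.requests_per_second == 0 {
        return Err("rate_limit.requests_per_second must be > 0".to_string());
    }
    if cfg.body_limit_bytes == 0 {
        return Err("body.limit_bytes must be > 0".to_string());
    }
    Ok(())
}
```

Returning the first error keeps the sketch short; collecting all failures into a list would be an equally reasonable choice for operator-facing output.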
 On SIGHUP reload, the same validation applies. If the new config fails
 validation, the reload is rejected and the old config remains active. An error
@@ -196,6 +203,8 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 |-----|----------|---------|
 | [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
 | [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config split | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
+| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
+| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains |
 ## Open Questions
@@ -204,3 +213,5 @@ questions affecting this document:
 - **OQ-04**: Should config reload support a Unix domain socket API in addition
 to SIGHUP? (open)
+- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual
+domains? (open)


@@ -16,8 +16,9 @@ available:
 2. **Custom handler** (Felix Knorr pattern): Build a handler using hyper's
 `Client` to forward requests. ~50-100 lines of Rust for our needs.
-Our use case is minimal: single upstream per domain, single domain, no load
-balancing, no retry, no HTTP/2 proxying.
+Our use case is minimal: single upstream per domain, no load balancing, no
+retry, no HTTP/2 proxying. While the proxy supports multiple domains
+(ADR-010), each domain routes to exactly one upstream.
 ## Decision
@@ -31,6 +32,8 @@ project's channel proxy.
 path-based routing to multiple backends)
 - Our proxy case is the simplest possible: match a Host header, forward the
 entire request to a single upstream, stream the response back
+- Multi-domain support (ADR-010) doesn't change this — each domain still maps
+to one upstream
 - The Felix Knorr pattern is proven, idiomatic, and ~50-100 lines
 - We maintain full control over header injection, error handling, and upstream
 connection behavior
@@ -46,11 +49,12 @@ project's channel proxy.
 **Negative:**
 - We implement and maintain proxy logic ourselves (but it's trivial for our
-use case)
+use case — each domain maps to one upstream)
 - If requirements grow to load balancing or retry, we'd need to add that
 ourselves or switch to `axum-reverse-proxy`
 ## References
 - [proxy.md](../proxy.md)
+- [ADR-010](010-multi-site-phase1.md) (multi-site in Phase 1)
 - Felix Knorr, "Replacing nginx with axum" (felix-knorr.net/posts/2024-10-13-replacing-nginx-with-axum.html)


@@ -0,0 +1,90 @@
# ADR-010: Multi-Site Support in Phase 1
## Status
Accepted
## Context
The original architecture phased multi-site support into Phase 2, treating
Phase 1 as a single-domain replacement for nginx serving only `git.alk.dev`.
This was based on the assumption that only one domain needed proxying initially.
However, `alk.dev` (the bare domain) will need proxying in the near future.
While `alk.dev` is a simple case — proxying to a Deno/Fresh container with no
special requirements — the proxy must support multiple sites from day one. The
config format, routing logic, and TLS certificate provisioning all need
multi-site awareness.
Additionally, `api.alk.dev` is explicitly out of scope (it runs its own
HTTP/2+ server natively), but the proxy must not prevent future sites from
being added.
The cost of deferring multi-site is high: we'd need a config format migration,
routing logic rewrite, and TLS cert management changes later. Supporting
multi-site from the start costs very little — the config format just uses an
array of sites (which it already does), host-based routing is trivial in axum,
and `rustls-acme` supports multi-domain certificates natively.
## Decision
Move multi-site support from Phase 2 into Phase 1. The proxy supports multiple
sites from the initial release:
- `[[sites]]` array in config (already the planned format)
- Host-based routing via axum's `Host` extractor (already the planned approach)
- Multi-domain ACME certificate provisioning via `rustls-acme`
- Each site maps a hostname to an upstream address
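
The host-to-upstream mapping listed above can be sketched as a plain lookup table. This is a minimal, standard-library-only illustration: `SiteTable` and `upstream_for` are hypothetical names, and the ADR itself plans to obtain the host via axum's `Host` extractor rather than from a raw header string. Stripping a numeric port suffix and lowercasing the name are assumptions of this sketch.

```rust
use std::collections::HashMap;

/// Hypothetical routing table built from the `[[sites]]` entries.
pub struct SiteTable {
    by_host: HashMap<String, String>,
}

impl SiteTable {
    /// Builds the table from (host, upstream) pairs.
    pub fn new<'a>(sites: impl IntoIterator<Item = (&'a str, &'a str)>) -> Self {
        let by_host = sites
            .into_iter()
            .map(|(host, upstream)| (host.to_ascii_lowercase(), upstream.to_string()))
            .collect();
        Self { by_host }
    }

    /// Returns the single upstream for a Host header value, if the site is configured.
    pub fn upstream_for(&self, host_header: &str) -> Option<&str> {
        // Drop a trailing ":port" only when the suffix is purely numeric.
        let host = match host_header.rsplit_once(':') {
            Some((name, port)) if !port.is_empty() && port.bytes().all(|b| b.is_ascii_digit()) => name,
            _ => host_header,
        };
        self.by_host.get(&host.to_ascii_lowercase()).map(String::as_str)
    }
}
```

With the two sites from the example config, `upstream_for("alk.dev")` would yield `127.0.0.1:8080`, while an unconfigured name such as `api.alk.dev` yields `None`; how the proxy responds in that case is not specified in this ADR.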
Phase 1 scope becomes:
1. Multi-site reverse proxy with TLS termination
2. ACME certificate management (multi-domain)
3. HTTP → HTTPS redirect
4. Rate limiting, logging, health check, graceful shutdown
5. Systemd integration
Phase 2 scope shifts to operational hardening:
1. Per-site rate limits and body limits
2. Per-site upstream timeouts
3. Metrics endpoint (Prometheus-compatible)
4. Connection limits and timeouts
5. Log rotation
Phase 3 remains future enhancements.
## Rationale
- The config format already uses `[[sites]]` — no format change needed
- Host-based routing is the natural axum pattern and was already planned
- `rustls-acme` accepts `Vec<domain>` — multi-domain is its default usage
- The cost of adding multi-site later (config migration, routing rewrite,
cert management changes) far exceeds the cost of supporting it now (zero
additional complexity)
- `alk.dev` is confirmed as a near-term need, not a hypothetical
- The proxy's value proposition is being a memory-safe reverse proxy for *our
infrastructure*, which has multiple domains
## Consequences
**Positive:**
- No config format migration needed later
- `alk.dev` can be added to the config without code changes
- TLS cert management handles multiple domains from the start
- Eliminates an entire phase of work
**Negative:**
- Slightly more testing surface (must verify correct routing with multiple
sites)
- Must test multi-domain ACME provisioning (not just single-domain)
- Wildcard or fallback site behavior needs to be defined (addressed in
OQ-07)
## References
- [overview.md](../overview.md)
- [config.md](../config.md)
- [tls.md](../tls.md)
- [proxy.md](../proxy.md)
- ADR-002 (custom proxy handler — rationale updated for multi-site)


@@ -0,0 +1,92 @@
# ADR-011: Multi-Domain TLS Configuration
## Status
Accepted
## Context
With multi-site support in Phase 1 (ADR-010), the TLS configuration must
support multiple domains. The previous design used a single `tls.acme_domain`
string field, which only works for one domain.
There are several approaches to multi-domain TLS:
1. **Single ACME config with domain list**: `acme_domains = ["git.alk.dev",
"alk.dev"]` — one certificate covering all domains (SAN certificate)
2. **Per-site TLS configuration**: Each site entry specifies its own TLS
mode (ACME or manual) and domain — more flexible but complex
3. **Hybrid**: A global TLS section with ACME domains, plus per-site overrides
for manual certificates
For our use case, all proxied domains use the same ACME certificate authority
(Let's Encrypt) and the same challenge type (TLS-ALPN-01). There's no need
for per-site TLS configuration in Phase 1.
## Decision
Use a single ACME configuration with a list of domains, producing one SAN
certificate covering all proxied domains. Manual mode uses certificate file
paths (single cert file with all domains, or one cert per domain resolved via
SNI).
The config format changes from the previous single-domain format:
```toml
# Previous (single-domain) format — no longer used
[tls]
mode = "acme"
acme_domain = "git.alk.dev" # single string
```
To the current multi-domain format:
```toml
[tls]
mode = "acme"
acme_domains = ["git.alk.dev", "alk.dev"] # array of strings
```
In ACME mode, `rustls-acme` provisions a single certificate covering all
listed domains via Subject Alternative Names (SAN). This is the standard
Let's Encrypt approach for multi-domain certificates.
In manual mode, the cert and key files must cover all domains (either a SAN
certificate or separate certificates resolved via SNI).
## Rationale
- A single SAN certificate is simpler to manage (one renewal, one cert)
- Let's Encrypt supports SAN certificates with up to 100 domains
- `rustls-acme` accepts `Vec<String>` for domain lists — this is its natural
API
- All our domains use the same ACME configuration (Let's Encrypt production,
TLS-ALPN-01 challenge)
- Per-site TLS overrides add complexity with no current benefit
- If per-site TLS configuration is needed later (e.g., a site with a manual
cert), it can be added as an optional override without changing the global
config structure
## Consequences
**Positive:**
- Single certificate for all domains — simpler renewal, simpler cert management
- Matches `rustls-acme`'s natural API (`AcmeConfig::new(domains: Vec<String>)`)
- All domains in one cert means SNI resolution is handled by ACME automatically
- Config format is a minimal change from single-domain
**Negative:**
- Adding or removing a domain requires re-provisioning the certificate (ACME
handles this automatically, but it means cert changes affect all domains)
- If one domain fails ACME validation, the entire cert renewal fails (all
domains must be validated) — mitigated by Let's Encrypt's domain-level
validation
- Per-site TLS configuration (e.g., a domain with a manual cert) requires a
future config extension (OQ-07)
## References
- [tls.md](../tls.md)
- [config.md](../config.md)
- ADR-010 (multi-site in Phase 1)
- ADR-004 (ACME-primary certificate management)


@@ -21,8 +21,6 @@ last_updated: 2026-06-11
 than the current nginx config.
 - **Cross-references**: ADR-005
-## Logging and Monitoring
 ### ~~OQ-02: What log format should fail2ban consume?~~
 - **Origin**: [operations.md](operations.md), [proxy.md](proxy.md)
@@ -33,6 +31,22 @@ last_updated: 2026-06-11
 See ADR-007.
 - **Cross-references**: ADR-007
+### OQ-07: Should per-site TLS overrides be supported for mixed ACME/manual domains?
+- **Origin**: [tls.md](tls.md), [config.md](config.md)
+- **Status**: open
+- **Priority**: low
+- **Context**: Phase 1 uses a single TLS configuration (ACME or manual) for all
+domains. All domains share the same ACME config and certificate. If a future
+domain needs a manual certificate (e.g., a corporate CA cert) while other
+domains use ACME, a per-site TLS override would be needed. This would require
+a custom `ResolvesServerCert` that combines ACME-provisioned certs with
+manually loaded certs. For now, all proxied domains use the same ACME config,
+so this is not needed.
+- **Cross-references**: ADR-011
+## Logging and Monitoring
 ### OQ-03: Should the health check endpoint be on a separate port?
 - **Origin**: [operations.md](operations.md)
@@ -61,15 +75,15 @@ last_updated: 2026-06-11
 ## Deployment
-### OQ-05: Should the proxy bind to multiple addresses or just one?
+### ~~OQ-05: Should the proxy bind to multiple addresses or just one?~~
 - **Origin**: [overview.md](overview.md)
-- **Status**: open
+- **Status**: resolved
 - **Priority**: low
-- **Context**: Current nginx config binds to a specific IP (`15.235.125.95`).
-The proposed config uses `bind_addr` which could be any IP. For Phase 1, the
-config will specify a single IP address. Multi-address binding (listening on
-multiple IPs) is not needed but could be added as an array of addresses.
+- **Resolution**: A single `bind_addr` is sufficient. The proxy binds to one
+explicit IP address (not `0.0.0.0`). Multi-address binding is not needed for
+this single-server deployment. If needed in the future, `bind_addr` could be
+extended to an array. See config.md for the `bind_addr` field.
 - **Cross-references**: None
 ## Proxy


@@ -42,9 +42,10 @@ Requests` and logs the event with structured fields.
 ### State Eviction
 The per-IP token bucket state grows over time as new IPs are seen. A
-background task runs at a configurable interval (default: 60 seconds) and
-removes entries that haven't been accessed within the cleanup interval. This
-prevents unbounded memory growth.
+background task runs every 60 seconds (configurable) and removes entries
+whose last access timestamp is older than a configurable eviction age
+(default: 300 seconds / 5 minutes). This prevents unbounded memory growth
+while preserving recent entries that may still receive traffic.
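
The distinction between the scan interval and the eviction age can be illustrated with a small sketch. It assumes the per-IP state lives in a `HashMap` keyed by `IpAddr`; the `Bucket` struct, the constant names, and `evict_stale` are hypothetical, and the real state is presumably shared across tasks behind some synchronization that is omitted here.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Hypothetical per-IP rate limit state.
pub struct Bucket {
    pub tokens: f64,
    pub last_access: Instant,
}

/// How often the background task scans the map (default from the text: 60 seconds).
pub const SCAN_INTERVAL: Duration = Duration::from_secs(60);
/// How old an entry must be before a scan removes it (default from the text: 300 seconds).
pub const EVICTION_AGE: Duration = Duration::from_secs(300);

/// Removes entries whose last access is older than `max_age`; returns how many were removed.
pub fn evict_stale(state: &mut HashMap<IpAddr, Bucket>, now: Instant, max_age: Duration) -> usize {
    let before = state.len();
    // Keep an entry while its age is within the eviction age.
    state.retain(|_, bucket| now.saturating_duration_since(bucket.last_access) <= max_age);
    before - state.len()
}
```

A background task would call `evict_stale(&mut state, Instant::now(), EVICTION_AGE)` once per `SCAN_INTERVAL`, so an idle entry survives between 300 and roughly 360 seconds with the default values.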
 ### Fail2ban Integration
@@ -55,7 +56,7 @@ format decision.
 The log format uses `key=value` pairs with a `RATE_LIMIT` prefix:
 ```
-RATE_LIMIT client_ip=X.X.X.X host=Y.Z path=/W status=429
+RATE_LIMIT client_ip=203.0.113.50 host=Y.Z path=/W status=429
 ```
 A corresponding fail2ban filter and jail configuration are provided as part
@@ -71,15 +72,15 @@ log entries:
 1. **Access logs**: Every proxied request is logged at `info` level with
 structured fields.
 ```
-REQUEST client_ip=1.2.3.4 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
+REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
 ```
 2. **Event logs**: Rate limits, TLS errors, upstream failures, config reloads,
 etc.
 ```
-RATE_LIMIT client_ip=1.2.3.4 host=git.alk.dev path=/login status=429
+RATE_LIMIT client_ip=203.0.113.50 host=git.alk.dev path=/login status=429
 UPSTREAM_ERROR host=git.alk.dev upstream=127.0.0.1:3000 error="connection refused"
 CONFIG_RELOAD status=success sites=1
 ```
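
As an illustration of the `key=value` event format shown in this hunk, a `RATE_LIMIT` line could be produced as below. This is a sketch only: `rate_limit_line` is a hypothetical helper, and the project may well emit these fields through a logging framework rather than by formatting a string directly.

```rust
use std::net::IpAddr;

/// Formats a RATE_LIMIT event in the key=value layout from the examples above.
/// Hypothetical helper; the status is fixed at 429 because the event is only
/// emitted when a request is rejected by the rate limiter.
pub fn rate_limit_line(client_ip: IpAddr, host: &str, path: &str) -> String {
    format!(
        "RATE_LIMIT client_ip={} host={} path={} status=429",
        client_ip, host, path
    )
}
```

Keeping the prefix and field order fixed is what lets a fail2ban filter match the line with a simple regular expression anchored on `RATE_LIMIT client_ip=`.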


@@ -8,10 +8,12 @@ last_updated: 2026-06-11
 ## Vision
 A memory-safe, minimal reverse proxy that replaces our vulnerable nginx instance
-for forward-proxying to backend services. The proxy terminates TLS, injects
+for forwarding requests to backend services. The proxy terminates TLS, injects
 standard proxy headers, enforces rate limits, and forwards requests to upstream
-services — with operational feature parity for our current single-domain Gitea
-setup.
+services — supporting multiple domains from initial release.
+This project is open source under dual licensing: MIT OR Apache-2.0, consistent
+with standard Rust project licensing.
 ## Why This Exists
@@ -35,38 +37,44 @@ details.
 ### In Scope
-- **Phase 1**: Replace nginx for `git.alk.dev` with feature parity
-- TLS termination with ACME (Let's Encrypt) certificate management
+- **Phase 1**: Multi-site reverse proxy with TLS termination
+- TLS termination with ACME (Let's Encrypt) multi-domain certificate management
 - Manual certificate paths as fallback mode
 - HTTP → HTTPS redirect
-- Reverse proxy to Gitea at `127.0.0.1:3000`
+- Host-based routing to multiple upstream services
+- Reverse proxy to Gitea at `127.0.0.1:3000` (git.alk.dev)
+- Reverse proxy to Deno/Fresh container for alk.dev (simple pass-through)
 - Proxy header injection (Host, X-Real-IP, X-Forwarded-For, X-Forwarded-Proto)
-- Request rate limiting with fail2ban-compatible logging (global per-IP; per-site in Phase 2)
-- 100 MB body size limit (global; per-site in Phase 2)
+- Request rate limiting with fail2ban-compatible logging (global per-IP)
+- 100 MB body size limit (global)
 - Configurable bind address (no `0.0.0.0` default)
 - Health check endpoint
 - Graceful shutdown (SIGTERM handling)
 - Systemd unit file
+- Dual licensing: MIT OR Apache-2.0
-- **Phase 2**: Multi-site support
-- SNI-based TLS routing for multiple domains
-- Config file for site definitions
-- Dynamic config reload (ArcSwap pattern)
-- **Phase 3**: Operational hardening
+- **Phase 2**: Operational hardening
+- Per-site rate limits and body limits
+- Per-site upstream timeouts
 - Metrics endpoint (Prometheus-compatible)
 - Connection limits and timeouts
 - Log rotation
+- **Phase 3**: Future enhancements
+- Wildcard subdomain support
+- Per-site TLS overrides (manual certs for specific domains)
+- Unix domain socket config reload API
 ### Out of Scope
 - HTTP/2 or HTTP/3 proxying (services that need these run their own native
-Rust servers — e.g., `api.alk.dev`)
+Rust servers — e.g., `api.alk.dev` runs its own HTTP/2+ server)
 - Load balancing or round-robin upstream selection
 - WebSocket proxying (can be added later if needed)
 - Static file serving
 - Access control beyond rate limiting (no auth, no IP allowlists in Phase 1)
 - CGI, SCGI, uWSGI, FastCGI
+- Per-site TLS configuration (all domains share one ACME config in Phase 1)
 ## Architecture
@@ -81,11 +89,14 @@ bind_addr:80 ──► │ HTTP listener → 301 redirect │
 │ │
 bind_addr:443 ──► │ TLS listener (tokio-rustls) │
 │ ├─ ACME mode: rustls-acme resolver │
-│ │ (auto cert provisioning/renewal) │
+│ (multi-domain SAN cert,
+│ │ auto-provision & renew) │
 │ └─ Manual mode: cert/key file paths │
 │ │
 │ axum router │
 │ ├─ Host-based routing │
+│ │ ├─ git.alk.dev → :3000 │
+│ │ └─ alk.dev → :8080 │
 │ ├─ Rate limiting middleware │
 │ ├─ Proxy header injection │
 │ ├─ Body size limit (100MB) │
@@ -147,7 +158,7 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 | ADR | Decision | Summary |
 |-----|----------|---------|
 | [001](decisions/001-rust-axum.md) | Rust with axum | Memory safety eliminates the bug class causing nginx CVEs; axum provides ergonomic tower integration |
-| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — axum-reverse-proxy adds unnecessary complexity |
+| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream per domain — simpler than a general proxy library |
 | [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
 | [004](decisions/004-rustls-acme.md) | ACME-primary certificate management | Eliminates certbot dependency; automatic provisioning and renewal |
 | [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly, not axum-server | Full control over TLS config, ACME resolver integration, cipher suite configuration |
@@ -155,6 +166,8 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 | [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
 | [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config with ArcSwap | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
 | [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP |
+| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release; avoids config migration later |
+| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme |
 ## Open Questions
@@ -163,4 +176,4 @@ questions affecting this document:
 - **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
 - **OQ-03**: Should the health check endpoint be on a separate port? (open)
-- **OQ-05**: Should the proxy bind to multiple addresses or just one? (open)
+- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual domains? (open)

View File

@@ -14,8 +14,9 @@ injection, body size limits), and forwards it to the upstream service.
 ## Why It Exists
 This component replaces nginx's `proxy_pass` directive. For our use case
-(single upstream per domain, no load balancing, no HTTP/2 proxying), a custom
-handler is simpler and more maintainable than a general-purpose proxy library.
+(one upstream per domain across multiple domains, no load balancing, no HTTP/2
+proxying), a custom handler is simpler and more maintainable than a
+general-purpose proxy library (ADR-002, ADR-010).
 ## Architecture
@@ -140,9 +141,9 @@ services typically run on the same host (e.g., `127.0.0.1:3000`). The
 `upstream_scheme` field in each site's configuration allows specifying `https://`
 for upstreams that require TLS (e.g., separate hosts or secure internal services).
-For the initial deployment (`git.alk.dev` → `127.0.0.1:3000`), the upstream
-connection uses plain HTTP, as TLS between the proxy and Gitea on loopback is
-unnecessary.
+For the initial deployment, upstream connections use plain HTTP (e.g.,
+`git.alk.dev` → `127.0.0.1:3000`, `alk.dev` → `127.0.0.1:8080`) since TLS
+between the proxy and backend services on loopback is unnecessary.
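The per-site upstream selection described in this hunk can be illustrated with a small, standard-library-only sketch. It is not the handler from the repository: the `Site` struct, the `upstream_addr` field, and the helper names are hypothetical, and it assumes `upstream_scheme` stores a bare scheme such as `http` or `https`.

```rust
use std::collections::HashMap;

/// Hypothetical per-site settings. Only `upstream_scheme` is named in the
/// docs; `upstream_addr` and the struct itself are assumptions.
#[derive(Debug, Clone, PartialEq)]
struct Site {
    upstream_scheme: String, // assumed to hold "http" or "https"
    upstream_addr: String,   // e.g. "127.0.0.1:3000"
}

/// Lowercase the host and strip a trailing numeric `:port`, if present.
/// Anything else (including bracketed IPv6 literals) is left unchanged.
fn normalize_host(host: &str) -> String {
    let name = match host.rsplit_once(':') {
        Some((name, port))
            if !port.is_empty() && port.bytes().all(|b| b.is_ascii_digit()) =>
        {
            name
        }
        _ => host,
    };
    name.to_ascii_lowercase()
}

/// Build the upstream URL for a request, or return `None` when the Host
/// header does not match any configured site.
fn upstream_url(
    sites: &HashMap<String, Site>,
    host_header: &str,
    path_and_query: &str,
) -> Option<String> {
    let site = sites.get(&normalize_host(host_header))?;
    Some(format!(
        "{}://{}{}",
        site.upstream_scheme, site.upstream_addr, path_and_query
    ))
}
```

With a map holding `git.alk.dev` and `alk.dev`, a request carrying `Host: git.alk.dev:443` would resolve to the first site's upstream, while an unknown host yields `None`, which a caller could turn into an error response.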
 ## Body Size Limit
@@ -157,8 +158,9 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 | ADR | Decision | Summary |
 |-----|----------|---------|
-| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain; simpler than a general proxy library |
+| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | One upstream per domain; simpler than a general proxy library |
 | [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
+| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
 ## Open Questions

View File

@@ -57,10 +57,11 @@ no deploy hooks.
 **How it works:**
-1. `AcmeCertProvider` configures the ACME client with the domain, cache
+1. `AcmeCertProvider` configures the ACME client with the domain list, cache
    directory, and Let's Encrypt directory (staging or production).
-2. `AcmeConfig::new(vec![domain])` creates an ACME configuration for the
-   domain.
+2. `AcmeConfig::new(domains)` creates an ACME configuration for all listed
+   domains. Let's Encrypt will issue a single SAN certificate covering all
+   domains.
 3. The ACME state machine runs as a background tokio task, handling:
    - Account registration with Let's Encrypt
    - Certificate ordering
@@ -75,9 +76,9 @@ no deploy hooks.
 **Configuration:**
 ```toml
-[tls]
+[server.tls]
 mode = "acme"
-acme_domain = "git.alk.dev"
+acme_domains = ["git.alk.dev", "alk.dev"]
 acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
 acme_directory = "production" # or "staging" for testing
 ```
@@ -100,13 +101,8 @@ key_path = "/etc/letsencrypt/live/git.alk.dev/privkey.pem"
``` ```
 Certificate files are loaded once at startup using `rustls_pemfile`. Manual
-mode requires a restart to pick up new certificates.
-**Why not hot-reload manual certs?** ACME mode handles renewal automatically.
-Manual mode is for cases where you control cert rotation externally (certbot,
-manual renewal). In that case, a SIGHUP-triggered restart is simpler and more
-reliable than file watching. If zero-downtime cert rotation is needed, use ACME
-mode.
+mode requires a restart to pick up new certificates. See ADR-004 for the
+rationale behind making ACME the primary mode and manual mode restart-dependent.
 ## TLS Configuration
@@ -142,10 +138,13 @@ restrict cipher suites beyond rustls defaults.
 ### ServerConfig Construction
 For manual mode, the `ServerConfig` is built with `with_no_client_auth()` and
-`with_single_cert()`, loading the certificate chain and private key from disk.
+a custom `ResolvesServerCert` implementation that maps SNI hostnames to
+certificate/key pairs loaded from disk.
 For ACME mode, the `ServerConfig` is built with `with_cert_resolver()`, passing
-the `ResolvesServerCertAcme` resolver. The ACME TLS-ALPN-01 protocol identifier
+the `ResolvesServerCertAcme` resolver. The ACME configuration includes all
+domains listed in `acme_domains`, and the resolver manages a single SAN
+certificate covering all of them. The ACME TLS-ALPN-01 protocol identifier
 (`acme-tls/1`) must be registered in the `alpn_protocols` list so the server
 can respond to TLS-ALPN-01 challenges.
@@ -154,28 +153,39 @@ versions (TLS 1.2 and TLS 1.3).
 ## SNI-Based Certificate Selection
-### Current (Single Domain)
-For single-domain setups, SNI selection is trivial: there's only one
-certificate, so `with_single_cert()` or `ResolvesServerCertAcme` (which
-handles the domain) is sufficient.
-### Future (Multi-Domain)
-When multiple domains are served, SNI selection works as follows:
+### ACME Mode (Multi-Domain)
+In ACME mode, `rustls-acme` manages a single SAN certificate covering all
+configured domains. The `ResolvesServerCertAcme` resolver automatically serves
+the correct certificate during the TLS handshake.
 1. **TLS handshake**: The client sends the SNI extension indicating which
    hostname it's connecting to.
-2. **Certificate resolution**: In ACME mode, `ResolvesServerCertAcme` handles
-   this automatically; it stores certificates keyed by domain. In manual mode,
-   a custom `ResolvesServerCert` implementation maps SNI hostname to the
-   correct `CertifiedKey`.
+2. **Certificate resolution**: `ResolvesServerCertAcme` matches the SNI
+   hostname against the provisioned certificate's Subject Alternative Names
+   and serves the certificate.
 3. **HTTP routing**: After the TLS handshake, axum's `Host` extractor routes
    the request to the correct site handler based on the `Host` header.
 This is the same pattern nginx uses: SNI selects the cert during TLS, then
-`Host` header selects the server block. In manual mode, a `ResolvesServerCert`
-implementation maps SNI hostname to the correct `CertifiedKey`.
+`Host` header selects the server block. ACME mode handles this automatically
+through the cert resolver.
+### Manual Mode (Multi-Domain)
+In manual mode, a custom `ResolvesServerCert` implementation is required to
+map SNI hostnames to the correct `CertifiedKey`. This implementation:
+1. Loads certificate files at startup (or on SIGHUP for reload)
+2. Maps each domain name to its certificate chain and private key
+3. During the TLS handshake, looks up the SNI hostname and returns the
+   matching `CertifiedKey`
+The custom resolver must handle the case where no matching certificate exists
+for the SNI hostname; in this case, the handshake fails, which is the
+correct behavior (we don't serve a default certificate for unknown domains).
+See [open-questions.md](open-questions.md) OQ-07 for per-site TLS overrides.
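The manual-mode lookup described above can be modeled without any TLS dependency. The sketch below is an illustration under stated assumptions, not the project's resolver: `CertEntry` is a hypothetical stand-in for rustls's `CertifiedKey`, and `SniCertMap` is a hypothetical type. In a real implementation the same lookup would sit inside a `ResolvesServerCert::resolve` method, which receives the SNI name from the `ClientHello` and returns an `Option`; returning `None` is what makes the handshake fail for unknown domains.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for rustls's `CertifiedKey`: it records only where
/// the chain and key came from, so the sketch stays dependency-free.
#[derive(Debug, PartialEq)]
struct CertEntry {
    cert_chain_path: String,
    key_path: String,
}

/// Hypothetical SNI-to-certificate table, filled once at startup.
struct SniCertMap {
    by_host: HashMap<String, Arc<CertEntry>>,
}

impl SniCertMap {
    fn new() -> Self {
        SniCertMap { by_host: HashMap::new() }
    }

    /// Register a certificate for a domain. Keys are stored lowercase so
    /// lookups are case-insensitive.
    fn insert(&mut self, domain: &str, entry: CertEntry) {
        self.by_host
            .insert(domain.to_ascii_lowercase(), Arc::new(entry));
    }

    /// Look up the certificate for the SNI name sent by the client.
    /// Returns `None` when the client sent no SNI or the name is unknown;
    /// there is deliberately no default certificate.
    fn resolve(&self, sni: Option<&str>) -> Option<Arc<CertEntry>> {
        let name = sni?;
        self.by_host.get(&name.to_ascii_lowercase()).cloned()
    }
}
```

Sharing entries through `Arc` mirrors how a resolver hands out certificates to many concurrent handshakes without copying key material.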
 ## HTTP Listener (Port 80)
@@ -211,6 +221,8 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 |-----|----------|---------|
 | [004](decisions/004-rustls-acme.md) | ACME-primary cert management | Eliminates certbot; automatic provisioning and renewal |
 | [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly | Full control over TLS config and ACME resolver integration |
+| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
+| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme |
 ## Open Questions
@@ -218,3 +230,5 @@ Open questions are tracked in [open-questions.md](open-questions.md). Key
 questions affecting this document:
 - **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
+- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual
+  domains? (open)