Expand architecture: multi-site Phase 1, multi-domain TLS, fix review issues
Promote multi-site support from Phase 2 to Phase 1 (ADR-010): the proxy must
support git.alk.dev and alk.dev from initial release. Add multi-domain TLS
configuration (ADR-011): acme_domains array replaces acme_domain string,
single SAN certificate via rustls-acme.

Key changes:
- ADR-010: Multi-site in Phase 1 (avoids config format migration later)
- ADR-011: Multi-domain TLS (single SAN cert, acme_domains Vec<String>)
- ADR-002: Updated rationale for multi-site (one upstream per domain)
- overview.md: Phase 1 now includes multi-site, alk.dev pass-through, dual
  licensing (MIT OR Apache-2.0), real IP removed
- config.md: acme_domain → acme_domains, TOML example shows both sites,
  validation adds unique host check, real IP replaced with 203.0.113.10
- tls.md: Multi-domain SNI section moved from Future to current, manual mode
  uses ResolvesServerCert for SNI mapping, TOML header fixed
- proxy.md: Updated for multi-site, removed single-domain language
- operations.md: RFC 5737 documentation IPs, clarified rate limit eviction
  semantics (distinct scan interval vs eviction age)
- open-questions.md: OQ-05 resolved (single bind_addr sufficient), new OQ-07
  (per-site TLS overrides)

Review fixes:
- acme_domains (plural) consistently used across all docs and diagram
- ADR-011 clearly scopes acme_domain as previous design
- Inline decision rationale extracted: tls.md hot-reload → ADR-004 ref,
  config.md static/dynamic → ADR-008 ref
- TOML section headers consistent (server.tls)
@@ -14,6 +14,10 @@ memory-safe Rust/axum reverse proxy. The primary motivation is CVE-2026-42945
|
||||
(unauthenticated RCE in nginx's rewrite module) and the broader pattern of
|
||||
memory corruption bugs in nginx's C codebase.
|
||||
|
||||
The proxy supports multiple domains from initial release (git.alk.dev and
|
||||
alk.dev), with per-domain host-based routing and a single multi-domain SAN
|
||||
certificate via ACME.
|
||||
|
||||
## Architecture Documents
|
||||
|
||||
| Document | Status | Description |
|
||||
@@ -37,6 +41,8 @@ memory corruption bugs in nginx's C codebase.
|
||||
| [007](decisions/007-custom-log-format.md) | Custom Structured Log Format | Accepted |
|
||||
| [008](decisions/008-static-dynamic-config-split.md) | Static/Dynamic Config Split with ArcSwap | Accepted |
|
||||
| [009](decisions/009-signal-handling.md) | Signal Handling Strategy | Accepted |
|
||||
| [010](decisions/010-multi-site-phase1.md) | Multi-Site Support in Phase 1 | Accepted |
|
||||
| [011](decisions/011-multi-domain-tls.md) | Multi-Domain TLS Configuration | Accepted |
|
||||
|
||||
## Open Questions
|
||||
|
||||
@@ -48,8 +54,9 @@ See [open-questions.md](open-questions.md) for the full tracker.
|
||||
| ~~OQ-02~~ | ~~What log format should fail2ban consume?~~ | ~~high~~ | **resolved** (ADR-007) |
|
||||
| OQ-03 | Should the health check endpoint be on a separate port? | low | open |
|
||||
| OQ-04 | Config reload: SIGHUP only or also Unix socket API? | low | open |
|
||||
| OQ-05 | Should the proxy bind to multiple addresses? | low | open |
|
||||
| ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) |
|
||||
| OQ-06 | Should upstream timeouts be configurable per-site? | low | open |
|
||||
| OQ-07 | Should per-site TLS overrides be supported for mixed ACME/manual domains? | low | open |
|
||||
|
||||
## Document Lifecycle
|
||||
|
||||
|
||||
@@ -39,7 +39,7 @@ config.toml
|
||||
│ http_port │ │ rate_limit │
|
||||
│ https_port │ │ body_limit │
|
||||
│ tls.mode │ │ proxy_headers │
|
||||
│ tls.acme_domain │ │ │
|
||||
│ tls.acme_domains │ │ │
|
||||
│ tls.cert_path │ │ ← ArcSwap → │
|
||||
│ tls.key_path │ │ ConfigReloadHandle │
|
||||
│ tls.cache_dir │ │ .reload(new_config) │
|
||||
@@ -59,11 +59,11 @@ Immutable after startup. Changes require a process restart.
|
||||
|
||||
| Field | Type | Description |
|
||||
|-------|------|-------------|
|
||||
| `bind_addr` | `String` | IP address to bind to (e.g., `"15.235.125.95"`) |
|
||||
| `bind_addr` | `String` | IP address to bind to (must be explicit, no `0.0.0.0`) |
|
||||
| `http_port` | `u16` | Port for HTTP→HTTPS redirect (default: `80`; set to `0` to disable) |
|
||||
| `https_port` | `u16` | Port for TLS listener (default: `443`) |
|
||||
| `tls.mode` | `"acme"` or `"manual"` | Certificate provisioning mode |
|
||||
| `tls.acme_domain` | `String` | Domain for ACME (ACME mode only) |
|
||||
| `tls.acme_domains` | `Vec<String>` | Domains for ACME SAN certificate (ACME mode only) |
|
||||
| `tls.acme_cache_dir` | `String` | ACME state cache directory |
|
||||
| `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory |
|
||||
| `tls.cert_path` | `String` | Certificate file path (manual mode only) |
|
||||
@@ -71,9 +71,10 @@ Immutable after startup. Changes require a process restart.
|
||||
| `log_level` | `"trace"`, `"debug"`, `"info"`, `"warn"`, `"error"` | Logging verbosity |
|
||||
| `log_format` | `"text"` or `"json"` | Log output format |
|
||||
|
||||
**Why these are static:** Changing bind addresses, ports, or TLS mode requires
|
||||
creating new listeners and TLS configurations — operations that fundamentally
|
||||
require a restart. There's no safe way to change these at runtime.
|
||||
**Why these are static:** See ADR-008 for the rationale behind the
|
||||
static/dynamic split. In summary: changing bind addresses, ports, or TLS mode
|
||||
requires creating new listeners and TLS configurations — operations that
|
||||
fundamentally require a restart.
|
||||
|
||||
### DynamicConfig
|
||||
|
||||
@@ -95,10 +96,10 @@ connections immediately.
|
||||
| `upstream` | `String` | Upstream address (e.g., `"127.0.0.1:3000"`) |
|
||||
| `upstream_scheme` | `"http"` or `"https"` | Protocol for upstream connection (default: `"http"`) |
|
||||
|
||||
**Why these are dynamic:** Site definitions and rate limits are per-request
|
||||
concerns. Adding a site or changing a rate limit should not require restarting
|
||||
the proxy and dropping active connections. Rate limits and body limits are
|
||||
global settings in Phase 1; per-site configuration for these may be added in
|
||||
**Why these are dynamic:** See ADR-008 for the rationale. Site definitions
|
||||
and rate limits are per-request concerns that should not require restarting
|
||||
the proxy or dropping active connections. Rate limits and body limits are
|
||||
global settings in Phase 1; per-site configuration for these is deferred to
|
||||
Phase 2.
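
The hot-reload behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: it swaps in `RwLock<Arc<DynamicConfig>>` as a dependency-free stand-in for ArcSwap, and the `ConfigHandle` type, its methods, and the two `DynamicConfig` fields are hypothetical names chosen for the example.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical, trimmed-down dynamic config: (host, upstream) pairs plus a rate limit.
#[derive(Debug, Clone, PartialEq)]
pub struct DynamicConfig {
    pub sites: Vec<(String, String)>,
    pub requests_per_second: u32,
}

/// Hypothetical handle; RwLock<Arc<_>> stands in for an ArcSwap-based handle.
pub struct ConfigHandle {
    current: RwLock<Arc<DynamicConfig>>,
}

impl ConfigHandle {
    pub fn new(initial: DynamicConfig) -> Self {
        Self {
            current: RwLock::new(Arc::new(initial)),
        }
    }

    /// Request path: take a snapshot. The lock is held only while the Arc is cloned.
    pub fn load(&self) -> Arc<DynamicConfig> {
        let guard = self.current.read().expect("config lock poisoned");
        Arc::clone(&*guard)
    }

    /// Reload path: replace the shared pointer. Snapshots already handed out
    /// keep pointing at the old config until they are dropped.
    pub fn reload(&self, new_config: DynamicConfig) {
        let mut guard = self.current.write().expect("config lock poisoned");
        *guard = Arc::new(new_config);
    }
}
```

A request that called `load()` before a reload keeps working against the config it started with, which is the property that lets a reload proceed without dropping active connections.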
|
||||
|
||||
## Config Reload
|
||||
@@ -136,13 +137,13 @@ config reload, but SIGHUP is sufficient for Phase 1.
|
||||
# reverse-proxy config
|
||||
|
||||
[server]
|
||||
bind_addr = "15.235.125.95"
|
||||
bind_addr = "203.0.113.10" # Replace with actual bind address
|
||||
http_port = 80
|
||||
https_port = 443
|
||||
|
||||
[server.tls]
|
||||
mode = "acme" # "acme" or "manual"
|
||||
acme_domain = "git.alk.dev"
|
||||
acme_domains = ["git.alk.dev", "alk.dev"]
|
||||
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
|
||||
acme_directory = "production" # "production" or "staging"
|
||||
|
||||
@@ -166,6 +167,11 @@ limit_bytes = 104857600 # 100 MB
|
||||
host = "git.alk.dev"
|
||||
upstream = "127.0.0.1:3000"
|
||||
upstream_scheme = "http"
|
||||
|
||||
[[sites]]
|
||||
host = "alk.dev"
|
||||
upstream = "127.0.0.1:8080"
|
||||
upstream_scheme = "http"
|
||||
```
|
||||
|
||||
### Validation
|
||||
@@ -173,12 +179,13 @@ upstream_scheme = "http"
|
||||
On startup, the config is validated:
|
||||
|
||||
1. `bind_addr` is not `0.0.0.0` (must be explicit)
|
||||
2. In ACME mode, `acme_domain` must be set
|
||||
2. In ACME mode, `acme_domains` must be non-empty
|
||||
3. In manual mode, `cert_path` and `key_path` must both be set and the files
|
||||
must be readable
|
||||
4. Each site must have a `host` and `upstream`
|
||||
5. `rate_limit.requests_per_second` must be > 0
|
||||
6. `body.limit_bytes` must be > 0
|
||||
5. Site `host` values must be unique (no duplicate hostnames)
|
||||
6. `rate_limit.requests_per_second` must be > 0
|
||||
7. `body.limit_bytes` must be > 0
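
The seven rules above could be checked roughly as follows. This is a sketch under stated assumptions: the `Config`, `TlsMode`, and `Site` types are hypothetical simplifications of the documented fields, the file-readability part of rule 3 is omitted, and comparing hostnames case-insensitively for rule 5 is an assumption rather than something the document specifies.

```rust
use std::collections::HashSet;

/// Hypothetical, simplified mirror of the TLS settings.
#[derive(Debug, Clone)]
pub enum TlsMode {
    Acme { acme_domains: Vec<String> },
    Manual { cert_path: Option<String>, key_path: Option<String> },
}

#[derive(Debug, Clone)]
pub struct Site {
    pub host: String,
    pub upstream: String,
}

#[derive(Debug, Clone)]
pub struct Config {
    pub bind_addr: String,
    pub tls: TlsMode,
    pub sites: Vec<Site>,
    pub requests_per_second: u32,
    pub limit_bytes: u64,
}

/// Returns every rule violation instead of stopping at the first one.
pub fn validate(cfg: &Config) -> Vec<String> {
    let mut errors = Vec::new();

    // Rule 1: the bind address must be explicit.
    if cfg.bind_addr == "0.0.0.0" {
        errors.push("bind_addr must not be 0.0.0.0".to_string());
    }

    match &cfg.tls {
        // Rule 2: ACME mode needs at least one domain.
        TlsMode::Acme { acme_domains } => {
            if acme_domains.is_empty() {
                errors.push("acme_domains must be non-empty in ACME mode".to_string());
            }
        }
        // Rule 3: manual mode needs both paths (readability check omitted here).
        TlsMode::Manual { cert_path, key_path } => {
            if cert_path.is_none() || key_path.is_none() {
                errors.push("cert_path and key_path must both be set".to_string());
            }
        }
    }

    // Rules 4 and 5: every site is complete and hostnames are unique.
    let mut seen = HashSet::new();
    for site in &cfg.sites {
        if site.host.is_empty() || site.upstream.is_empty() {
            errors.push("each site needs a host and an upstream".to_string());
        }
        if !seen.insert(site.host.to_ascii_lowercase()) {
            errors.push(format!("duplicate site host: {}", site.host));
        }
    }

    // Rules 6 and 7: limits must be positive.
    if cfg.requests_per_second == 0 {
        errors.push("rate_limit.requests_per_second must be > 0".to_string());
    }
    if cfg.limit_bytes == 0 {
        errors.push("body.limit_bytes must be > 0".to_string());
    }

    errors
}
```

Collecting all violations, rather than returning on the first, is a design choice for the sketch: it lets a rejected reload report every problem in one log entry.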
|
||||
|
||||
On SIGHUP reload, the same validation applies. If the new config fails
|
||||
validation, the reload is rejected and the old config remains active. An error
|
||||
@@ -196,6 +203,8 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
|
||||
|-----|----------|---------|
|
||||
| [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
|
||||
| [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config split | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
|
||||
| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
|
||||
| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains |
|
||||
|
||||
## Open Questions
|
||||
|
||||
@@ -204,3 +213,5 @@ questions affecting this document:
|
||||
|
||||
- **OQ-04**: Should config reload support a Unix domain socket API in addition
|
||||
to SIGHUP? (open)
|
||||
- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual
|
||||
domains? (open)
|
||||
@@ -16,8 +16,9 @@ available:
|
||||
2. **Custom handler** (Felix Knorr pattern): Build a handler using hyper's
|
||||
`Client` to forward requests. ~50-100 lines of Rust for our needs.
|
||||
|
||||
Our use case is minimal: single upstream per domain, single domain, no load
|
||||
balancing, no retry, no HTTP/2 proxying.
|
||||
Our use case is minimal: single upstream per domain, no load balancing, no
|
||||
retry, no HTTP/2 proxying. While the proxy supports multiple domains
|
||||
(ADR-010), each domain routes to exactly one upstream.
|
||||
|
||||
## Decision
|
||||
|
||||
@@ -31,6 +32,8 @@ project's channel proxy.
|
||||
path-based routing to multiple backends)
|
||||
- Our proxy case is the simplest possible: match a Host header, forward the
|
||||
entire request to a single upstream, stream the response back
|
||||
- Multi-domain support (ADR-010) doesn't change this — each domain still maps
|
||||
to one upstream
|
||||
- The Felix Knorr pattern is proven, idiomatic, and ~50-100 lines
|
||||
- We maintain full control over header injection, error handling, and upstream
|
||||
connection behavior
|
||||
@@ -46,11 +49,12 @@ project's channel proxy.
|
||||
|
||||
**Negative:**
|
||||
- We implement and maintain proxy logic ourselves (but it's trivial for our
|
||||
use case)
|
||||
use case — each domain maps to one upstream)
|
||||
- If requirements grow to load balancing or retry, we'd need to add that
|
||||
ourselves or switch to `axum-reverse-proxy`
|
||||
|
||||
## References
|
||||
|
||||
- [proxy.md](../proxy.md)
|
||||
- [ADR-010](010-multi-site-phase1.md) (multi-site in Phase 1)
|
||||
- Felix Knorr, "Replacing nginx with axum" (felix-knorr.net/posts/2024-10-13-replacing-nginx-with-axum.html)
|
||||
90
docs/architecture/decisions/010-multi-site-phase1.md
Normal file
@@ -0,0 +1,90 @@
|
||||
# ADR-010: Multi-Site Support in Phase 1
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The original architecture deferred multi-site support to Phase 2, treating
|
||||
Phase 1 as a single-domain replacement for nginx serving only `git.alk.dev`.
|
||||
This was based on the assumption that only one domain needed proxying initially.
|
||||
|
||||
However, `alk.dev` (the bare domain) will need proxying in the near future.
|
||||
While `alk.dev` is a simple case — proxying to a Deno/Fresh container with no
|
||||
special requirements — the proxy must support multiple sites from day one. The
|
||||
config format, routing logic, and TLS certificate provisioning all need
|
||||
multi-site awareness.
|
||||
|
||||
Additionally, `api.alk.dev` is explicitly out of scope (it runs its own
|
||||
HTTP/2+ server natively), but the proxy must not prevent future sites from
|
||||
being added.
|
||||
|
||||
The cost of deferring multi-site is high: we'd need a config format migration,
|
||||
routing logic rewrite, and TLS cert management changes later. Supporting
|
||||
multi-site from the start costs very little — the config format just uses an
|
||||
array of sites (which it already does), host-based routing is trivial in axum,
|
||||
and `rustls-acme` supports multi-domain certificates natively.
|
||||
|
||||
## Decision
|
||||
|
||||
Move multi-site support from Phase 2 into Phase 1. The proxy supports multiple
|
||||
sites from the initial release:
|
||||
|
||||
- `[[sites]]` array in config (already the planned format)
|
||||
- Host-based routing via axum's `Host` extractor (already the planned approach)
|
||||
- Multi-domain ACME certificate provisioning via `rustls-acme`
|
||||
- Each site maps a hostname to an upstream address
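
The hostname-to-upstream mapping in the last bullet can be illustrated with a plain lookup table. This sketch covers only the step after the `Host` value has been extracted; it uses no axum types, and `SiteTable` and `resolve` are hypothetical names. Lowercasing the hostname and stripping a `:port` suffix are assumptions made for the example.

```rust
use std::collections::HashMap;

/// Hypothetical routing table: lowercase hostname -> upstream address.
pub struct SiteTable {
    upstreams: HashMap<String, String>,
}

impl SiteTable {
    /// Build the table from (host, upstream) pairs, as listed under [[sites]].
    pub fn new(sites: &[(&str, &str)]) -> Self {
        let upstreams = sites
            .iter()
            .map(|(host, upstream)| (host.to_ascii_lowercase(), upstream.to_string()))
            .collect();
        Self { upstreams }
    }

    /// Resolve a raw Host header value to an upstream, if the site is configured.
    pub fn resolve(&self, host_header: &str) -> Option<&str> {
        // Drop an optional ":port" suffix. IPv6 literals are not handled in this sketch.
        let host = host_header.split(':').next().unwrap_or(host_header);
        let host = host.to_ascii_lowercase();
        self.upstreams.get(&host).map(String::as_str)
    }
}
```

An unknown host returns `None`; what the proxy should answer in that case is left open here, since the document does not define fallback behaviour.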
|
||||
|
||||
Phase 1 scope becomes:
|
||||
|
||||
1. Multi-site reverse proxy with TLS termination
|
||||
2. ACME certificate management (multi-domain)
|
||||
3. HTTP → HTTPS redirect
|
||||
4. Rate limiting, logging, health check, graceful shutdown
|
||||
5. Systemd integration
|
||||
|
||||
Phase 2 scope shifts to operational hardening:
|
||||
|
||||
1. Per-site rate limits and body limits
|
||||
2. Per-site upstream timeouts
|
||||
3. Metrics endpoint (Prometheus-compatible)
|
||||
4. Connection limits and timeouts
|
||||
5. Log rotation
|
||||
|
||||
Phase 3 remains future enhancements.
|
||||
|
||||
## Rationale
|
||||
|
||||
- The config format already uses `[[sites]]` — no format change needed
|
||||
- Host-based routing is the natural axum pattern and was already planned
|
||||
- `rustls-acme` accepts a list of domains (e.g., `Vec<String>`); multi-domain is its default usage
|
||||
- The cost of adding multi-site later (config migration, routing rewrite,
|
||||
cert management changes) far exceeds the cost of supporting it now (minimal
|
||||
additional complexity)
|
||||
- `alk.dev` is confirmed as a near-term need, not a hypothetical
|
||||
- The proxy's value proposition is being a memory-safe reverse proxy for *our
|
||||
infrastructure*, which has multiple domains
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- No config format migration needed later
|
||||
- `alk.dev` can be added to the config without code changes
|
||||
- TLS cert management handles multiple domains from the start
|
||||
- Eliminates an entire phase of work
|
||||
|
||||
**Negative:**
|
||||
- Slightly more testing surface (must verify correct routing with multiple
|
||||
sites)
|
||||
- Must test multi-domain ACME provisioning (not just single-domain)
|
||||
- Wildcard or fallback site behavior needs to be defined (addressed in
|
||||
OQ-07)
|
||||
|
||||
## References
|
||||
|
||||
- [overview.md](../overview.md)
|
||||
- [config.md](../config.md)
|
||||
- [tls.md](../tls.md)
|
||||
- [proxy.md](../proxy.md)
|
||||
- ADR-002 (custom proxy handler — rationale updated for multi-site)
|
||||
92
docs/architecture/decisions/011-multi-domain-tls.md
Normal file
@@ -0,0 +1,92 @@
|
||||
# ADR-011: Multi-Domain TLS Configuration
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
With multi-site support in Phase 1 (ADR-010), the TLS configuration must
|
||||
support multiple domains. The previous design used a single `tls.acme_domain`
|
||||
string field, which only works for one domain.
|
||||
|
||||
There are several approaches to multi-domain TLS:
|
||||
|
||||
1. **Single ACME config with domain list**: `acme_domains = ["git.alk.dev",
|
||||
"alk.dev"]` — one certificate covering all domains (SAN certificate)
|
||||
2. **Per-site TLS configuration**: Each site entry specifies its own TLS
|
||||
mode (ACME or manual) and domain — more flexible but complex
|
||||
3. **Hybrid**: A global TLS section with ACME domains, plus per-site overrides
|
||||
for manual certificates
|
||||
|
||||
For our use case, all proxied domains use the same ACME certificate authority
|
||||
(Let's Encrypt) and the same challenge type (TLS-ALPN-01). There's no need
|
||||
for per-site TLS configuration in Phase 1.
|
||||
|
||||
## Decision
|
||||
|
||||
Use a single ACME configuration with a list of domains, producing one SAN
|
||||
certificate covering all proxied domains. Manual mode uses certificate file
|
||||
paths (single cert file with all domains, or one cert per domain resolved via
|
||||
SNI).
|
||||
|
||||
The config format changes from the previous single-domain format:
|
||||
|
||||
```toml
|
||||
# Previous (single-domain) format — no longer used
|
||||
[tls]
|
||||
mode = "acme"
|
||||
acme_domain = "git.alk.dev" # single string
|
||||
```
|
||||
|
||||
To the current multi-domain format:
|
||||
|
||||
```toml
|
||||
[server.tls]
|
||||
mode = "acme"
|
||||
acme_domains = ["git.alk.dev", "alk.dev"] # array of strings
|
||||
```
|
||||
|
||||
In ACME mode, `rustls-acme` provisions a single certificate covering all
|
||||
listed domains via Subject Alternative Names (SAN). This is the standard
|
||||
Let's Encrypt approach for multi-domain certificates.
|
||||
|
||||
In manual mode, the cert and key files must cover all domains (either a SAN
|
||||
certificate or separate certificates resolved via SNI).
|
||||
|
||||
## Rationale
|
||||
|
||||
- A single SAN certificate is simpler to manage (one renewal, one cert)
|
||||
- Let's Encrypt supports SAN certificates with up to 100 domains
|
||||
- `rustls-acme` accepts `Vec<String>` for domain lists — this is its natural
|
||||
API
|
||||
- All our domains use the same ACME configuration (Let's Encrypt production,
|
||||
TLS-ALPN-01 challenge)
|
||||
- Per-site TLS overrides add complexity with no current benefit
|
||||
- If per-site TLS configuration is needed later (e.g., a site with a manual
|
||||
cert), it can be added as an optional override without changing the global
|
||||
config structure
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- Single certificate for all domains — simpler renewal, simpler cert management
|
||||
- Matches `rustls-acme`'s natural API (`AcmeConfig::new(domains: Vec<String>)`)
|
||||
- All domains in one cert means SNI resolution is handled by ACME automatically
|
||||
- Config format is a minimal change from single-domain
|
||||
|
||||
**Negative:**
|
||||
- Adding or removing a domain requires re-provisioning the certificate (ACME
|
||||
handles this automatically, but it means cert changes affect all domains)
|
||||
- If one domain fails ACME validation, the entire cert renewal fails (all
|
||||
domains must be validated) — mitigated by Let's Encrypt's domain-level
|
||||
validation
|
||||
- Per-site TLS configuration (e.g., a domain with a manual cert) requires a
|
||||
future config extension (OQ-07)
|
||||
|
||||
## References
|
||||
|
||||
- [tls.md](../tls.md)
|
||||
- [config.md](../config.md)
|
||||
- ADR-010 (multi-site in Phase 1)
|
||||
- ADR-004 (ACME-primary certificate management)
|
||||
@@ -21,8 +21,6 @@ last_updated: 2026-06-11
|
||||
than the current nginx config.
|
||||
- **Cross-references**: ADR-005
|
||||
|
||||
## Logging and Monitoring
|
||||
|
||||
### ~~OQ-02: What log format should fail2ban consume?~~
|
||||
|
||||
- **Origin**: [operations.md](operations.md), [proxy.md](proxy.md)
|
||||
@@ -33,6 +31,22 @@ last_updated: 2026-06-11
|
||||
See ADR-007.
|
||||
- **Cross-references**: ADR-007
|
||||
|
||||
### OQ-07: Should per-site TLS overrides be supported for mixed ACME/manual domains?
|
||||
|
||||
- **Origin**: [tls.md](tls.md), [config.md](config.md)
|
||||
- **Status**: open
|
||||
- **Priority**: low
|
||||
- **Context**: Phase 1 uses a single TLS configuration (ACME or manual) for all
|
||||
domains. All domains share the same ACME config and certificate. If a future
|
||||
domain needs a manual certificate (e.g., a corporate CA cert) while other
|
||||
domains use ACME, a per-site TLS override would be needed. This would require
|
||||
a custom `ResolvesServerCert` that combines ACME-provisioned certs with
|
||||
manually loaded certs. For now, all proxied domains use the same ACME config,
|
||||
so this is not needed.
|
||||
- **Cross-references**: ADR-011
|
||||
|
||||
## Logging and Monitoring
|
||||
|
||||
### OQ-03: Should the health check endpoint be on a separate port?
|
||||
|
||||
- **Origin**: [operations.md](operations.md)
|
||||
@@ -61,15 +75,15 @@ last_updated: 2026-06-11
|
||||
|
||||
## Deployment
|
||||
|
||||
### OQ-05: Should the proxy bind to multiple addresses or just one?
|
||||
### ~~OQ-05: Should the proxy bind to multiple addresses or just one?~~
|
||||
|
||||
- **Origin**: [overview.md](overview.md)
|
||||
- **Status**: open
|
||||
- **Status**: resolved
|
||||
- **Priority**: low
|
||||
- **Context**: Current nginx config binds to a specific IP (`15.235.125.95`).
|
||||
The proposed config uses `bind_addr` which could be any IP. For Phase 1, the
|
||||
config will specify a single IP address. Multi-address binding (listening on
|
||||
multiple IPs) is not needed but could be added as an array of addresses.
|
||||
- **Resolution**: A single `bind_addr` is sufficient. The proxy binds to one
|
||||
explicit IP address (not `0.0.0.0`). Multi-address binding is not needed for
|
||||
this single-server deployment. If needed in the future, `bind_addr` could be
|
||||
extended to an array. See config.md for the `bind_addr` field.
|
||||
- **Cross-references**: None
|
||||
|
||||
## Proxy
|
||||
|
||||
@@ -42,9 +42,10 @@ Requests` and logs the event with structured fields.
|
||||
### State Eviction
|
||||
|
||||
The per-IP token bucket state grows over time as new IPs are seen. A
|
||||
background task runs at a configurable interval (default: 60 seconds) and
|
||||
removes entries that haven't been accessed within the cleanup interval. This
|
||||
prevents unbounded memory growth.
|
||||
background task runs every 60 seconds (configurable) and removes entries
|
||||
whose last access timestamp is older than a configurable eviction age
|
||||
(default: 300 seconds / 5 minutes). This prevents unbounded memory growth
|
||||
while preserving recent entries that may still receive traffic.
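
The distinction between scan interval and eviction age can be made concrete with a small sketch. It assumes the per-IP state lives in a `HashMap` keyed by `IpAddr`; `BucketState` and `evict_stale` are hypothetical names, and the background task that would call this function once per scan interval is not shown.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

/// Hypothetical per-IP state; only the timestamp matters for eviction.
pub struct BucketState {
    pub tokens: f64,
    pub last_access: Instant,
}

/// Remove entries whose last access is older than `eviction_age`.
/// Returns the number of entries removed.
pub fn evict_stale(
    buckets: &mut HashMap<IpAddr, BucketState>,
    now: Instant,
    eviction_age: Duration,
) -> usize {
    let before = buckets.len();
    // saturating_duration_since yields zero if last_access is later than `now`.
    buckets.retain(|_, state| now.saturating_duration_since(state.last_access) <= eviction_age);
    before - buckets.len()
}
```

With the documented defaults, this function would run every 60 seconds with `eviction_age` set to 300 seconds, so an idle entry is removed at most about one scan interval after it crosses the age threshold.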
|
||||
|
||||
### Fail2ban Integration
|
||||
|
||||
@@ -55,7 +56,7 @@ format decision.
|
||||
The log format uses `key=value` pairs with a `RATE_LIMIT` prefix:
|
||||
|
||||
```
|
||||
RATE_LIMIT client_ip=X.X.X.X host=Y.Z path=/W status=429
|
||||
RATE_LIMIT client_ip=203.0.113.50 host=Y.Z path=/W status=429
|
||||
```
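
A line in this format could be produced by a helper along these lines. The function name `rate_limit_line` is hypothetical, and the sketch assumes `host` and `path` contain no spaces; values with spaces would need quoting, as the `error="connection refused"` field in the event logs suggests.

```rust
use std::net::IpAddr;

/// Format a rate-limit event as `key=value` pairs with the RATE_LIMIT prefix.
/// The status is fixed at 429 because this event is only logged on rejection.
pub fn rate_limit_line(client_ip: IpAddr, host: &str, path: &str) -> String {
    format!("RATE_LIMIT client_ip={client_ip} host={host} path={path} status=429")
}
```

Keeping `client_ip` as the first field after the prefix gives a fail2ban filter a fixed position to match the address against.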
|
||||
|
||||
A corresponding fail2ban filter and jail configuration are provided as part
|
||||
@@ -71,15 +72,15 @@ log entries:
|
||||
1. **Access logs**: Every proxied request is logged at `info` level with
|
||||
structured fields.
|
||||
|
||||
```
|
||||
REQUEST client_ip=1.2.3.4 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
|
||||
```
|
||||
```
|
||||
REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
|
||||
```
|
||||
|
||||
2. **Event logs**: Rate limits, TLS errors, upstream failures, config reloads,
|
||||
etc.
|
||||
|
||||
```
|
||||
RATE_LIMIT client_ip=1.2.3.4 host=git.alk.dev path=/login status=429
|
||||
RATE_LIMIT client_ip=203.0.113.50 host=git.alk.dev path=/login status=429
|
||||
UPSTREAM_ERROR host=git.alk.dev upstream=127.0.0.1:3000 error="connection refused"
|
||||
CONFIG_RELOAD status=success sites=1
|
||||
```
|
||||
|
||||
@@ -8,10 +8,12 @@ last_updated: 2026-06-11
|
||||
## Vision
|
||||
|
||||
A memory-safe, minimal reverse proxy that replaces our vulnerable nginx instance
|
||||
for forward-proxying to backend services. The proxy terminates TLS, injects
|
||||
for forwarding requests to backend services. The proxy terminates TLS, injects
|
||||
standard proxy headers, enforces rate limits, and forwards requests to upstream
|
||||
services — with operational feature parity for our current single-domain Gitea
|
||||
setup.
|
||||
services — supporting multiple domains from initial release.
|
||||
|
||||
This project is open source under dual licensing: MIT OR Apache-2.0, consistent
|
||||
with standard Rust project licensing.
|
||||
|
||||
## Why This Exists
|
||||
|
||||
@@ -35,38 +37,44 @@ details.
|
||||
|
||||
### In Scope
|
||||
|
||||
- **Phase 1**: Replace nginx for `git.alk.dev` with feature parity
|
||||
- TLS termination with ACME (Let's Encrypt) certificate management
|
||||
- **Phase 1**: Multi-site reverse proxy with TLS termination
|
||||
- TLS termination with ACME (Let's Encrypt) multi-domain certificate management
|
||||
- Manual certificate paths as fallback mode
|
||||
- HTTP → HTTPS redirect
|
||||
- Reverse proxy to Gitea at `127.0.0.1:3000`
|
||||
- Host-based routing to multiple upstream services
|
||||
- Reverse proxy to Gitea at `127.0.0.1:3000` (git.alk.dev)
|
||||
- Reverse proxy to Deno/Fresh container for alk.dev (simple pass-through)
|
||||
- Proxy header injection (Host, X-Real-IP, X-Forwarded-For, X-Forwarded-Proto)
|
||||
- Request rate limiting with fail2ban-compatible logging (global per-IP; per-site in Phase 2)
|
||||
- 100 MB body size limit (global; per-site in Phase 2)
|
||||
- Request rate limiting with fail2ban-compatible logging (global per-IP)
|
||||
- 100 MB body size limit (global)
|
||||
- Configurable bind address (no `0.0.0.0` default)
|
||||
- Health check endpoint
|
||||
- Graceful shutdown (SIGTERM handling)
|
||||
- Systemd unit file
|
||||
- Dual licensing: MIT OR Apache-2.0
|
||||
|
||||
- **Phase 2**: Multi-site support
|
||||
- SNI-based TLS routing for multiple domains
|
||||
- Config file for site definitions
|
||||
- Dynamic config reload (ArcSwap pattern)
|
||||
|
||||
- **Phase 3**: Operational hardening
|
||||
- **Phase 2**: Operational hardening
|
||||
- Per-site rate limits and body limits
|
||||
- Per-site upstream timeouts
|
||||
- Metrics endpoint (Prometheus-compatible)
|
||||
- Connection limits and timeouts
|
||||
- Log rotation
|
||||
|
||||
- **Phase 3**: Future enhancements
|
||||
- Wildcard subdomain support
|
||||
- Per-site TLS overrides (manual certs for specific domains)
|
||||
- Unix domain socket config reload API
|
||||
|
||||
### Out of Scope
|
||||
|
||||
- HTTP/2 or HTTP/3 proxying (services that need these run their own native
|
||||
Rust servers — e.g., `api.alk.dev`)
|
||||
Rust servers — e.g., `api.alk.dev` runs its own HTTP/2+ server)
|
||||
- Load balancing or round-robin upstream selection
|
||||
- WebSocket proxying (can be added later if needed)
|
||||
- Static file serving
|
||||
- Access control beyond rate limiting (no auth, no IP allowlists in Phase 1)
|
||||
- CGI, SCGI, uWSGI, FastCGI
|
||||
- Per-site TLS configuration (all domains share one ACME config in Phase 1)
|
||||
|
||||
## Architecture
|
||||
|
||||
@@ -81,11 +89,14 @@ bind_addr:80 ──► │ HTTP listener → 301 redirect │
|
||||
│ │
|
||||
bind_addr:443 ──► │ TLS listener (tokio-rustls) │
|
||||
│ ├─ ACME mode: rustls-acme resolver │
|
||||
│ │ (auto cert provisioning/renewal) │
|
||||
│ │ (multi-domain SAN cert, │
|
||||
│ │ auto-provision & renew) │
|
||||
│ └─ Manual mode: cert/key file paths │
|
||||
│ │
|
||||
│ axum router │
|
||||
│ ├─ Host-based routing │
|
||||
│ │ ├─ git.alk.dev → :3000 │
|
||||
│ │ └─ alk.dev → :8080 │
|
||||
│ ├─ Rate limiting middleware │
|
||||
│ ├─ Proxy header injection │
|
||||
│ ├─ Body size limit (100MB) │
|
||||
@@ -147,7 +158,7 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
|
||||
| ADR | Decision | Summary |
|
||||
|-----|----------|---------|
|
||||
| [001](decisions/001-rust-axum.md) | Rust with axum | Memory safety eliminates the bug class causing nginx CVEs; axum provides ergonomic tower integration |
|
||||
| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — axum-reverse-proxy adds unnecessary complexity |
|
||||
| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream per domain — simpler than a general proxy library |
|
||||
| [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
|
||||
| [004](decisions/004-rustls-acme.md) | ACME-primary certificate management | Eliminates certbot dependency; automatic provisioning and renewal |
|
||||
| [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly, not axum-server | Full control over TLS config, ACME resolver integration, cipher suite configuration |
|
||||
@@ -155,6 +166,8 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
|
||||
| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
|
||||
| [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config with ArcSwap | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
|
||||
| [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP |
|
||||
| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release; avoids config migration later |
|
||||
| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme |
|
||||
|
||||
## Open Questions
|
||||
|
||||
@@ -163,4 +176,4 @@ questions affecting this document:
|
||||
|
||||
- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
|
||||
- **OQ-03**: Should the health check endpoint be on a separate port? (open)
|
||||
- **OQ-05**: Should the proxy bind to multiple addresses or just one? (open)
|
||||
- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual domains? (open)
|
||||
@@ -14,8 +14,9 @@ injection, body size limits), and forwards it to the upstream service.
|
||||
## Why It Exists
|
||||
|
||||
This component replaces nginx's `proxy_pass` directive. For our use case —
|
||||
single upstream per domain, no load balancing, no HTTP/2 proxying — a custom
|
||||
handler is simpler and more maintainable than a general-purpose proxy library.
|
||||
one upstream per domain across multiple domains, no load balancing, no HTTP/2
|
||||
proxying — a custom handler is simpler and more maintainable than a
|
||||
general-purpose proxy library (ADR-002, ADR-010).
|
||||
|
||||
## Architecture
|
||||
|
||||
@@ -140,9 +141,9 @@ services typically run on the same host (e.g., `127.0.0.1:3000`). The
|
||||
`upstream_scheme` field in each site's configuration allows specifying `https://`
|
||||
for upstreams that require TLS (e.g., separate hosts or secure internal services).
|
||||
|
||||
For the initial deployment (`git.alk.dev` → `127.0.0.1:3000`), the upstream
|
||||
connection uses plain HTTP, as TLS between the proxy and Gitea on loopback is
|
||||
unnecessary.
|
||||
For the initial deployment, upstream connections use plain HTTP (e.g.,
|
||||
`git.alk.dev` → `127.0.0.1:3000`, `alk.dev` → `127.0.0.1:8080`) since TLS
|
||||
between the proxy and backend services on loopback is unnecessary.
|
||||
|
||||
## Body Size Limit
|
||||
|
||||
@@ -157,8 +158,9 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 
 | ADR | Decision | Summary |
 |-----|----------|---------|
-| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — simpler than a general proxy library |
+| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | One upstream per domain — simpler than a general proxy library |
 | [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
+| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
 
 ## Open Questions
 
@@ -57,10 +57,11 @@ no deploy hooks.
 
 **How it works:**
 
-1. `AcmeCertProvider` configures the ACME client with the domain, cache
+1. `AcmeCertProvider` configures the ACME client with the domain list, cache
 directory, and Let's Encrypt directory (staging or production).
-2. `AcmeConfig::new(vec![domain])` creates an ACME configuration for the
-domain.
+2. `AcmeConfig::new(domains)` creates an ACME configuration for all listed
+domains. Let's Encrypt will issue a single SAN certificate covering all
+domains.
 3. The ACME state machine runs as a background tokio task, handling:
 - Account registration with Let's Encrypt
 - Certificate ordering
@@ -75,9 +76,9 @@ no deploy hooks.
 **Configuration:**
 
 ```toml
-[tls]
+[server.tls]
 mode = "acme"
-acme_domain = "git.alk.dev"
+acme_domains = ["git.alk.dev", "alk.dev"]
 acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
 acme_directory = "production" # or "staging" for testing
 ```
@@ -100,13 +101,8 @@ key_path = "/etc/letsencrypt/live/git.alk.dev/privkey.pem"
 ```
 
 Certificate files are loaded once at startup using `rustls_pemfile`. Manual
-mode requires a restart to pick up new certificates.
-
-**Why not hot-reload manual certs?** ACME mode handles renewal automatically.
-Manual mode is for cases where you control cert rotation externally (certbot,
-manual renewal). In that case, a SIGHUP-triggered restart is simpler and more
-reliable than file watching. If zero-downtime cert rotation is needed, use ACME
-mode.
+mode requires a restart to pick up new certificates. See ADR-004 for the
+rationale behind making ACME the primary mode and manual mode restart-dependent.
 
 ## TLS Configuration
 
@@ -142,10 +138,13 @@ restrict cipher suites beyond rustls defaults.
 ### ServerConfig Construction
 
 For manual mode, the `ServerConfig` is built with `with_no_client_auth()` and
-`with_single_cert()`, loading the certificate chain and private key from disk.
+a custom `ResolvesServerCert` implementation that maps SNI hostnames to
+certificate/key pairs loaded from disk.
 
 For ACME mode, the `ServerConfig` is built with `with_cert_resolver()`, passing
-the `ResolvesServerCertAcme` resolver. The ACME TLS-ALPN-01 protocol identifier
+the `ResolvesServerCertAcme` resolver. The ACME configuration includes all
+domains listed in `acme_domains`, and the resolver manages a single SAN
+certificate covering all of them. The ACME TLS-ALPN-01 protocol identifier
 (`acme-tls/1`) must be registered in the `alpn_protocols` list so the server
 can respond to TLS-ALPN-01 challenges.
 
@@ -154,28 +153,39 @@ versions (TLS 1.2 and TLS 1.3).
 
 ## SNI-Based Certificate Selection
 
-### Current (Single Domain)
+### ACME Mode (Multi-Domain)
 
-For single-domain setups, SNI selection is trivial: there's only one
-certificate, so `with_single_cert()` or `ResolvesServerCertAcme` (which
-handles the domain) is sufficient.
-
-### Future (Multi-Domain)
-
-When multiple domains are served, SNI selection works as follows:
+In ACME mode, `rustls-acme` manages a single SAN certificate covering all
+configured domains. The `ResolvesServerCertAcme` resolver automatically serves
+the correct certificate during the TLS handshake.
 
 1. **TLS handshake**: The client sends the SNI extension indicating which
 hostname it's connecting to.
-2. **Certificate resolution**: In ACME mode, `ResolvesServerCertAcme` handles
-this automatically — it stores certificates keyed by domain. In manual mode,
-a custom `ResolvesServerCert` implementation maps SNI hostname to the
-correct `CertifiedKey`.
+2. **Certificate resolution**: `ResolvesServerCertAcme` matches the SNI
+hostname against the provisioned certificate's Subject Alternative Names
+and serves the certificate.
 3. **HTTP routing**: After the TLS handshake, axum's `Host` extractor routes
 the request to the correct site handler based on the `Host` header.
 
 This is the same pattern nginx uses — SNI selects the cert during TLS, then
-`Host` header selects the server block. In manual mode, a `ResolvesServerCert`
-implementation maps SNI hostname to the correct `CertifiedKey`.
+`Host` header selects the server block. ACME mode handles this automatically
+through the cert resolver.
 
+### Manual Mode (Multi-Domain)
+
+In manual mode, a custom `ResolvesServerCert` implementation is required to
+map SNI hostnames to the correct `CertifiedKey`. This implementation:
+
+1. Loads certificate files at startup (or on SIGHUP for reload)
+2. Maps each domain name to its certificate chain and private key
+3. During the TLS handshake, looks up the SNI hostname and returns the
+matching `CertifiedKey`
+
+The custom resolver must handle the case where no matching certificate exists
+for the SNI hostname — in this case, the handshake fails, which is the
+correct behavior (we don't serve a default certificate for unknown domains).
+
+See [open-questions.md](open-questions.md) OQ-07 for per-site TLS overrides.
+
 ## HTTP Listener (Port 80)
 
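The lookup behaviour that the Manual Mode (Multi-Domain) text in the hunk above describes can be sketched with the standard library alone. This is an illustrative sketch, not the project's resolver: `CertEntry` and `SniCertMap` are hypothetical names, and `CertEntry` is only a stand-in for the role that rustls's `CertifiedKey` would play behind a real `ResolvesServerCert` implementation.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a loaded certificate chain and private key.
/// In the real proxy this role would be played by rustls's `CertifiedKey`.
#[derive(Debug, Clone, PartialEq)]
pub struct CertEntry {
    pub cert_path: String,
    pub key_path: String,
}

/// Hypothetical SNI lookup table with one entry per configured domain.
#[derive(Debug, Default)]
pub struct SniCertMap {
    by_host: HashMap<String, CertEntry>,
}

impl SniCertMap {
    /// Registers a certificate for a hostname. Hostnames are stored in
    /// lowercase because DNS names are case-insensitive.
    pub fn insert(&mut self, host: &str, entry: CertEntry) {
        self.by_host.insert(host.to_ascii_lowercase(), entry);
    }

    /// Returns the entry for the SNI hostname, or `None` when the client
    /// sent no SNI or the hostname is not configured. A `None` result means
    /// the handshake should fail; there is deliberately no default entry.
    pub fn resolve(&self, sni: Option<&str>) -> Option<&CertEntry> {
        let host = sni?.to_ascii_lowercase();
        self.by_host.get(&host)
    }
}
```

Returning `None` for an unknown or missing hostname mirrors the stated design choice of not serving a default certificate for unknown domains.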
@@ -211,6 +221,8 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 |-----|----------|---------|
 | [004](decisions/004-rustls-acme.md) | ACME-primary cert management | Eliminates certbot; automatic provisioning and renewal |
 | [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly | Full control over TLS config and ACME resolver integration |
+| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
+| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme |
 
 ## Open Questions
 
@@ -218,3 +230,5 @@ Open questions are tracked in [open-questions.md](open-questions.md). Key
 questions affecting this document:
 
 - **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
+- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual
+domains? (open)