From 7efc142406879172259ba37dcfd91a734b7455d4 Mon Sep 17 00:00:00 2001 From: "glm-5.1" Date: Thu, 11 Jun 2026 08:50:03 +0000 Subject: [PATCH] Expand architecture: multi-site Phase 1, multi-domain TLS, fix review issues MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Promote multi-site support from Phase 2 to Phase 1 (ADR-010): the proxy must support git.alk.dev and alk.dev from initial release. Add multi-domain TLS configuration (ADR-011): acme_domains array replaces acme_domain string, single SAN certificate via rustls-acme. Key changes: - ADR-010: Multi-site in Phase 1 — avoids config format migration later - ADR-011: Multi-domain TLS — single SAN cert, acme_domains Vec - ADR-002: Updated rationale for multi-site (one upstream per domain) - overview.md: Phase 1 now includes multi-site, alk.dev pass-through, dual licensing (MIT OR Apache-2.0), real IP removed - config.md: acme_domain → acme_domains, TOML example shows both sites, validation adds unique host check, real IP replaced with 203.0.113.10 - tls.md: Multi-domain SNI section moved from Future to current, manual mode uses ResolvesServerCert for SNI mapping, TOML header fixed - proxy.md: Updated for multi-site, removed single-domain language - operations.md: RFC 5737 documentation IPs, clarified rate limit eviction semantics (distinct scan interval vs eviction age) - open-questions.md: OQ-05 resolved (single bind_addr sufficient), new OQ-07 (per-site TLS overrides) Review fixes: - acme_domains (plural) consistently used across all docs and diagram - ADR-011 clearly scopes acme_domain as previous design - Inline decision rationale extracted: tls.md hot-reload → ADR-004 ref, config.md static/dynamic → ADR-008 ref - TOML section headers consistent (server.tls) --- docs/architecture/README.md | 9 +- docs/architecture/config.md | 43 +++++---- .../decisions/002-custom-proxy-handler.md | 10 +- .../decisions/010-multi-site-phase1.md | 90 ++++++++++++++++++ 
.../decisions/011-multi-domain-tls.md | 92 +++++++++++++++++++ docs/architecture/open-questions.md | 30 ++++-- docs/architecture/operations.md | 17 ++-- docs/architecture/overview.md | 87 ++++++++++-------- docs/architecture/proxy.md | 14 +-- docs/architecture/tls.md | 72 +++++++++------ 10 files changed, 356 insertions(+), 108 deletions(-) create mode 100644 docs/architecture/decisions/010-multi-site-phase1.md create mode 100644 docs/architecture/decisions/011-multi-domain-tls.md diff --git a/docs/architecture/README.md b/docs/architecture/README.md index a3c1ac6..effd806 100644 --- a/docs/architecture/README.md +++ b/docs/architecture/README.md @@ -14,6 +14,10 @@ memory-safe Rust/axum reverse proxy. The primary motivation is CVE-2026-42945 (unauthenticated RCE in nginx's rewrite module) and the broader pattern of memory corruption bugs in nginx's C codebase. +The proxy supports multiple domains from initial release (git.alk.dev and +alk.dev), with per-domain host-based routing and a single multi-domain SAN +certificate via ACME. + ## Architecture Documents | Document | Status | Description | @@ -37,6 +41,8 @@ memory corruption bugs in nginx's C codebase. | [007](decisions/007-custom-log-format.md) | Custom Structured Log Format | Accepted | | [008](decisions/008-static-dynamic-config-split.md) | Static/Dynamic Config Split with ArcSwap | Accepted | | [009](decisions/009-signal-handling.md) | Signal Handling Strategy | Accepted | +| [010](decisions/010-multi-site-phase1.md) | Multi-Site Support in Phase 1 | Accepted | +| [011](decisions/011-multi-domain-tls.md) | Multi-Domain TLS Configuration | Accepted | ## Open Questions @@ -48,8 +54,9 @@ See [open-questions.md](open-questions.md) for the full tracker. | ~~OQ-02~~ | ~~What log format should fail2ban consume?~~ | ~~high~~ | **resolved** (ADR-007) | | OQ-03 | Should the health check endpoint be on a separate port? | low | open | | OQ-04 | Config reload: SIGHUP only or also Unix socket API? 
| low | open | -| OQ-05 | Should the proxy bind to multiple addresses? | low | open | +| ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) | | OQ-06 | Should upstream timeouts be configurable per-site? | low | open | +| OQ-07 | Should per-site TLS overrides be supported for mixed ACME/manual domains? | low | open | ## Document Lifecycle diff --git a/docs/architecture/config.md b/docs/architecture/config.md index ab84267..4609a3d 100644 --- a/docs/architecture/config.md +++ b/docs/architecture/config.md @@ -39,7 +39,7 @@ config.toml │ http_port │ │ rate_limit │ │ https_port │ │ body_limit │ │ tls.mode │ │ proxy_headers │ -│ tls.acme_domain │ │ │ +│ tls.acme_domains │ │ │ │ tls.cert_path │ │ ← ArcSwap → │ │ tls.key_path │ │ ConfigReloadHandle │ │ tls.cache_dir │ │ .reload(new_config) │ @@ -59,11 +59,11 @@ Immutable after startup. Changes require a process restart. | Field | Type | Description | |-------|------|-------------| -| `bind_addr` | `String` | IP address to bind to (e.g., `"15.235.125.95"`) | +| `bind_addr` | `String` | IP address to bind to (must be explicit, no `0.0.0.0`) | | `http_port` | `u16` | Port for HTTP→HTTPS redirect (default: `80`; set to `0` to disable) | | `https_port` | `u16` | Port for TLS listener (default: `443`) | | `tls.mode` | `"acme"` or `"manual"` | Certificate provisioning mode | -| `tls.acme_domain` | `String` | Domain for ACME (ACME mode only) | +| `tls.acme_domains` | `Vec<String>` | Domains for ACME SAN certificate (ACME mode only) | | `tls.acme_cache_dir` | `String` | ACME state cache directory | | `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory | | `tls.cert_path` | `String` | Certificate file path (manual mode only) | @@ -71,9 +71,10 @@ Immutable after startup. Changes require a process restart.
| `log_level` | `"trace"`, `"debug"`, `"info"`, `"warn"`, `"error"` | Logging verbosity | | `log_format` | `"text"` or `"json"` | Log output format | -**Why these are static:** Changing bind addresses, ports, or TLS mode requires -creating new listeners and TLS configurations — operations that fundamentally -require a restart. There's no safe way to change these at runtime. +**Why these are static:** See ADR-008 for the rationale behind the +static/dynamic split. In summary: changing bind addresses, ports, or TLS mode +requires creating new listeners and TLS configurations — operations that +fundamentally require a restart. ### DynamicConfig @@ -95,10 +96,10 @@ connections immediately. | `upstream` | `String` | Upstream address (e.g., `"127.0.0.1:3000"`) | | `upstream_scheme` | `"http"` or `"https"` | Protocol for upstream connection (default: `"http"`) | -**Why these are dynamic:** Site definitions and rate limits are per-request -concerns. Adding a site or changing a rate limit should not require restarting -the proxy and dropping active connections. Rate limits and body limits are -global settings in Phase 1; per-site configuration for these may be added in +**Why these are dynamic:** See ADR-008 for the rationale. Site definitions +and rate limits are per-request concerns that should not require restarting +the proxy or dropping active connections. Rate limits and body limits are +global settings in Phase 1; per-site configuration for these is deferred to Phase 2. ## Config Reload @@ -136,13 +137,13 @@ config reload, but SIGHUP is sufficient for Phase 1. 
# reverse-proxy config [server] -bind_addr = "15.235.125.95" +bind_addr = "203.0.113.10" # Replace with actual bind address http_port = 80 https_port = 443 [server.tls] mode = "acme" # "acme" or "manual" -acme_domain = "git.alk.dev" +acme_domains = ["git.alk.dev", "alk.dev"] acme_cache_dir = "/var/lib/reverse-proxy/acme-cache" acme_directory = "production" # "production" or "staging" @@ -166,6 +167,11 @@ limit_bytes = 104857600 # 100 MB host = "git.alk.dev" upstream = "127.0.0.1:3000" upstream_scheme = "http" + +[[sites]] +host = "alk.dev" +upstream = "127.0.0.1:8080" +upstream_scheme = "http" ``` ### Validation @@ -173,12 +179,13 @@ upstream_scheme = "http" On startup, the config is validated: 1. `bind_addr` is not `0.0.0.0` (must be explicit) -2. In ACME mode, `acme_domain` must be set +2. In ACME mode, `acme_domains` must be non-empty 3. In manual mode, `cert_path` and `key_path` must both be set and the files must be readable 4. Each site must have a `host` and `upstream` -5. `rate_limit.requests_per_second` must be > 0 -6. `body.limit_bytes` must be > 0 +5. Site `host` values must be unique (no duplicate hostnames) +6. `rate_limit.requests_per_second` must be > 0 +7. `body.limit_bytes` must be > 0 On SIGHUP reload, the same validation applies. If the new config fails validation, the reload is rejected and the old config remains active. An error @@ -196,6 +203,8 @@ All design decisions are documented as ADRs in [decisions/](decisions/). 
|-----|----------|---------| | [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support | | [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config split | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap | +| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release | +| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains | ## Open Questions @@ -203,4 +212,6 @@ Open questions are tracked in [open-questions.md](open-questions.md). Key questions affecting this document: - **OQ-04**: Should config reload support a Unix domain socket API in addition - to SIGHUP? (open) \ No newline at end of file + to SIGHUP? (open) +- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual + domains? (open) \ No newline at end of file diff --git a/docs/architecture/decisions/002-custom-proxy-handler.md b/docs/architecture/decisions/002-custom-proxy-handler.md index 159a509..089a3c2 100644 --- a/docs/architecture/decisions/002-custom-proxy-handler.md +++ b/docs/architecture/decisions/002-custom-proxy-handler.md @@ -16,8 +16,9 @@ available: 2. **Custom handler** (Felix Knorr pattern): Build a handler using hyper's `Client` to forward requests. ~50-100 lines of Rust for our needs. -Our use case is minimal: single upstream per domain, single domain, no load -balancing, no retry, no HTTP/2 proxying. +Our use case is minimal: single upstream per domain, no load balancing, no +retry, no HTTP/2 proxying. While the proxy supports multiple domains +(ADR-010), each domain routes to exactly one upstream. ## Decision @@ -31,6 +32,8 @@ project's channel proxy. 
path-based routing to multiple backends) - Our proxy case is the simplest possible: match a Host header, forward the entire request to a single upstream, stream the response back +- Multi-domain support (ADR-010) doesn't change this — each domain still maps + to one upstream - The Felix Knorr pattern is proven, idiomatic, and ~50-100 lines - We maintain full control over header injection, error handling, and upstream connection behavior @@ -46,11 +49,12 @@ project's channel proxy. **Negative:** - We implement and maintain proxy logic ourselves (but it's trivial for our - use case) + use case — each domain maps to one upstream) - If requirements grow to load balancing or retry, we'd need to add that ourselves or switch to `axum-reverse-proxy` ## References - [proxy.md](../proxy.md) +- [ADR-010](010-multi-site-phase1.md) (multi-site in Phase 1) - Felix Knorr, "Replacing nginx with axum" (felix-knorr.net/posts/2024-10-13-replacing-nginx-with-axum.html) \ No newline at end of file diff --git a/docs/architecture/decisions/010-multi-site-phase1.md b/docs/architecture/decisions/010-multi-site-phase1.md new file mode 100644 index 0000000..32e17f2 --- /dev/null +++ b/docs/architecture/decisions/010-multi-site-phase1.md @@ -0,0 +1,90 @@ +# ADR-010: Multi-Site Support in Phase 1 + +## Status + +Accepted + +## Context + +The original architecture phased multi-site support into Phase 2, treating +Phase 1 as a single-domain replacement for nginx serving only `git.alk.dev`. +This was based on the assumption that only one domain needed proxying initially. + +However, `alk.dev` (the bare domain) will need proxying in the near future. +While `alk.dev` is a simple case — proxying to a Deno/Fresh container with no +special requirements — the proxy must support multiple sites from day one. The +config format, routing logic, and TLS certificate provisioning all need +multi-site awareness. 
+ +Additionally, `api.alk.dev` is explicitly out of scope (it runs its own +HTTP/2+ server natively), but the proxy must not prevent future sites from +being added. + +The cost of deferring multi-site is high: we'd need a config format migration, +routing logic rewrite, and TLS cert management changes later. Supporting +multi-site from the start costs very little — the config format just uses an +array of sites (which it already does), host-based routing is trivial in axum, +and `rustls-acme` supports multi-domain certificates natively. + +## Decision + +Move multi-site support from Phase 2 into Phase 1. The proxy supports multiple +sites from the initial release: + +- `[[sites]]` array in config (already the planned format) +- Host-based routing via axum's `Host` extractor (already the planned approach) +- Multi-domain ACME certificate provisioning via `rustls-acme` +- Each site maps a hostname to an upstream address + +Phase 1 scope becomes: + +1. Multi-site reverse proxy with TLS termination +2. ACME certificate management (multi-domain) +3. HTTP → HTTPS redirect +4. Rate limiting, logging, health check, graceful shutdown +5. Systemd integration + +Phase 2 scope shifts to operational hardening: + +1. Per-site rate limits and body limits +2. Per-site upstream timeouts +3. Metrics endpoint (Prometheus-compatible) +4. Connection limits and timeouts +5. Log rotation + +Phase 3 remains future enhancements. 
+ +## Rationale + +- The config format already uses `[[sites]]` — no format change needed +- Host-based routing is the natural axum pattern and was already planned +- `rustls-acme` accepts `Vec<String>` — multi-domain is its default usage +- The cost of adding multi-site later (config migration, routing rewrite, + cert management changes) far exceeds the cost of supporting it now (zero + additional complexity) +- `alk.dev` is confirmed as a near-term need, not a hypothetical +- The proxy's value proposition is being a memory-safe reverse proxy for *our + infrastructure*, which has multiple domains + +## Consequences + +**Positive:** +- No config format migration needed later +- `alk.dev` can be added to the config without code changes +- TLS cert management handles multiple domains from the start +- Eliminates an entire phase of work + +**Negative:** +- Slightly more testing surface (must verify correct routing with multiple + sites) +- Must test multi-domain ACME provisioning (not just single-domain) +- Wildcard or fallback site behavior needs to be defined (addressed in + OQ-07) + +## References + +- [overview.md](../overview.md) +- [config.md](../config.md) +- [tls.md](../tls.md) +- [proxy.md](../proxy.md) +- ADR-002 (custom proxy handler — rationale updated for multi-site) \ No newline at end of file diff --git a/docs/architecture/decisions/011-multi-domain-tls.md b/docs/architecture/decisions/011-multi-domain-tls.md new file mode 100644 index 0000000..344dfc3 --- /dev/null +++ b/docs/architecture/decisions/011-multi-domain-tls.md @@ -0,0 +1,92 @@ +# ADR-011: Multi-Domain TLS Configuration + +## Status + +Accepted + +## Context + +With multi-site support in Phase 1 (ADR-010), the TLS configuration must +support multiple domains. The previous design used a single `tls.acme_domain` +string field, which only works for one domain. + +There are several approaches to multi-domain TLS: + +1. 
**Single ACME config with domain list**: `acme_domains = ["git.alk.dev", + "alk.dev"]` — one certificate covering all domains (SAN certificate) +2. **Per-site TLS configuration**: Each site entry specifies its own TLS + mode (ACME or manual) and domain — more flexible but complex +3. **Hybrid**: A global TLS section with ACME domains, plus per-site overrides + for manual certificates + +For our use case, all proxied domains use the same ACME certificate authority +(Let's Encrypt) and the same challenge type (TLS-ALPN-01). There's no need +for per-site TLS configuration in Phase 1. + +## Decision + +Use a single ACME configuration with a list of domains, producing one SAN +certificate covering all proxied domains. Manual mode uses certificate file +paths (single cert file with all domains, or one cert per domain resolved via +SNI). + +The config format changes from the previous single-domain format: + +```toml +# Previous (single-domain) format — no longer used +[tls] +mode = "acme" +acme_domain = "git.alk.dev" # single string +``` + +To the current multi-domain format: + +```toml +[tls] +mode = "acme" +acme_domains = ["git.alk.dev", "alk.dev"] # array of strings +``` + +In ACME mode, `rustls-acme` provisions a single certificate covering all +listed domains via Subject Alternative Names (SAN). This is the standard +Let's Encrypt approach for multi-domain certificates. + +In manual mode, the cert and key files must cover all domains (either a SAN +certificate or separate certificates resolved via SNI). 
+ +## Rationale + +- A single SAN certificate is simpler to manage (one renewal, one cert) +- Let's Encrypt supports SAN certificates with up to 100 domains +- `rustls-acme` accepts `Vec<String>` for domain lists — this is its natural + API +- All our domains use the same ACME configuration (Let's Encrypt production, + TLS-ALPN-01 challenge) +- Per-site TLS overrides add complexity with no current benefit +- If per-site TLS configuration is needed later (e.g., a site with a manual + cert), it can be added as an optional override without changing the global + config structure + +## Consequences + +**Positive:** +- Single certificate for all domains — simpler renewal, simpler cert management +- Matches `rustls-acme`'s natural API (`AcmeConfig::new(domains: Vec<String>)`) +- All domains in one cert means SNI resolution is handled by ACME automatically +- Config format is a minimal change from single-domain + +**Negative:** +- Adding or removing a domain requires re-provisioning the certificate (ACME + handles this automatically, but it means cert changes affect all domains) +- If one domain fails ACME validation, the entire cert renewal fails (all + domains must be validated) — mitigated by Let's Encrypt's domain-level + validation +- Per-site TLS configuration (e.g., a domain with a manual cert) requires a + future config extension (OQ-07) + +## References + +- [tls.md](../tls.md) +- [config.md](../config.md) +- ADR-010 (multi-site in Phase 1) +- ADR-004 (ACME-primary certificate management) \ No newline at end of file diff --git a/docs/architecture/open-questions.md b/docs/architecture/open-questions.md index 8226a2a..6b6fc2a 100644 --- a/docs/architecture/open-questions.md +++ b/docs/architecture/open-questions.md @@ -21,8 +21,6 @@ last_updated: 2026-06-11 than the current nginx config. 
- **Cross-references**: ADR-005 -## Logging and Monitoring - ### ~~OQ-02: What log format should fail2ban consume?~~ - **Origin**: [operations.md](operations.md), [proxy.md](proxy.md) @@ -33,6 +31,22 @@ last_updated: 2026-06-11 See ADR-007. - **Cross-references**: ADR-007 +### OQ-07: Should per-site TLS overrides be supported for mixed ACME/manual domains? + +- **Origin**: [tls.md](tls.md), [config.md](config.md) +- **Status**: open +- **Priority**: low +- **Context**: Phase 1 uses a single TLS configuration (ACME or manual) for all + domains. All domains share the same ACME config and certificate. If a future + domain needs a manual certificate (e.g., a corporate CA cert) while other + domains use ACME, a per-site TLS override would be needed. This would require + a custom `ResolvesServerCert` that combines ACME-provisioned certs with + manually loaded certs. For now, all proxied domains use the same ACME config, + so this is not needed. +- **Cross-references**: ADR-011 + +## Logging and Monitoring + ### OQ-03: Should the health check endpoint be on a separate port? - **Origin**: [operations.md](operations.md) @@ -61,15 +75,15 @@ last_updated: 2026-06-11 ## Deployment -### OQ-05: Should the proxy bind to multiple addresses or just one? +### ~~OQ-05: Should the proxy bind to multiple addresses or just one?~~ - **Origin**: [overview.md](overview.md) -- **Status**: open +- **Status**: resolved - **Priority**: low -- **Context**: Current nginx config binds to a specific IP (`15.235.125.95`). - The proposed config uses `bind_addr` which could be any IP. For Phase 1, the - config will specify a single IP address. Multi-address binding (listening on - multiple IPs) is not needed but could be added as an array of addresses. +- **Resolution**: A single `bind_addr` is sufficient. The proxy binds to one + explicit IP address (not `0.0.0.0`). Multi-address binding is not needed for + this single-server deployment. 
If needed in the future, `bind_addr` could be + extended to an array. See config.md for the `bind_addr` field. - **Cross-references**: None ## Proxy diff --git a/docs/architecture/operations.md b/docs/architecture/operations.md index 630dcff..625039b 100644 --- a/docs/architecture/operations.md +++ b/docs/architecture/operations.md @@ -42,9 +42,10 @@ Requests` and logs the event with structured fields. ### State Eviction The per-IP token bucket state grows over time as new IPs are seen. A -background task runs at a configurable interval (default: 60 seconds) and -removes entries that haven't been accessed within the cleanup interval. This -prevents unbounded memory growth. +background task runs every 60 seconds (configurable) and removes entries +whose last access timestamp is older than a configurable eviction age +(default: 300 seconds / 5 minutes). This prevents unbounded memory growth +while preserving recent entries that may still receive traffic. ### Fail2ban Integration @@ -55,7 +56,7 @@ format decision. The log format uses `key=value` pairs with a `RATE_LIMIT` prefix: ``` -RATE_LIMIT client_ip=X.X.X.X host=Y.Z path=/W status=429 +RATE_LIMIT client_ip=203.0.113.50 host=Y.Z path=/W status=429 ``` A corresponding fail2ban filter and jail configuration are provided as part @@ -71,15 +72,15 @@ log entries: 1. **Access logs**: Every proxied request is logged at `info` level with structured fields. - ``` - REQUEST client_ip=1.2.3.4 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45 - ``` +``` +REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45 +``` 2. **Event logs**: Rate limits, TLS errors, upstream failures, config reloads, etc. 
``` - RATE_LIMIT client_ip=1.2.3.4 host=git.alk.dev path=/login status=429 + RATE_LIMIT client_ip=203.0.113.50 host=git.alk.dev path=/login status=429 UPSTREAM_ERROR host=git.alk.dev upstream=127.0.0.1:3000 error="connection refused" CONFIG_RELOAD status=success sites=1 ``` diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md index 26caf8d..ba70f33 100644 --- a/docs/architecture/overview.md +++ b/docs/architecture/overview.md @@ -8,10 +8,12 @@ last_updated: 2026-06-11 ## Vision A memory-safe, minimal reverse proxy that replaces our vulnerable nginx instance -for forward-proxying to backend services. The proxy terminates TLS, injects +for forwarding requests to backend services. The proxy terminates TLS, injects standard proxy headers, enforces rate limits, and forwards requests to upstream -services — with operational feature parity for our current single-domain Gitea -setup. +services — supporting multiple domains from initial release. + +This project is open source under dual licensing: MIT OR Apache-2.0, consistent +with standard Rust project licensing. ## Why This Exists @@ -35,65 +37,74 @@ details. 
### In Scope -- **Phase 1**: Replace nginx for `git.alk.dev` with feature parity - - TLS termination with ACME (Let's Encrypt) certificate management +- **Phase 1**: Multi-site reverse proxy with TLS termination + - TLS termination with ACME (Let's Encrypt) multi-domain certificate management - Manual certificate paths as fallback mode - HTTP → HTTPS redirect - - Reverse proxy to Gitea at `127.0.0.1:3000` + - Host-based routing to multiple upstream services + - Reverse proxy to Gitea at `127.0.0.1:3000` (git.alk.dev) + - Reverse proxy to Deno/Fresh container for alk.dev (simple pass-through) - Proxy header injection (Host, X-Real-IP, X-Forwarded-For, X-Forwarded-Proto) - - Request rate limiting with fail2ban-compatible logging (global per-IP; per-site in Phase 2) - - 100 MB body size limit (global; per-site in Phase 2) + - Request rate limiting with fail2ban-compatible logging (global per-IP) + - 100 MB body size limit (global) - Configurable bind address (no `0.0.0.0` default) - Health check endpoint - Graceful shutdown (SIGTERM handling) - Systemd unit file + - Dual licensing: MIT OR Apache-2.0 -- **Phase 2**: Multi-site support - - SNI-based TLS routing for multiple domains - - Config file for site definitions - - Dynamic config reload (ArcSwap pattern) - -- **Phase 3**: Operational hardening +- **Phase 2**: Operational hardening + - Per-site rate limits and body limits + - Per-site upstream timeouts - Metrics endpoint (Prometheus-compatible) - Connection limits and timeouts - Log rotation +- **Phase 3**: Future enhancements + - Wildcard subdomain support + - Per-site TLS overrides (manual certs for specific domains) + - Unix domain socket config reload API + ### Out of Scope - HTTP/2 or HTTP/3 proxying (services that need these run their own native - Rust servers — e.g., `api.alk.dev`) + Rust servers — e.g., `api.alk.dev` runs its own HTTP/2+ server) - Load balancing or round-robin upstream selection - WebSocket proxying (can be added later if needed) - Static 
file serving - Access control beyond rate limiting (no auth, no IP allowlists in Phase 1) - CGI, SCGI, uWSGI, FastCGI +- Per-site TLS configuration (all domains share one ACME config in Phase 1) ## Architecture ``` - ┌────────────────────────────────────┐ - │ reverse-proxy (Rust/axum) │ + ┌────────────────────────────────────┐ + │ reverse-proxy (Rust/axum) │ config.toml ──────► │ StaticConfig + DynamicConfig │ - │ (ArcSwap for hot-reload) │ - │ │ + │ (ArcSwap for hot-reload) │ + │ │ bind_addr:80 ──► │ HTTP listener → 301 redirect │ - │ to HTTPS │ - │ │ + │ to HTTPS │ + │ │ bind_addr:443 ──► │ TLS listener (tokio-rustls) │ - │ ├─ ACME mode: rustls-acme resolver │ - │ │ (auto cert provisioning/renewal) │ - │ └─ Manual mode: cert/key file paths │ - │ │ - │ axum router │ - │ ├─ Host-based routing │ - │ ├─ Rate limiting middleware │ - │ ├─ Proxy header injection │ - │ ├─ Body size limit (100MB) │ - │ └─ Reverse proxy handler │ - │ └─ hyper Client → upstream │ - │ │ - │ /health → 200 OK │ - └────────────────────────────────────┘ + │ ├─ ACME mode: rustls-acme resolver │ + │ │ (multi-domain SAN cert, │ + │ │ auto-provision & renew) │ + │ └─ Manual mode: cert/key file paths │ + │ │ + │ axum router │ + │ ├─ Host-based routing │ + │ │ ├─ git.alk.dev → :3000 │ + │ │ └─ alk.dev → :8080 │ + │ ├─ Rate limiting middleware │ + │ ├─ Proxy header injection │ + │ ├─ Body size limit (100MB) │ + │ └─ Reverse proxy handler │ + │ └─ hyper Client → upstream │ + │ │ + │ /health → 200 OK │ + └────────────────────────────────────┘ ``` ## Crate Dependencies @@ -147,7 +158,7 @@ All design decisions are documented as ADRs in [decisions/](decisions/). 
| ADR | Decision | Summary | |-----|----------|---------| | [001](decisions/001-rust-axum.md) | Rust with axum | Memory safety eliminates the bug class causing nginx CVEs; axum provides ergonomic tower integration | -| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — axum-reverse-proxy adds unnecessary complexity | +| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream per domain — simpler than a general proxy library | | [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support | | [004](decisions/004-rustls-acme.md) | ACME-primary certificate management | Eliminates certbot dependency; automatic provisioning and renewal | | [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly, not axum-server | Full control over TLS config, ACME resolver integration, cipher suite configuration | @@ -155,6 +166,8 @@ All design decisions are documented as ADRs in [decisions/](decisions/). | [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban | | [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config with ArcSwap | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap | | [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP | +| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release; avoids config migration later | +| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme | ## Open Questions @@ -163,4 +176,4 @@ questions affecting this document: - **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open) - **OQ-03**: Should the health check endpoint be on a separate port? 
(open) -- **OQ-05**: Should the proxy bind to multiple addresses or just one? (open) \ No newline at end of file +- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual domains? (open) \ No newline at end of file diff --git a/docs/architecture/proxy.md b/docs/architecture/proxy.md index 46e83ce..15bbae9 100644 --- a/docs/architecture/proxy.md +++ b/docs/architecture/proxy.md @@ -14,8 +14,9 @@ injection, body size limits), and forwards it to the upstream service. ## Why It Exists This component replaces nginx's `proxy_pass` directive. For our use case — -single upstream per domain, no load balancing, no HTTP/2 proxying — a custom -handler is simpler and more maintainable than a general-purpose proxy library. +one upstream per domain across multiple domains, no load balancing, no HTTP/2 +proxying — a custom handler is simpler and more maintainable than a +general-purpose proxy library (ADR-002, ADR-010). ## Architecture @@ -140,9 +141,9 @@ services typically run on the same host (e.g., `127.0.0.1:3000`). The `upstream_scheme` field in each site's configuration allows specifying `https://` for upstreams that require TLS (e.g., separate hosts or secure internal services). -For the initial deployment (`git.alk.dev` → `127.0.0.1:3000`), the upstream -connection uses plain HTTP, as TLS between the proxy and Gitea on loopback is -unnecessary. +For the initial deployment, upstream connections use plain HTTP (e.g., +`git.alk.dev` → `127.0.0.1:3000`, `alk.dev` → `127.0.0.1:8080`) since TLS +between the proxy and backend services on loopback is unnecessary. ## Body Size Limit @@ -157,8 +158,9 @@ All design decisions are documented as ADRs in [decisions/](decisions/). 
| ADR | Decision | Summary | |-----|----------|---------| -| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — simpler than a general proxy library | +| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | One upstream per domain — simpler than a general proxy library | | [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban | +| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release | ## Open Questions diff --git a/docs/architecture/tls.md b/docs/architecture/tls.md index 572c4e8..3d97658 100644 --- a/docs/architecture/tls.md +++ b/docs/architecture/tls.md @@ -57,10 +57,11 @@ no deploy hooks. **How it works:** -1. `AcmeCertProvider` configures the ACME client with the domain, cache +1. `AcmeCertProvider` configures the ACME client with the domain list, cache directory, and Let's Encrypt directory (staging or production). -2. `AcmeConfig::new(vec![domain])` creates an ACME configuration for the - domain. +2. `AcmeConfig::new(domains)` creates an ACME configuration for all listed + domains. Let's Encrypt will issue a single SAN certificate covering all + domains. 3. The ACME state machine runs as a background tokio task, handling: - Account registration with Let's Encrypt - Certificate ordering @@ -75,9 +76,9 @@ no deploy hooks. **Configuration:** ```toml -[tls] +[server.tls] mode = "acme" -acme_domain = "git.alk.dev" +acme_domains = ["git.alk.dev", "alk.dev"] acme_cache_dir = "/var/lib/reverse-proxy/acme-cache" acme_directory = "production" # or "staging" for testing ``` @@ -100,13 +101,8 @@ key_path = "/etc/letsencrypt/live/git.alk.dev/privkey.pem" ``` Certificate files are loaded once at startup using `rustls_pemfile`. Manual -mode requires a restart to pick up new certificates. - -**Why not hot-reload manual certs?** ACME mode handles renewal automatically. 
-Manual mode is for cases where you control cert rotation externally (certbot,
-manual renewal). In that case, a SIGHUP-triggered restart is simpler and more
-reliable than file watching. If zero-downtime cert rotation is needed, use ACME
-mode.
+mode requires a restart to pick up new certificates. See ADR-004 for the
+rationale behind making ACME the primary mode and manual mode restart-dependent.
 
 ## TLS Configuration
 
@@ -142,10 +138,13 @@ restrict cipher suites beyond rustls defaults.
 ### ServerConfig Construction
 
 For manual mode, the `ServerConfig` is built with `with_no_client_auth()` and
-`with_single_cert()`, loading the certificate chain and private key from disk.
+a custom `ResolvesServerCert` implementation that maps SNI hostnames to
+certificate/key pairs loaded from disk.
 
 For ACME mode, the `ServerConfig` is built with `with_cert_resolver()`, passing
-the `ResolvesServerCertAcme` resolver. The ACME TLS-ALPN-01 protocol identifier
+the `ResolvesServerCertAcme` resolver. The ACME configuration includes all
+domains listed in `acme_domains`, and the resolver manages a single SAN
+certificate covering all of them. The ACME TLS-ALPN-01 protocol identifier
 (`acme-tls/1`) must be registered in the `alpn_protocols` list so the server can
 respond to TLS-ALPN-01 challenges.
 
@@ -154,28 +153,39 @@ versions (TLS 1.2 and TLS 1.3).
 
 ## SNI-Based Certificate Selection
 
-### Current (Single Domain)
+### ACME Mode (Multi-Domain)
 
-For single-domain setups, SNI selection is trivial: there's only one
-certificate, so `with_single_cert()` or `ResolvesServerCertAcme` (which
-handles the domain) is sufficient.
-
-### Future (Multi-Domain)
-
-When multiple domains are served, SNI selection works as follows:
+In ACME mode, `rustls-acme` manages a single SAN certificate covering all
+configured domains. The `ResolvesServerCertAcme` resolver automatically serves
+the correct certificate during the TLS handshake.
 
 1. **TLS handshake**: The client sends the SNI extension indicating which
    hostname it's connecting to.
-2. **Certificate resolution**: In ACME mode, `ResolvesServerCertAcme` handles
-   this automatically — it stores certificates keyed by domain. In manual mode,
-   a custom `ResolvesServerCert` implementation maps SNI hostname to the
-   correct `CertifiedKey`.
+2. **Certificate resolution**: `ResolvesServerCertAcme` matches the SNI
+   hostname against the provisioned certificate's Subject Alternative Names
+   and serves the certificate.
 3. **HTTP routing**: After the TLS handshake, axum's `Host` extractor routes
    the request to the correct site handler based on the `Host` header.
 
 This is the same pattern nginx uses — SNI selects the cert during TLS, then
-`Host` header selects the server block. In manual mode, a `ResolvesServerCert`
-implementation maps SNI hostname to the correct `CertifiedKey`.
+`Host` header selects the server block. ACME mode handles this automatically
+through the cert resolver.
+
+### Manual Mode (Multi-Domain)
+
+In manual mode, a custom `ResolvesServerCert` implementation is required to
+map SNI hostnames to the correct `CertifiedKey`. This implementation:
+
+1. Loads certificate files at startup (or on SIGHUP for reload)
+2. Maps each domain name to its certificate chain and private key
+3. During the TLS handshake, looks up the SNI hostname and returns the
+   matching `CertifiedKey`
+
+The custom resolver must handle the case where no matching certificate exists
+for the SNI hostname — in this case, the handshake fails, which is the
+correct behavior (we don't serve a default certificate for unknown domains).
+
+See [open-questions.md](open-questions.md) OQ-07 for per-site TLS overrides.
 
 ## HTTP Listener (Port 80)
 
@@ -211,10 +221,14 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 |-----|----------|---------|
 | [004](decisions/004-rustls-acme.md) | ACME-primary cert management | Eliminates certbot; automatic provisioning and renewal |
 | [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly | Full control over TLS config and ACME resolver integration |
+| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
+| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme |
 
 ## Open Questions
 
 Open questions are tracked in [open-questions.md](open-questions.md). Key
 questions affecting this document:
 
-- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
\ No newline at end of file
+- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
+- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual
+  domains? (open)
\ No newline at end of file
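
Note on the manual-mode resolver described in the tls.md hunk above: the lookup it calls for (map each domain to its certificate, return nothing for an unknown SNI hostname so the handshake fails) can be sketched with the standard library alone. This is a minimal sketch, not the project's implementation. `CertEntry`, `SniCertMap`, and `normalize` are hypothetical names, and `CertEntry` is only a stand-in for rustls's `CertifiedKey` so the example compiles without external crates. The lowercasing and trailing-dot handling are assumptions about how hostnames might be normalized, not something the patch specifies.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for `rustls::sign::CertifiedKey`; it records only
/// where the PEM files live so the sketch stays dependency-free.
#[derive(Debug, PartialEq)]
pub struct CertEntry {
    pub cert_path: String,
    pub key_path: String,
}

/// Maps SNI hostnames to certificate entries registered at startup.
#[derive(Debug, Default)]
pub struct SniCertMap {
    by_host: HashMap<String, Arc<CertEntry>>,
}

impl SniCertMap {
    pub fn new() -> Self {
        Self::default()
    }

    /// Registers the certificate for one domain (stored in normalized form).
    pub fn insert(&mut self, host: &str, entry: CertEntry) {
        self.by_host.insert(normalize(host), Arc::new(entry));
    }

    /// Returns the entry matching the SNI hostname. `None` means there is no
    /// SNI or no matching domain; the caller then fails the handshake rather
    /// than serving a default certificate.
    pub fn resolve(&self, sni: Option<&str>) -> Option<Arc<CertEntry>> {
        let host = normalize(sni?);
        self.by_host.get(&host).cloned()
    }
}

/// Assumed normalization: ASCII-lowercase and drop trailing root dots.
fn normalize(host: &str) -> String {
    host.trim_end_matches('.').to_ascii_lowercase()
}
```

In a real resolver, `resolve` would presumably sit behind an implementation of rustls's `ResolvesServerCert` trait, taking the hostname from `ClientHello::server_name()` and returning an `Option<Arc<CertifiedKey>>`. Returning `None` there is what makes the handshake fail for unknown domains.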