Expand architecture: multi-site Phase 1, multi-domain TLS, fix review issues

Promote multi-site support from Phase 2 to Phase 1 (ADR-010): the proxy must
support git.alk.dev and alk.dev from initial release. Add multi-domain TLS
configuration (ADR-011): acme_domains array replaces acme_domain string,
single SAN certificate via rustls-acme.

Key changes:
- ADR-010: Multi-site in Phase 1; avoids config format migration later
- ADR-011: Multi-domain TLS; single SAN cert, acme_domains Vec<String>
- ADR-002: Updated rationale for multi-site (one upstream per domain)
- overview.md: Phase 1 now includes multi-site, alk.dev pass-through,
  dual licensing (MIT OR Apache-2.0), real IP removed
- config.md: acme_domain → acme_domains, TOML example shows both sites,
  validation adds unique host check, real IP replaced with 203.0.113.10
- tls.md: Multi-domain SNI section moved from Future to current, manual
  mode uses ResolvesServerCert for SNI mapping, TOML header fixed
- proxy.md: Updated for multi-site, removed single-domain language
- operations.md: RFC 5737 documentation IPs, clarified rate limit eviction
  semantics (distinct scan interval vs eviction age)
- open-questions.md: OQ-05 resolved (single bind_addr sufficient), new
  OQ-07 (per-site TLS overrides)

Review fixes:
- acme_domains (plural) consistently used across all docs and diagram
- ADR-011 clearly scopes acme_domain as previous design
- Inline decision rationale extracted: tls.md hot-reload → ADR-004 ref,
  config.md static/dynamic → ADR-008 ref
- TOML section headers consistent (server.tls)
2026-06-11 08:50:03 +00:00
parent 8ee6284b62
commit 7efc142406
10 changed files with 356 additions and 108 deletions
--- a/docs/architecture/decisions/002-custom-proxy-handler.md
+++ b/docs/architecture/decisions/002-custom-proxy-handler.md
@@ -16,8 +16,9 @@ available:
 2. **Custom handler** (Felix Knorr pattern): Build a handler using hyper's
   `Client` to forward requests. ~50-100 lines of Rust for our needs.

-Our use case is minimal: single upstream per domain, single domain, no load
-balancing, no retry, no HTTP/2 proxying.
+Our use case is minimal: single upstream per domain, no load balancing, no
+retry, no HTTP/2 proxying. While the proxy supports multiple domains
+(ADR-010), each domain routes to exactly one upstream.

 ## Decision

@@ -31,6 +32,8 @@ project's channel proxy.
  path-based routing to multiple backends)
 - Our proxy case is the simplest possible: match a Host header, forward the
  entire request to a single upstream, stream the response back
+- Multi-domain support (ADR-010) doesn't change this — each domain still maps
+  to one upstream
 - The Felix Knorr pattern is proven, idiomatic, and ~50-100 lines
 - We maintain full control over header injection, error handling, and upstream
  connection behavior
@@ -46,11 +49,12 @@ project's channel proxy.

 **Negative:**
 - We implement and maintain proxy logic ourselves (but it's trivial for our
-  use case)
+  use case — each domain maps to one upstream)
 - If requirements grow to load balancing or retry, we'd need to add that
  ourselves or switch to `axum-reverse-proxy`

 ## References

 - [proxy.md](../proxy.md)
+- [ADR-010](010-multi-site-phase1.md) (multi-site in Phase 1)
 - Felix Knorr, "Replacing nginx with axum" (felix-knorr.net/posts/2024-10-13-replacing-nginx-with-axum.html)
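
Review note on the ADR-002 change above: the "match a Host header, forward to a single upstream" rule can be sketched with the standard library alone. This is a minimal illustration, not the handler from proxy.md; the `Site` struct, its field names, `normalize_host`, `resolve_upstream`, and the upstream addresses in the usage note are all assumptions made for this example.

```rust
/// One `[[sites]]` entry: a hostname and the single upstream it forwards to.
/// Field names are hypothetical, not taken from config.md.
#[derive(Debug, Clone, PartialEq)]
pub struct Site {
    pub host: String,
    pub upstream: String,
}

/// Strip an optional numeric `:port` suffix from a Host header value and
/// lowercase the remainder, since hostnames compare case-insensitively.
fn normalize_host(header: &str) -> String {
    let trimmed = header.trim();
    let name = match trimmed.rsplit_once(':') {
        // Only treat the suffix as a port if it is non-empty and all digits.
        Some((name, port)) if !port.is_empty() && port.bytes().all(|b| b.is_ascii_digit()) => name,
        _ => trimmed,
    };
    name.to_ascii_lowercase()
}

/// Return the upstream for a request's Host header, or `None` when no
/// configured site matches. Each host maps to exactly one upstream.
pub fn resolve_upstream<'a>(sites: &'a [Site], host_header: &str) -> Option<&'a str> {
    let host = normalize_host(host_header);
    sites
        .iter()
        .find(|site| site.host.eq_ignore_ascii_case(&host))
        .map(|site| site.upstream.as_str())
}
```

With two sites configured for `git.alk.dev` and `alk.dev` (upstreams such as `127.0.0.1:3000` and `127.0.0.1:8000` are placeholders), a request for `ALK.dev:443` resolves to the second upstream, while an unconfigured host such as `api.alk.dev` yields `None`; what the proxy returns in that case is left to proxy.md.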
--- a/docs/architecture/decisions/010-multi-site-phase1.md
+++ b/docs/architecture/decisions/010-multi-site-phase1.md
@@ -0,0 +1,90 @@
+# ADR-010: Multi-Site Support in Phase 1
+
+## Status
+
+Accepted
+
+## Context
+
+The original architecture phased multi-site support into Phase 2, treating
+Phase 1 as a single-domain replacement for nginx serving only `git.alk.dev`.
+This was based on the assumption that only one domain needed proxying initially.
+
+However, `alk.dev` (the bare domain) will need proxying in the near future.
+While `alk.dev` is a simple case — proxying to a Deno/Fresh container with no
+special requirements — the proxy must support multiple sites from day one. The
+config format, routing logic, and TLS certificate provisioning all need
+multi-site awareness.
+
+Additionally, `api.alk.dev` is explicitly out of scope (it runs its own
+HTTP/2+ server natively), but the proxy must not prevent future sites from
+being added.
+
+The cost of deferring multi-site is high: we'd need a config format migration,
+routing logic rewrite, and TLS cert management changes later. Supporting
+multi-site from the start costs very little — the config format just uses an
+array of sites (which it already does), host-based routing is trivial in axum,
+and `rustls-acme` supports multi-domain certificates natively.
+
+## Decision
+
+Move multi-site support from Phase 2 into Phase 1. The proxy supports multiple
+sites from the initial release:
+
+- `[[sites]]` array in config (already the planned format)
+- Host-based routing via axum's `Host` extractor (already the planned approach)
+- Multi-domain ACME certificate provisioning via `rustls-acme`
+- Each site maps a hostname to an upstream address
+
+Phase 1 scope becomes:
+
+1. Multi-site reverse proxy with TLS termination
+2. ACME certificate management (multi-domain)
+3. HTTP → HTTPS redirect
+4. Rate limiting, logging, health check, graceful shutdown
+5. Systemd integration
+
+Phase 2 scope shifts to operational hardening:
+
+1. Per-site rate limits and body limits
+2. Per-site upstream timeouts
+3. Metrics endpoint (Prometheus-compatible)
+4. Connection limits and timeouts
+5. Log rotation
+
+Phase 3 remains future enhancements.
+
+## Rationale
+
+- The config format already uses `[[sites]]` — no format change needed
+- Host-based routing is the natural axum pattern and was already planned
+- `rustls-acme` accepts a list of domains, so multi-domain is its default usage
+- The cost of adding multi-site later (config migration, routing rewrite,
+  cert management changes) far exceeds the cost of supporting it now (a
+  slightly larger testing surface)
+- `alk.dev` is confirmed as a near-term need, not a hypothetical
+- The proxy's value proposition is being a memory-safe reverse proxy for *our
+  infrastructure*, which has multiple domains
+
+## Consequences
+
+**Positive:**
+- No config format migration needed later
+- `alk.dev` can be added to the config without code changes
+- TLS cert management handles multiple domains from the start
+- Frees Phase 2 for operational hardening instead of multi-site work
+
+**Negative:**
+- Slightly more testing surface (must verify correct routing with multiple
+  sites)
+- Must test multi-domain ACME provisioning (not just single-domain)
+- Wildcard or fallback site behavior needs to be defined (addressed in
+  OQ-07)
+
+## References
+
+- [overview.md](../overview.md)
+- [config.md](../config.md)
+- [tls.md](../tls.md)
+- [proxy.md](../proxy.md)
+- ADR-002 (custom proxy handler — rationale updated for multi-site)
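
Review note on ADR-010: the commit message mentions that config.md validation gains a unique host check. A minimal sketch of such a check follows, assuming hostnames are compared case-insensitively; the function name `check_unique_hosts` and the error wording are hypothetical and not taken from config.md.

```rust
use std::collections::HashSet;

/// Return an error naming the first hostname that appears more than once.
/// Comparison is case-insensitive, which is an assumption of this sketch.
pub fn check_unique_hosts(hosts: &[&str]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for host in hosts {
        let key = host.trim().to_ascii_lowercase();
        // `insert` returns false when the key was already present.
        if !seen.insert(key.clone()) {
            return Err(format!("duplicate site host: {key}"));
        }
    }
    Ok(())
}
```

Rejecting duplicates at load time keeps host-based routing unambiguous: with two entries for the same host, which upstream wins would otherwise depend on iteration order.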
--- a/docs/architecture/decisions/011-multi-domain-tls.md
+++ b/docs/architecture/decisions/011-multi-domain-tls.md
@@ -0,0 +1,92 @@
+# ADR-011: Multi-Domain TLS Configuration
+
+## Status
+
+Accepted
+
+## Context
+
+With multi-site support in Phase 1 (ADR-010), the TLS configuration must
+support multiple domains. The previous design used a single `tls.acme_domain`
+string field, which only works for one domain.
+
+There are several approaches to multi-domain TLS:
+
+1. **Single ACME config with domain list**: `acme_domains = ["git.alk.dev",
+   "alk.dev"]` — one certificate covering all domains (SAN certificate)
+2. **Per-site TLS configuration**: Each site entry specifies its own TLS
+   mode (ACME or manual) and domain — more flexible but complex
+3. **Hybrid**: A global TLS section with ACME domains, plus per-site overrides
+   for manual certificates
+
+For our use case, all proxied domains use the same ACME certificate authority
+(Let's Encrypt) and the same challenge type (TLS-ALPN-01). There's no need
+for per-site TLS configuration in Phase 1.
+
+## Decision
+
+Use a single ACME configuration with a list of domains, producing one SAN
+certificate covering all proxied domains. Manual mode uses certificate file
+paths (single cert file with all domains, or one cert per domain resolved via
+SNI).
+
+The config format changes from the previous single-domain format:
+
+```toml
+# Previous (single-domain) format — no longer used
+[tls]
+mode = "acme"
+acme_domain = "git.alk.dev"  # single string
+```
+
+To the current multi-domain format:
+
+```toml
+[tls]
+mode = "acme"
+acme_domains = ["git.alk.dev", "alk.dev"]  # array of strings
+```
+
+In ACME mode, `rustls-acme` provisions a single certificate covering all
+listed domains via Subject Alternative Names (SAN). This is the standard
+Let's Encrypt approach for multi-domain certificates.
+
+In manual mode, the cert and key files must cover all domains (either a SAN
+certificate or separate certificates resolved via SNI).
+
+## Rationale
+
+- A single SAN certificate is simpler to manage (one renewal, one cert)
+- Let's Encrypt supports SAN certificates with up to 100 domains
+- `rustls-acme` accepts `Vec<String>` for domain lists — this is its natural
+  API
+- All our domains use the same ACME configuration (Let's Encrypt production,
+  TLS-ALPN-01 challenge)
+- Per-site TLS overrides add complexity with no current benefit
+- If per-site TLS configuration is needed later (e.g., a site with a manual
+  cert), it can be added as an optional override without changing the global
+  config structure
+
+## Consequences
+
+**Positive:**
+- Single certificate for all domains — simpler renewal, simpler cert management
+- Matches `rustls-acme`'s natural API (`AcmeConfig::new` takes the domain list)
+- One cert for all domains, so `rustls-acme` handles SNI resolution automatically
+- Config format is a minimal change from single-domain
+
+**Negative:**
+- Adding or removing a domain requires re-provisioning the certificate (ACME
+  handles this automatically, but it means cert changes affect all domains)
+- If one domain fails ACME validation, the entire cert renewal fails (all
+  domains must be validated) — mitigated by Let's Encrypt's domain-level
+  validation
+- Per-site TLS configuration (e.g., a domain with a manual cert) requires a
+  future config extension (OQ-07)
+
+## References
+
+- [tls.md](../tls.md)
+- [config.md](../config.md)
+- ADR-010 (multi-site in Phase 1)
+- ADR-004 (ACME-primary certificate management)
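
Review note on ADR-011: with a single global `acme_domains` list, a site whose host is missing from that list would have no certificate covering it. The ADR does not say whether config validation checks this, so the following is only a sketch of one possible check; `uncovered_hosts` is a hypothetical helper, and exact-match comparison (no wildcard handling) is an assumption.

```rust
use std::collections::BTreeSet;

/// Return the site hosts that do not appear in `acme_domains`, sorted and
/// lowercased. An empty result means every site is covered by the SAN list.
pub fn uncovered_hosts(site_hosts: &[&str], acme_domains: &[&str]) -> Vec<String> {
    let covered: BTreeSet<String> = acme_domains
        .iter()
        .map(|d| d.to_ascii_lowercase())
        .collect();
    let missing: BTreeSet<String> = site_hosts
        .iter()
        .map(|h| h.to_ascii_lowercase())
        .filter(|h| !covered.contains(h))
        .collect();
    missing.into_iter().collect()
}
```

For the two-site configuration in this ADR, `acme_domains = ["git.alk.dev", "alk.dev"]` covers both hosts; dropping `alk.dev` from the list would report it as uncovered.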