Expand architecture: multi-site Phase 1, multi-domain TLS, fix review issues

Promote multi-site support from Phase 2 to Phase 1 (ADR-010): the proxy must
support git.alk.dev and alk.dev from initial release. Add multi-domain TLS
configuration (ADR-011): acme_domains array replaces acme_domain string,
single SAN certificate via rustls-acme.

Key changes:
- ADR-010: Multi-site in Phase 1 (avoids config format migration later)
- ADR-011: Multi-domain TLS (single SAN cert, acme_domains Vec<String>)
- ADR-002: Updated rationale for multi-site (one upstream per domain)
- overview.md: Phase 1 now includes multi-site, alk.dev pass-through, dual
  licensing (MIT OR Apache-2.0), real IP removed
- config.md: acme_domain → acme_domains, TOML example shows both sites,
  validation adds unique host check, real IP replaced with 203.0.113.10
- tls.md: Multi-domain SNI section moved from Future to current, manual mode
  uses ResolvesServerCert for SNI mapping, TOML header fixed
- proxy.md: Updated for multi-site, removed single-domain language
- operations.md: RFC 5737 documentation IPs, clarified rate limit eviction
  semantics (distinct scan interval vs eviction age)
- open-questions.md: OQ-05 resolved (single bind_addr sufficient), new OQ-07
  (per-site TLS overrides)

Review fixes:
- acme_domains (plural) consistently used across all docs and diagram
- ADR-011 clearly scopes acme_domain as previous design
- Inline decision rationale extracted: tls.md hot-reload → ADR-004 ref,
  config.md static/dynamic → ADR-008 ref
- TOML section headers consistent (server.tls)
2026-06-11 08:50:03 +00:00
parent 8ee6284b62
commit 7efc142406
10 changed files with 356 additions and 108 deletions
--- a/docs/architecture/README.md
+++ b/docs/architecture/README.md
@@ -14,6 +14,10 @@ memory-safe Rust/axum reverse proxy. The primary motivation is CVE-2026-42945
 (unauthenticated RCE in nginx's rewrite module) and the broader pattern of
 memory corruption bugs in nginx's C codebase.

+The proxy supports multiple domains from initial release (git.alk.dev and
+alk.dev), with per-domain host-based routing and a single multi-domain SAN
+certificate via ACME.
+
 ## Architecture Documents

 | Document | Status | Description |
@@ -37,6 +41,8 @@ memory corruption bugs in nginx's C codebase.
 | [007](decisions/007-custom-log-format.md) | Custom Structured Log Format | Accepted |
 | [008](decisions/008-static-dynamic-config-split.md) | Static/Dynamic Config Split with ArcSwap | Accepted |
 | [009](decisions/009-signal-handling.md) | Signal Handling Strategy | Accepted |
+| [010](decisions/010-multi-site-phase1.md) | Multi-Site Support in Phase 1 | Accepted |
+| [011](decisions/011-multi-domain-tls.md) | Multi-Domain TLS Configuration | Accepted |

 ## Open Questions

@@ -48,8 +54,9 @@ See [open-questions.md](open-questions.md) for the full tracker.
 | ~~OQ-02~~ | ~~What log format should fail2ban consume?~~ | ~~high~~ | **resolved** (ADR-007) |
 | OQ-03 | Should the health check endpoint be on a separate port? | low | open |
 | OQ-04 | Config reload: SIGHUP only or also Unix socket API? | low | open |
-| OQ-05 | Should the proxy bind to multiple addresses? | low | open |
+| ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) |
 | OQ-06 | Should upstream timeouts be configurable per-site? | low | open |
+| OQ-07 | Should per-site TLS overrides be supported for mixed ACME/manual domains? | low | open |

 ## Document Lifecycle

--- a/docs/architecture/config.md
+++ b/docs/architecture/config.md
@@ -39,7 +39,7 @@ config.toml
 │  http_port           │     │  rate_limit           │
 │  https_port          │     │  body_limit           │
 │  tls.mode            │     │  proxy_headers        │
-│  tls.acme_domain     │     │                       │
+│  tls.acme_domains    │     │                       │
 │  tls.cert_path       │     │  ← ArcSwap →          │
 │  tls.key_path        │     │  ConfigReloadHandle    │
 │  tls.cache_dir       │     │  .reload(new_config)  │
@@ -59,11 +59,11 @@ Immutable after startup. Changes require a process restart.

 | Field | Type | Description |
 |-------|------|-------------|
-| `bind_addr` | `String` | IP address to bind to (e.g., `"15.235.125.95"`) |
+| `bind_addr` | `String` | IP address to bind to (must be explicit, no `0.0.0.0`) |
 | `http_port` | `u16` | Port for HTTP→HTTPS redirect (default: `80`; set to `0` to disable) |
 | `https_port` | `u16` | Port for TLS listener (default: `443`) |
 | `tls.mode` | `"acme"` or `"manual"` | Certificate provisioning mode |
-| `tls.acme_domain` | `String` | Domain for ACME (ACME mode only) |
+| `tls.acme_domains` | `Vec<String>` | Domains for ACME SAN certificate (ACME mode only) |
 | `tls.acme_cache_dir` | `String` | ACME state cache directory |
 | `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory |
 | `tls.cert_path` | `String` | Certificate file path (manual mode only) |
@@ -71,9 +71,10 @@ Immutable after startup. Changes require a process restart.
 | `log_level` | `"trace"`, `"debug"`, `"info"`, `"warn"`, `"error"` | Logging verbosity |
 | `log_format` | `"text"` or `"json"` | Log output format |

-**Why these are static:** Changing bind addresses, ports, or TLS mode requires
-creating new listeners and TLS configurations — operations that fundamentally
-require a restart. There's no safe way to change these at runtime.
+**Why these are static:** See ADR-008 for the rationale behind the
+static/dynamic split. In summary: changing bind addresses, ports, or TLS mode
+requires creating new listeners and TLS configurations — operations that
+fundamentally require a restart.

 ### DynamicConfig

@@ -95,10 +96,10 @@ connections immediately.
 | `upstream` | `String` | Upstream address (e.g., `"127.0.0.1:3000"`) |
 | `upstream_scheme` | `"http"` or `"https"` | Protocol for upstream connection (default: `"http"`) |

-**Why these are dynamic:** Site definitions and rate limits are per-request
-concerns. Adding a site or changing a rate limit should not require restarting
-the proxy and dropping active connections. Rate limits and body limits are
-global settings in Phase 1; per-site configuration for these may be added in
+**Why these are dynamic:** See ADR-008 for the rationale. Site definitions
+and rate limits are per-request concerns that should not require restarting
+the proxy or dropping active connections. Rate limits and body limits are
+global settings in Phase 1; per-site configuration for these is deferred to
 Phase 2.

 ## Config Reload
@@ -136,13 +137,13 @@ config reload, but SIGHUP is sufficient for Phase 1.
 # reverse-proxy config

 [server]
-bind_addr = "15.235.125.95"
+bind_addr = "203.0.113.10"  # Replace with actual bind address
 http_port = 80
 https_port = 443

 [server.tls]
 mode = "acme"                    # "acme" or "manual"
-acme_domain = "git.alk.dev"
+acme_domains = ["git.alk.dev", "alk.dev"]
 acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
 acme_directory = "production"    # "production" or "staging"

@@ -166,6 +167,11 @@ limit_bytes = 104857600          # 100 MB
 host = "git.alk.dev"
 upstream = "127.0.0.1:3000"
 upstream_scheme = "http"
+
+[[sites]]
+host = "alk.dev"
+upstream = "127.0.0.1:8080"
+upstream_scheme = "http"
 ```

 ### Validation
@@ -173,12 +179,13 @@ upstream_scheme = "http"
 On startup, the config is validated:

 1. `bind_addr` is not `0.0.0.0` (must be explicit)
-2. In ACME mode, `acme_domain` must be set
+2. In ACME mode, `acme_domains` must be non-empty
 3. In manual mode, `cert_path` and `key_path` must both be set and the files
   must be readable
 4. Each site must have a `host` and `upstream`
-5. `rate_limit.requests_per_second` must be > 0
-6. `body.limit_bytes` must be > 0
+5. Site `host` values must be unique (no duplicate hostnames)
+6. `rate_limit.requests_per_second` must be > 0
+7. `body.limit_bytes` must be > 0

 On SIGHUP reload, the same validation applies. If the new config fails
 validation, the reload is rejected and the old config remains active. An error
@@ -196,6 +203,8 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 |-----|----------|---------|
 | [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
 | [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config split | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
+| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
+| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains |

 ## Open Questions

@@ -204,3 +213,5 @@ questions affecting this document:

 - **OQ-04**: Should config reload support a Unix domain socket API in addition
  to SIGHUP? (open)
+- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual
+  domains? (open)
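
A minimal sketch of the startup validation listed in the config.md hunk above, using only the standard library. The type and function names (`Config`, `Tls`, `Site`, `validate`) are hypothetical and not taken from the codebase. Two points are assumptions rather than statements from the docs: hostnames are compared case-insensitively for the uniqueness rule, and the "files must be readable" part of rule 3 is left out so the example stays free of filesystem access.

```rust
use std::collections::HashSet;

// Hypothetical, trimmed-down config types: only the fields the rules touch.
#[derive(Debug, Clone)]
pub struct Site {
    pub host: String,
    pub upstream: String,
}

#[derive(Debug, Clone)]
pub enum Tls {
    Acme { acme_domains: Vec<String> },
    Manual { cert_path: String, key_path: String },
}

#[derive(Debug, Clone)]
pub struct Config {
    pub bind_addr: String,
    pub tls: Tls,
    pub sites: Vec<Site>,
    pub requests_per_second: u32,
    pub limit_bytes: u64,
}

/// Applies validation rules 1-7 in order and returns the first failure.
pub fn validate(cfg: &Config) -> Result<(), String> {
    // Rule 1: the bind address must be explicit.
    if cfg.bind_addr == "0.0.0.0" {
        return Err("bind_addr must be explicit, not 0.0.0.0".into());
    }
    // Rules 2 and 3: mode-specific TLS fields (readability check omitted).
    match &cfg.tls {
        Tls::Acme { acme_domains } if acme_domains.is_empty() => {
            return Err("acme_domains must be non-empty in ACME mode".into());
        }
        Tls::Manual { cert_path, key_path }
            if cert_path.is_empty() || key_path.is_empty() =>
        {
            return Err("cert_path and key_path must both be set".into());
        }
        _ => {}
    }
    // Rules 4 and 5: every site is complete and hostnames are unique.
    // Assumption: uniqueness is checked case-insensitively.
    let mut seen = HashSet::new();
    for site in &cfg.sites {
        if site.host.is_empty() || site.upstream.is_empty() {
            return Err("each site needs a host and an upstream".into());
        }
        if !seen.insert(site.host.to_ascii_lowercase()) {
            return Err(format!("duplicate site host: {}", site.host));
        }
    }
    // Rules 6 and 7: limits must be positive.
    if cfg.requests_per_second == 0 {
        return Err("rate_limit.requests_per_second must be > 0".into());
    }
    if cfg.limit_bytes == 0 {
        return Err("body.limit_bytes must be > 0".into());
    }
    Ok(())
}
```

Returning the first error keeps the SIGHUP path simple: a failed `validate` call means the reload is rejected and the old config stays active, as the surrounding text describes.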
--- a/docs/architecture/decisions/002-custom-proxy-handler.md
+++ b/docs/architecture/decisions/002-custom-proxy-handler.md
@@ -16,8 +16,9 @@ available:
 2. **Custom handler** (Felix Knorr pattern): Build a handler using hyper's
   `Client` to forward requests. ~50-100 lines of Rust for our needs.

-Our use case is minimal: single upstream per domain, single domain, no load
-balancing, no retry, no HTTP/2 proxying.
+Our use case is minimal: single upstream per domain, no load balancing, no
+retry, no HTTP/2 proxying. While the proxy supports multiple domains
+(ADR-010), each domain routes to exactly one upstream.

 ## Decision

@@ -31,6 +32,8 @@ project's channel proxy.
  path-based routing to multiple backends)
 - Our proxy case is the simplest possible: match a Host header, forward the
  entire request to a single upstream, stream the response back
+- Multi-domain support (ADR-010) doesn't change this — each domain still maps
+  to one upstream
 - The Felix Knorr pattern is proven, idiomatic, and ~50-100 lines
 - We maintain full control over header injection, error handling, and upstream
  connection behavior
@@ -46,11 +49,12 @@ project's channel proxy.

 **Negative:**
 - We implement and maintain proxy logic ourselves (but it's trivial for our
-  use case)
+  use case — each domain maps to one upstream)
 - If requirements grow to load balancing or retry, we'd need to add that
  ourselves or switch to `axum-reverse-proxy`

 ## References

 - [proxy.md](../proxy.md)
+- [ADR-010](010-multi-site-phase1.md) (multi-site in Phase 1)
 - Felix Knorr, "Replacing nginx with axum" (felix-knorr.net/posts/2024-10-13-replacing-nginx-with-axum.html)
--- a/docs/architecture/decisions/010-multi-site-phase1.md
+++ b/docs/architecture/decisions/010-multi-site-phase1.md
@@ -0,0 +1,90 @@
+# ADR-010: Multi-Site Support in Phase 1
+
+## Status
+
+Accepted
+
+## Context
+
+The original architecture phased multi-site support into Phase 2, treating
+Phase 1 as a single-domain replacement for nginx serving only `git.alk.dev`.
+This was based on the assumption that only one domain needed proxying initially.
+
+However, `alk.dev` (the bare domain) will need proxying in the near future.
+While `alk.dev` is a simple case — proxying to a Deno/Fresh container with no
+special requirements — the proxy must support multiple sites from day one. The
+config format, routing logic, and TLS certificate provisioning all need
+multi-site awareness.
+
+Additionally, `api.alk.dev` is explicitly out of scope (it runs its own
+HTTP/2+ server natively), but the proxy must not prevent future sites from
+being added.
+
+The cost of deferring multi-site is high: we'd need a config format migration,
+routing logic rewrite, and TLS cert management changes later. Supporting
+multi-site from the start costs very little — the config format just uses an
+array of sites (which it already does), host-based routing is trivial in axum,
+and `rustls-acme` supports multi-domain certificates natively.
+
+## Decision
+
+Move multi-site support from Phase 2 into Phase 1. The proxy supports multiple
+sites from the initial release:
+
+- `[[sites]]` array in config (already the planned format)
+- Host-based routing via axum's `Host` extractor (already the planned approach)
+- Multi-domain ACME certificate provisioning via `rustls-acme`
+- Each site maps a hostname to an upstream address
+
+Phase 1 scope becomes:
+
+1. Multi-site reverse proxy with TLS termination
+2. ACME certificate management (multi-domain)
+3. HTTP → HTTPS redirect
+4. Rate limiting, logging, health check, graceful shutdown
+5. Systemd integration
+
+Phase 2 scope shifts to operational hardening:
+
+1. Per-site rate limits and body limits
+2. Per-site upstream timeouts
+3. Metrics endpoint (Prometheus-compatible)
+4. Connection limits and timeouts
+5. Log rotation
+
+Phase 3 remains future enhancements.
+
+## Rationale
+
+- The config format already uses `[[sites]]` — no format change needed
+- Host-based routing is the natural axum pattern and was already planned
+- `rustls-acme` accepts `Vec<domain>` — multi-domain is its default usage
+- The cost of adding multi-site later (config migration, routing rewrite,
+  cert management changes) far exceeds the cost of supporting it now (zero
+  additional complexity)
+- `alk.dev` is confirmed as a near-term need, not a hypothetical
+- The proxy's value proposition is being a memory-safe reverse proxy for *our
+  infrastructure*, which has multiple domains
+
+## Consequences
+
+**Positive:**
+- No config format migration needed later
+- `alk.dev` can be added to the config without code changes
+- TLS cert management handles multiple domains from the start
+- Eliminates an entire phase of work
+
+**Negative:**
+- Slightly more testing surface (must verify correct routing with multiple
+  sites)
+- Must test multi-domain ACME provisioning (not just single-domain)
+- Wildcard or fallback site behavior needs to be defined (addressed in
+  OQ-07)
+
+## References
+
+- [overview.md](../overview.md)
+- [config.md](../config.md)
+- [tls.md](../tls.md)
+- [proxy.md](../proxy.md)
+- ADR-002 (custom proxy handler — rationale updated for multi-site)
--- a/docs/architecture/decisions/011-multi-domain-tls.md
+++ b/docs/architecture/decisions/011-multi-domain-tls.md
@@ -0,0 +1,92 @@
+# ADR-011: Multi-Domain TLS Configuration
+
+## Status
+
+Accepted
+
+## Context
+
+With multi-site support in Phase 1 (ADR-010), the TLS configuration must
+support multiple domains. The previous design used a single `tls.acme_domain`
+string field, which only works for one domain.
+
+There are several approaches to multi-domain TLS:
+
+1. **Single ACME config with domain list**: `acme_domains = ["git.alk.dev",
+   "alk.dev"]` — one certificate covering all domains (SAN certificate)
+2. **Per-site TLS configuration**: Each site entry specifies its own TLS
+   mode (ACME or manual) and domain — more flexible but complex
+3. **Hybrid**: A global TLS section with ACME domains, plus per-site overrides
+   for manual certificates
+
+For our use case, all proxied domains use the same ACME certificate authority
+(Let's Encrypt) and the same challenge type (TLS-ALPN-01). There's no need
+for per-site TLS configuration in Phase 1.
+
+## Decision
+
+Use a single ACME configuration with a list of domains, producing one SAN
+certificate covering all proxied domains. Manual mode uses certificate file
+paths (single cert file with all domains, or one cert per domain resolved via
+SNI).
+
+The config format changes from the previous single-domain format:
+
+```toml
+# Previous (single-domain) format — no longer used
+[tls]
+mode = "acme"
+acme_domain = "git.alk.dev"  # single string
+```
+
+To the current multi-domain format:
+
+```toml
+[tls]
+mode = "acme"
+acme_domains = ["git.alk.dev", "alk.dev"]  # array of strings
+```
+
+In ACME mode, `rustls-acme` provisions a single certificate covering all
+listed domains via Subject Alternative Names (SAN). This is the standard
+Let's Encrypt approach for multi-domain certificates.
+
+In manual mode, the cert and key files must cover all domains (either a SAN
+certificate or separate certificates resolved via SNI).
+
+## Rationale
+
+- A single SAN certificate is simpler to manage (one renewal, one cert)
+- Let's Encrypt supports SAN certificates with up to 100 domains
+- `rustls-acme` accepts `Vec<String>` for domain lists — this is its natural
+  API
+- All our domains use the same ACME configuration (Let's Encrypt production,
+  TLS-ALPN-01 challenge)
+- Per-site TLS overrides add complexity with no current benefit
+- If per-site TLS configuration is needed later (e.g., a site with a manual
+  cert), it can be added as an optional override without changing the global
+  config structure
+
+## Consequences
+
+**Positive:**
+- Single certificate for all domains — simpler renewal, simpler cert management
+- Matches `rustls-acme`'s natural API (`AcmeConfig::new(domains: Vec<String>)`)
+- All domains in one cert means SNI resolution is handled by ACME automatically
+- Config format is a minimal change from single-domain
+
+**Negative:**
+- Adding or removing a domain requires re-provisioning the certificate (ACME
+  handles this automatically, but it means cert changes affect all domains)
+- If one domain fails ACME validation, the entire cert renewal fails (all
+  domains must be validated) — mitigated by Let's Encrypt's domain-level
+  validation
+- Per-site TLS configuration (e.g., a domain with a manual cert) requires a
+  future config extension (OQ-07)
+
+## References
+
+- [tls.md](../tls.md)
+- [config.md](../config.md)
+- ADR-010 (multi-site in Phase 1)
+- ADR-004 (ACME-primary certificate management)
--- a/docs/architecture/open-questions.md
+++ b/docs/architecture/open-questions.md
@@ -21,8 +21,6 @@ last_updated: 2026-06-11
  than the current nginx config.
 - **Cross-references**: ADR-005

-## Logging and Monitoring
-
 ### ~~OQ-02: What log format should fail2ban consume?~~

 - **Origin**: [operations.md](operations.md), [proxy.md](proxy.md)
@@ -33,6 +31,22 @@ last_updated: 2026-06-11
  See ADR-007.
 - **Cross-references**: ADR-007

+### OQ-07: Should per-site TLS overrides be supported for mixed ACME/manual domains?
+
+- **Origin**: [tls.md](tls.md), [config.md](config.md)
+- **Status**: open
+- **Priority**: low
+- **Context**: Phase 1 uses a single TLS configuration (ACME or manual) for all
+  domains. All domains share the same ACME config and certificate. If a future
+  domain needs a manual certificate (e.g., a corporate CA cert) while other
+  domains use ACME, a per-site TLS override would be needed. This would require
+  a custom `ResolvesServerCert` that combines ACME-provisioned certs with
+  manually loaded certs. For now, all proxied domains use the same ACME config,
+  so this is not needed.
+- **Cross-references**: ADR-011
+
+## Logging and Monitoring
+
 ### OQ-03: Should the health check endpoint be on a separate port?

 - **Origin**: [operations.md](operations.md)
@@ -61,15 +75,15 @@ last_updated: 2026-06-11

 ## Deployment

-### OQ-05: Should the proxy bind to multiple addresses or just one?
+### ~~OQ-05: Should the proxy bind to multiple addresses or just one?~~

 - **Origin**: [overview.md](overview.md)
-- **Status**: open
+- **Status**: resolved
 - **Priority**: low
-- **Context**: Current nginx config binds to a specific IP (`15.235.125.95`).
-  The proposed config uses `bind_addr` which could be any IP. For Phase 1, the
-  config will specify a single IP address. Multi-address binding (listening on
-  multiple IPs) is not needed but could be added as an array of addresses.
+- **Resolution**: A single `bind_addr` is sufficient. The proxy binds to one
+  explicit IP address (not `0.0.0.0`). Multi-address binding is not needed for
+  this single-server deployment. If needed in the future, `bind_addr` could be
+  extended to an array. See config.md for the `bind_addr` field.
 - **Cross-references**: None

 ## Proxy
--- a/docs/architecture/operations.md
+++ b/docs/architecture/operations.md
@@ -42,9 +42,10 @@ Requests` and logs the event with structured fields.
 ### State Eviction

 The per-IP token bucket state grows over time as new IPs are seen. A
-background task runs at a configurable interval (default: 60 seconds) and
-removes entries that haven't been accessed within the cleanup interval. This
-prevents unbounded memory growth.
+background task runs every 60 seconds (configurable) and removes entries
+whose last access timestamp is older than a configurable eviction age
+(default: 300 seconds / 5 minutes). This prevents unbounded memory growth
+while preserving recent entries that may still receive traffic.

 ### Fail2ban Integration

@@ -55,7 +56,7 @@ format decision.
 The log format uses `key=value` pairs with a `RATE_LIMIT` prefix:

 ```
-RATE_LIMIT client_ip=X.X.X.X host=Y.Z path=/W status=429
+RATE_LIMIT client_ip=203.0.113.50 host=Y.Z path=/W status=429
 ```

 A corresponding fail2ban filter and jail configuration are provided as part
@@ -71,15 +72,15 @@ log entries:
 1. **Access logs**: Every proxied request is logged at `info` level with
   structured fields.

-   ```
-   REQUEST client_ip=1.2.3.4 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
-   ```
+```
+REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
+```

 2. **Event logs**: Rate limits, TLS errors, upstream failures, config reloads,
   etc.

   ```
-   RATE_LIMIT client_ip=1.2.3.4 host=git.alk.dev path=/login status=429
+   RATE_LIMIT client_ip=203.0.113.50 host=git.alk.dev path=/login status=429
   UPSTREAM_ERROR host=git.alk.dev upstream=127.0.0.1:3000 error="connection refused"
   CONFIG_RELOAD status=success sites=1
   ```
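
A rough sketch of the state eviction semantics clarified in the operations.md hunks above: the scan interval decides how often the sweep runs, while the eviction age decides which entries a sweep removes. `Bucket` and `evict_stale` are hypothetical names, and the map layout is an assumption; the real rate limiter may store its state differently.

```rust
use std::collections::HashMap;
use std::net::IpAddr;
use std::time::{Duration, Instant};

// Hypothetical per-IP token bucket state.
pub struct Bucket {
    pub tokens: f64,
    pub last_access: Instant,
}

/// Removes every entry whose last access is older than `max_age` as of `now`
/// and returns how many entries were dropped. A background task would call
/// this once per scan interval (60 seconds by default in the docs).
pub fn evict_stale(
    buckets: &mut HashMap<IpAddr, Bucket>,
    now: Instant,
    max_age: Duration,
) -> usize {
    let before = buckets.len();
    // saturating_duration_since yields zero if last_access is after `now`,
    // so entries touched during the sweep are always kept.
    buckets.retain(|_, b| now.saturating_duration_since(b.last_access) <= max_age);
    before - buckets.len()
}
```

Passing `now` in as a parameter, instead of calling `Instant::now()` inside the function, keeps the sweep deterministic under test.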
--- a/docs/architecture/overview.md
+++ b/docs/architecture/overview.md
@@ -8,10 +8,12 @@ last_updated: 2026-06-11
 ## Vision

 A memory-safe, minimal reverse proxy that replaces our vulnerable nginx instance
-for forward-proxying to backend services. The proxy terminates TLS, injects
+for forwarding requests to backend services. The proxy terminates TLS, injects
 standard proxy headers, enforces rate limits, and forwards requests to upstream
-services — with operational feature parity for our current single-domain Gitea
-setup.
+services — supporting multiple domains from initial release.
+
+This project is open source under dual licensing: MIT OR Apache-2.0, consistent
+with standard Rust project licensing.

 ## Why This Exists

@@ -35,38 +37,44 @@ details.

 ### In Scope

-- **Phase 1**: Replace nginx for `git.alk.dev` with feature parity
-  - TLS termination with ACME (Let's Encrypt) certificate management
+- **Phase 1**: Multi-site reverse proxy with TLS termination
+  - TLS termination with ACME (Let's Encrypt) multi-domain certificate management
  - Manual certificate paths as fallback mode
  - HTTP → HTTPS redirect
-  - Reverse proxy to Gitea at `127.0.0.1:3000`
+  - Host-based routing to multiple upstream services
+  - Reverse proxy to Gitea at `127.0.0.1:3000` (git.alk.dev)
+  - Reverse proxy to Deno/Fresh container for alk.dev (simple pass-through)
  - Proxy header injection (Host, X-Real-IP, X-Forwarded-For, X-Forwarded-Proto)
-  - Request rate limiting with fail2ban-compatible logging (global per-IP; per-site in Phase 2)
-  - 100 MB body size limit (global; per-site in Phase 2)
+  - Request rate limiting with fail2ban-compatible logging (global per-IP)
+  - 100 MB body size limit (global)
  - Configurable bind address (no `0.0.0.0` default)
  - Health check endpoint
  - Graceful shutdown (SIGTERM handling)
  - Systemd unit file
+  - Dual licensing: MIT OR Apache-2.0

-- **Phase 2**: Multi-site support
-  - SNI-based TLS routing for multiple domains
-  - Config file for site definitions
-  - Dynamic config reload (ArcSwap pattern)
-
-- **Phase 3**: Operational hardening
+- **Phase 2**: Operational hardening
+  - Per-site rate limits and body limits
+  - Per-site upstream timeouts
  - Metrics endpoint (Prometheus-compatible)
  - Connection limits and timeouts
  - Log rotation

+- **Phase 3**: Future enhancements
+  - Wildcard subdomain support
+  - Per-site TLS overrides (manual certs for specific domains)
+  - Unix domain socket config reload API
+
 ### Out of Scope

 - HTTP/2 or HTTP/3 proxying (services that need these run their own native
-  Rust servers — e.g., `api.alk.dev`)
+  Rust servers — e.g., `api.alk.dev` runs its own HTTP/2+ server)
 - Load balancing or round-robin upstream selection
 - WebSocket proxying (can be added later if needed)
 - Static file serving
 - Access control beyond rate limiting (no auth, no IP allowlists in Phase 1)
 - CGI, SCGI, uWSGI, FastCGI
+- Per-site TLS configuration (all domains share one ACME config in Phase 1)

 ## Architecture

@@ -81,11 +89,14 @@ bind_addr:80   ──►  │  HTTP listener → 301 redirect        │
                     │                                      │
 bind_addr:443  ──►  │  TLS listener (tokio-rustls)         │
                     │  ├─ ACME mode: rustls-acme resolver  │
-                    │  │  (auto cert provisioning/renewal) │
+                     │  │  (multi-domain SAN cert,           │
+                     │  │   auto-provision & renew)          │
                     │  └─ Manual mode: cert/key file paths  │
                     │                                      │
                     │  axum router                         │
                     │  ├─ Host-based routing                │
+                     │  │  ├─ git.alk.dev → :3000            │
+                     │  │  └─ alk.dev     → :8080            │
                     │  ├─ Rate limiting middleware          │
                     │  ├─ Proxy header injection            │
                     │  ├─ Body size limit (100MB)           │
@@ -147,7 +158,7 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 | ADR | Decision | Summary |
 |-----|----------|---------|
 | [001](decisions/001-rust-axum.md) | Rust with axum | Memory safety eliminates the bug class causing nginx CVEs; axum provides ergonomic tower integration |
-| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — axum-reverse-proxy adds unnecessary complexity |
+| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream per domain — simpler than a general proxy library |
 | [003](decisions/003-toml-config.md) | TOML configuration format | Rust-native, unambiguous, excellent serde support |
 | [004](decisions/004-rustls-acme.md) | ACME-primary certificate management | Eliminates certbot dependency; automatic provisioning and renewal |
 | [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly, not axum-server | Full control over TLS config, ACME resolver integration, cipher suite configuration |
@@ -155,6 +166,8 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 | [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
 | [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config with ArcSwap | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
 | [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP |
+| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release; avoids config migration later |
+| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme |

 ## Open Questions

@@ -163,4 +176,4 @@ questions affecting this document:

 - **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
 - **OQ-03**: Should the health check endpoint be on a separate port? (open)
-- **OQ-05**: Should the proxy bind to multiple addresses or just one? (open)
+- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual domains? (open)
--- a/docs/architecture/proxy.md
+++ b/docs/architecture/proxy.md
@@ -14,8 +14,9 @@ injection, body size limits), and forwards it to the upstream service.
 ## Why It Exists

 This component replaces nginx's `proxy_pass` directive. For our use case —
-single upstream per domain, no load balancing, no HTTP/2 proxying — a custom
-handler is simpler and more maintainable than a general-purpose proxy library.
+one upstream per domain across multiple domains, no load balancing, no HTTP/2
+proxying — a custom handler is simpler and more maintainable than a
+general-purpose proxy library (ADR-002, ADR-010).

 ## Architecture

@@ -140,9 +141,9 @@ services typically run on the same host (e.g., `127.0.0.1:3000`). The
 `upstream_scheme` field in each site's configuration allows specifying `https://`
 for upstreams that require TLS (e.g., separate hosts or secure internal services).

-For the initial deployment (`git.alk.dev` → `127.0.0.1:3000`), the upstream
-connection uses plain HTTP, as TLS between the proxy and Gitea on loopback is
-unnecessary.
+For the initial deployment, upstream connections use plain HTTP (e.g.,
+`git.alk.dev` → `127.0.0.1:3000`, `alk.dev` → `127.0.0.1:8080`) since TLS
+between the proxy and backend services on loopback is unnecessary.

 ## Body Size Limit

@@ -157,8 +158,9 @@ All design decisions are documented as ADRs in [decisions/](decisions/).

 | ADR | Decision | Summary |
 |-----|----------|---------|
-| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | Single upstream, single domain — simpler than a general proxy library |
+| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | One upstream per domain — simpler than a general proxy library |
 | [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
+| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |

 ## Open Questions

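
A small sketch of the one-upstream-per-domain lookup that the proxy.md hunks above describe. It does not use axum; `SiteTable` and `upstream_for` are hypothetical names, and the normalization steps (dropping a numeric port suffix, lowercasing) are assumptions about how a `Host` header value would be matched against configured sites.

```rust
use std::collections::HashMap;

// Hypothetical routing table: normalized hostname -> upstream address.
pub struct SiteTable {
    by_host: HashMap<String, String>,
}

impl SiteTable {
    /// Builds the table from (host, upstream) pairs, e.g. the `[[sites]]`
    /// entries in the config.
    pub fn new(sites: &[(&str, &str)]) -> Self {
        let by_host = sites
            .iter()
            .map(|(host, upstream)| (host.to_ascii_lowercase(), upstream.to_string()))
            .collect();
        SiteTable { by_host }
    }

    /// Returns the single upstream configured for the given Host header
    /// value, or None if no site matches.
    pub fn upstream_for(&self, host_header: &str) -> Option<&str> {
        // Drop a trailing ":port" only when the suffix is purely numeric.
        let host = match host_header.rsplit_once(':') {
            Some((h, port))
                if !port.is_empty() && port.chars().all(|c| c.is_ascii_digit()) =>
            {
                h
            }
            _ => host_header,
        };
        self.by_host
            .get(&host.to_ascii_lowercase())
            .map(String::as_str)
    }
}
```

Because each hostname maps to exactly one upstream, the lookup is a plain map access; nothing here selects between backends, which matches the "no load balancing" scope in ADR-002.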
--- a/docs/architecture/tls.md
+++ b/docs/architecture/tls.md
@@ -57,10 +57,11 @@ no deploy hooks.

 **How it works:**

-1. `AcmeCertProvider` configures the ACME client with the domain, cache
+1. `AcmeCertProvider` configures the ACME client with the domain list, cache
   directory, and Let's Encrypt directory (staging or production).
-2. `AcmeConfig::new(vec![domain])` creates an ACME configuration for the
-   domain.
+2. `AcmeConfig::new(domains)` creates an ACME configuration for all listed
+   domains. Let's Encrypt will issue a single SAN certificate covering all
+   domains.
 3. The ACME state machine runs as a background tokio task, handling:
   - Account registration with Let's Encrypt
   - Certificate ordering
@@ -75,9 +76,9 @@ no deploy hooks.
 **Configuration:**

 ```toml
-[tls]
+[server.tls]
 mode = "acme"
-acme_domain = "git.alk.dev"
+acme_domains = ["git.alk.dev", "alk.dev"]
 acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
 acme_directory = "production"  # or "staging" for testing
 ```
@@ -100,13 +101,8 @@ key_path = "/etc/letsencrypt/live/git.alk.dev/privkey.pem"
 ```

 Certificate files are loaded once at startup using `rustls_pemfile`. Manual
-mode requires a restart to pick up new certificates.
-
-**Why not hot-reload manual certs?** ACME mode handles renewal automatically.
-Manual mode is for cases where you control cert rotation externally (certbot,
-manual renewal). In that case, a SIGHUP-triggered restart is simpler and more
-reliable than file watching. If zero-downtime cert rotation is needed, use ACME
-mode.
+mode requires a restart to pick up new certificates. See ADR-004 for the
+rationale behind making ACME the primary mode and manual mode restart-dependent.

 ## TLS Configuration

@@ -142,10 +138,13 @@ restrict cipher suites beyond rustls defaults.
 ### ServerConfig Construction

 For manual mode, the `ServerConfig` is built with `with_no_client_auth()` and
-`with_single_cert()`, loading the certificate chain and private key from disk.
+`with_cert_resolver()`, passing a custom `ResolvesServerCert` implementation
+that maps SNI hostnames to certificate/key pairs loaded from disk.

 For ACME mode, the `ServerConfig` is built with `with_cert_resolver()`, passing
-the `ResolvesServerCertAcme` resolver. The ACME TLS-ALPN-01 protocol identifier
+the `ResolvesServerCertAcme` resolver. The ACME configuration includes all
+domains listed in `acme_domains`, and the resolver manages a single SAN
+certificate covering all of them. The ACME TLS-ALPN-01 protocol identifier
 (`acme-tls/1`) must be registered in the `alpn_protocols` list so the server
 can respond to TLS-ALPN-01 challenges.

@@ -154,28 +153,39 @@ versions (TLS 1.2 and TLS 1.3).

 ## SNI-Based Certificate Selection

-### Current (Single Domain)
+### ACME Mode (Multi-Domain)

-For single-domain setups, SNI selection is trivial: there's only one
-certificate, so `with_single_cert()` or `ResolvesServerCertAcme` (which
-handles the domain) is sufficient.
-
-### Future (Multi-Domain)
-
-When multiple domains are served, SNI selection works as follows:
+In ACME mode, `rustls-acme` manages a single SAN certificate covering all
+configured domains. The `ResolvesServerCertAcme` resolver automatically serves
+the correct certificate during the TLS handshake.

 1. **TLS handshake**: The client sends the SNI extension indicating which
   hostname it's connecting to.
-2. **Certificate resolution**: In ACME mode, `ResolvesServerCertAcme` handles
-   this automatically — it stores certificates keyed by domain. In manual mode,
-   a custom `ResolvesServerCert` implementation maps SNI hostname to the
-   correct `CertifiedKey`.
+2. **Certificate resolution**: `ResolvesServerCertAcme` serves the
+   provisioned SAN certificate, whose Subject Alternative Names cover every
+   configured domain.
 3. **HTTP routing**: After the TLS handshake, axum's `Host` extractor routes
   the request to the correct site handler based on the `Host` header.

 This is the same pattern nginx uses — SNI selects the cert during TLS, then
-`Host` header selects the server block. In manual mode, a `ResolvesServerCert`
-implementation maps SNI hostname to the correct `CertifiedKey`.
+`Host` header selects the server block. In ACME mode, the cert resolver
+handles the certificate-selection step automatically.
+
+### Manual Mode (Multi-Domain)
+
+In manual mode, a custom `ResolvesServerCert` implementation is required to
+map SNI hostnames to the correct `CertifiedKey`. This implementation:
+
+1. Loads certificate files once at startup (new certificates require a restart)
+2. Maps each domain name to its certificate chain and private key
+3. During the TLS handshake, looks up the SNI hostname and returns the
+   matching `CertifiedKey`
+
+The custom resolver must handle an SNI hostname with no matching certificate
+by returning `None` from `resolve()`, which makes rustls abort the handshake.
+This is intentional: no default certificate is served for unknown domains.
+
+See [open-questions.md](open-questions.md) OQ-07 for per-site TLS overrides.

 ## HTTP Listener (Port 80)

@@ -211,6 +221,8 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
 |-----|----------|---------|
 | [004](decisions/004-rustls-acme.md) | ACME-primary cert management | Eliminates certbot; automatic provisioning and renewal |
 | [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly | Full control over TLS config and ACME resolver integration |
+| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
+| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme |

 ## Open Questions

@@ -218,3 +230,5 @@ Open questions are tracked in [open-questions.md](open-questions.md). Key
 questions affecting this document:

 - **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
+- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual
+  domains? (open)