From 68d27c4789d00953a2cadeeb1dba6a4d89f58c06 Mon Sep 17 00:00:00 2001 From: "glm-5.1" Date: Thu, 11 Jun 2026 15:04:09 +0000 Subject: [PATCH] Triage implementation review findings and update architecture specs Analyzed 29 findings from the implementation review (002-implementation-review.md) and identified 8 architecture-level concerns requiring spec changes: Architecture gaps addressed: - C2: Added acme_contact field to config.md, tls.md, and operations.md. Let's Encrypt requires a contact email for production; the spec was missing this required field. - C4: Added StaticConfig drift tracking requirement to config.md reload section. ConfigReloadHandle must update its stored StaticConfig after each successful reload to prevent stale warnings. - W1: Updated shutdown sequence in operations.md to specify that server tasks should be joined (not aborted) during the drain window. - W5: Added health check path collision note to proxy.md. - W13: Clarified that access logging is always-on in operations.md. - W14: Updated X-Forwarded-Proto description in proxy.md to clarify that it is always 'https' since the HTTP listener redirects rather than proxies. New open questions added: - OQ-08: Should /health use a less common path to avoid upstream collision? - OQ-09: How should upstream_connect_timeout_secs be enforced? - OQ-10: Should ACME contact email be a required config field? - OQ-11: How should X-Forwarded-Proto be derived per-listener? - OQ-12: Should request access logging be mandatory or optional? The remaining 21 findings are implementation-level bugs, code quality issues, or Phase 2 improvements that don't require architecture spec changes. --- docs/architecture/README.md | 5 ++ docs/architecture/config.md | 14 +++++ docs/architecture/open-questions.md | 98 +++++++++++++++++++++++++++-- docs/architecture/operations.md | 10 ++- docs/architecture/proxy.md | 8 ++- docs/architecture/tls.md | 10 ++- 6 files changed, 135 insertions(+), 10 deletions(-) diff --git a/docs/architecture/README.md b/docs/architecture/README.md index 79c685e..a6fdd69 100644 --- a/docs/architecture/README.md +++ b/docs/architecture/README.md @@ -67,6 +67,11 @@ See [open-questions.md](open-questions.md) for the full tracker. | ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) | | ~~OQ-06~~ | ~~Should upstream timeouts be configurable per-site?~~ | ~~low~~ | **resolved** (ADR-015) | | ~~OQ-07~~ | ~~Should per-site TLS overrides be supported for mixed ACME/manual domains?~~ | ~~low~~ | **resolved** (ADR-019) | +| OQ-08 | Should the `/health` path use a less common endpoint to avoid upstream collision? | medium | open | +| OQ-09 | How should `upstream_connect_timeout_secs` be enforced? | medium | open | +| OQ-10 | Should ACME contact email be a required config field? | high | open | +| OQ-11 | How should `X-Forwarded-Proto` be derived per-listener? | medium | open | +| OQ-12 | Should request access logging be mandatory or optional? | high | open | ## Document Lifecycle diff --git a/docs/architecture/config.md b/docs/architecture/config.md index 38f4b52..090085b 100644 --- a/docs/architecture/config.md +++ b/docs/architecture/config.md @@ -116,6 +116,7 @@ will be handled via signal-based or built-in rotation. | `tls.acme_domains` | `Vec` | Domains for ACME SAN certificate (ACME mode only) | | `tls.acme_cache_dir` | `String` | ACME state cache directory | | `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory | +| `tls.acme_contact` | `String` | Contact email for ACME registration (e.g., `"mailto:admin@example.com"`). Required for production; Let's Encrypt rejects registrations without a contact email. See OQ-10. | | `tls.cert_path` | `String` | Certificate file path (manual mode only) | | `tls.key_path` | `String` | Private key file path (manual mode only) | @@ -174,6 +175,7 @@ Phase 2. | `listeners[].http_port` | `u16` | `80` | No | | `listeners[].https_port` | `u16` | `443` | No | | `listeners[].tls.acme_directory` | `String` | `"production"` | No | +| `listeners[].tls.acme_contact` | `String` | — | Yes (ACME mode only) | | `sites[].upstream_scheme` | `String` | `"http"` | No | | `sites[].upstream_connect_timeout_secs` | `u64` | `5` | No | | `sites[].upstream_request_timeout_secs` | `u64` | `60` | No | @@ -236,6 +238,13 @@ effect. This gives operators early feedback about config drift. Only the DynamicConfig portion is swapped via ArcSwap. StaticConfig changes require a process restart to take effect. +**Important**: The `ConfigReloadHandle` must track the last-known StaticConfig +so that it can correctly detect changes on subsequent reloads. After each +successful reload, the stored StaticConfig is updated with the new value (via +`ArcSwap` or similar interior mutability). This prevents stale +warnings: if the same static config change is present on two consecutive +reloads, the operator should see the warning only once, not on every reload. + ### Reload Serialization Reload operations are serialized using a `tokio::sync::Mutex` on the reload @@ -290,6 +299,7 @@ mode = "acme" acme_domains = ["git.alk.dev"] acme_cache_dir = "/var/lib/reverse-proxy/acme-cache-git" acme_directory = "production" +acme_contact = "mailto:admin@alk.dev" [[listeners.sites]] host = "git.alk.dev" @@ -347,6 +357,7 @@ mode = "acme" acme_domains = ["git.alk.dev", "alk.dev"] acme_cache_dir = "/var/lib/reverse-proxy/acme-cache" acme_directory = "production" +acme_contact = "mailto:admin@alk.dev" [[listeners.sites]] host = "git.alk.dev" @@ -391,6 +402,9 @@ On startup, the config is validated: `10.0.0.5` (missing port). The `upstream_scheme` field handles the protocol. 18. `upstream_scheme` values are case-sensitive: only `"http"` or `"https"` (lowercase). Default is `"http"`. +19. In ACME mode, `tls.acme_contact` must be a valid `mailto:` URI + (e.g., `"mailto:admin@example.com"`). Let's Encrypt requires a contact + email for production certificate requests. On SIGHUP reload, the same validation applies. If the new config fails validation, the reload is rejected and the old config remains active. An error diff --git a/docs/architecture/open-questions.md b/docs/architecture/open-questions.md index b55c844..6402e4f 100644 --- a/docs/architecture/open-questions.md +++ b/docs/architecture/open-questions.md @@ -87,7 +87,97 @@ last_updated: 2026-06-11 - **Origin**: [proxy.md](proxy.md) - **Status**: resolved - **Priority**: low -- **Resolution**: Yes. Per-site upstream timeouts with sensible defaults (5s - connect, 60s request). Optional fields in SiteConfig that override global - defaults when specified. See ADR-015. -- **Cross-references**: ADR-015, ADR-017 \ No newline at end of file +- **Resolution**: Resolved by ADR-015. Per-site upstream timeout overrides with + sensible defaults (5s connect, 60s request). Optional fields in SiteConfig + that override global defaults when specified. +- **Cross-references**: ADR-015, ADR-017 + +### OQ-08: Should the `/health` path use a less common endpoint to avoid upstream collision? + +- **Origin**: Implementation review finding W5, [proxy.md](proxy.md) +- **Status**: open +- **Priority**: medium +- **Resolution**: None yet. The proxy currently intercepts `GET /health` on all + hosts before host-based routing, which means any upstream application that + uses `/health` for its own health checks will have those requests silently + intercepted. Options: (1) Use a less common path like `/__health` or + `/healthz`; (2) Only intercept `/health` when the Host header doesn't match + any known site (fallthrough); (3) Make the health check path configurable + via `StaticConfig`. Option 1 is simplest for Phase 1. Option 3 is most + flexible long-term. The architecture spec (proxy.md, ADR-013) currently + specifies `/health` as a top-level route regardless of Host. +- **Cross-references**: ADR-013 + +### OQ-09: How should `upstream_connect_timeout_secs` be enforced? + +- **Origin**: Implementation review finding W4, ADR-015, ADR-017 +- **Status**: open +- **Priority**: medium +- **Resolution**: None yet. The architecture (ADR-015, ADR-017) specifies a + 5-second default connect timeout separate from the request timeout, and + `SiteConfig` includes `upstream_connect_timeout_secs`. However, the + implementation only applies `upstream_request_timeout_secs` as a blanket + timeout covering the entire exchange. The hyper client handles TCP connect + internally, making a two-phase timeout harder to implement without custom + connect logic. Need to decide: (1) implement a two-phase timeout using + `tokio::time::timeout` for connect phase then request phase; (2) configure + the hyper client's `connect_timeout` parameter; or (3) accept the current + behavior for Phase 1 and add connect timeout enforcement in Phase 2. +- **Cross-references**: ADR-015, ADR-017 + +## Configuration + +### OQ-10: Should ACME contact email be a required config field? + +- **Origin**: Implementation review finding C2, [tls.md](tls.md), [config.md](config.md) +- **Status**: open +- **Priority**: high +- **Resolution**: None yet. Let's Encrypt requires a contact email for production + certificate requests. The current architecture spec does not include an + `acme_contact` field in `TlsConfig` or `ListenerConfig`. Without it, ACME + registration with Let's Encrypt production will fail. Options: (1) Add a + required `acme_contact` field to the TLS config within each `[[listeners]]` + entry that uses ACME mode; (2) Add a global `acme_contact` field shared + across all ACME listeners. Per-listener is more flexible but adds config + noise. Global is simpler for typical deployments. Need to update config.md + and tls.md. +- **Cross-references**: ADR-004 + +### OQ-11: How should `X-Forwarded-Proto` be derived per-listener? + +- **Origin**: Implementation review finding W14, [proxy.md](proxy.md) +- **Status**: open +- **Priority**: medium +- **Resolution**: None yet. The architecture spec (proxy.md) states + `X-Forwarded-Proto` should be "determined by which listener port received the + request" — `https` for requests on the listener's `https_port`, `http` for + requests on the listener's `http_port`. The implementation hardcodes + `is_https: true` in `ProxyState`. For a TLS-terminating reverse proxy this + is correct (all TLS connections arrive on the HTTPS port), but the HTTP + redirect listener should set `X-Forwarded-Proto: https` since it redirects to + HTTPS. Need to clarify: (1) The HTTPS listener always sets `X-Forwarded-Proto: + https` (correct, since it terminates TLS); (2) The HTTP redirect listener + sends a 301 redirect and does NOT proxy, so `X-Forwarded-Proto` on the + redirect response is not applicable. The hardcoded behavior is correct but + should be documented. +- **Cross-references**: ADR-021 + +## Operations + +### OQ-12: Should request access logging be mandatory or optional? + +- **Origin**: Implementation review finding W13, [operations.md](operations.md) +- **Status**: open +- **Priority**: high +- **Resolution**: None yet. The architecture spec (operations.md) defines an + access log format (`REQUEST client_ip=... host=... method=... path=... + status=... upstream=... duration_ms=...`) and a `log_request!` macro, but + the implementation does not emit access logs. Without request-level logging, + the proxy is operationally blind — there is no observability into traffic, + response codes, or upstream latency. This also blocks fail2ban integration + for access-log-based jails. The question is whether to: (1) Make access + logging mandatory (always-on at `info` level); (2) Make it configurable + (e.g., `access_log` boolean in `LoggingConfig`); or (3) Tie it to the + existing `log_file_path` setting. The architecture spec implies it's always + on. +- **Cross-references**: ADR-007 \ No newline at end of file diff --git a/docs/architecture/operations.md b/docs/architecture/operations.md index 2d265b7..b6f33d7 100644 --- a/docs/architecture/operations.md +++ b/docs/architecture/operations.md @@ -107,7 +107,10 @@ All logs use `tracing` with structured fields. The proxy outputs two types of log entries: 1. **Access logs**: Every proxied request is logged at `info` level with - structured fields. + structured fields. Access logging is **always-on** — it is the primary + observability mechanism for the proxy and is required for fail2ban + integration. There is no configuration option to disable access logging + (see OQ-12). ``` REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45 @@ -293,7 +296,9 @@ On SIGTERM or SIGINT, the proxy performs a graceful shutdown: 2. **Close idle keep-alive connections** — Send `Connection: close` on any idle connections in the keep-alive pool. 3. **Wait for in-flight requests** — Up to `shutdown_timeout_secs` (default: 30) - for active requests to complete. + for active requests to complete. Server tasks are joined (not aborted) so + that in-flight requests can drain normally. Only after the timeout expires + are remaining tasks aborted. 4. **Force-close remaining connections** — After the timeout, any remaining connections are forcefully closed via TCP RST. 5. **Cancel background tasks** — ACME renewal tasks, rate limiter eviction task, @@ -477,6 +482,7 @@ mode = "acme" acme_domains = ["git.example.com"] acme_cache_dir = "/var/lib/reverse-proxy/acme-cache" acme_directory = "production" +acme_contact = "mailto:admin@example.com" [[listeners.sites]] host = "git.example.com" diff --git a/docs/architecture/proxy.md b/docs/architecture/proxy.md index 1f067ff..f5f743b 100644 --- a/docs/architecture/proxy.md +++ b/docs/architecture/proxy.md @@ -95,6 +95,12 @@ The `/health` path is a special case: it matches regardless of the `Host` header and is evaluated before host-based routing. A `GET /health` request on any hostname returns `200 OK` with an empty body. +**Note**: This means any upstream application that uses `/health` for its own +health checks will have those requests silently intercepted by the proxy and +will never reach the upstream. If this is a concern, the health check path +should be changed to a less common path (e.g., `/__health` or `/healthz`) or +made configurable. See OQ-08. + ### 2. Proxy Header Injection Headers are injected before forwarding. The proxy is an **edge proxy** — it @@ -107,7 +113,7 @@ existing `X-Forwarded-For` headers from the client cannot be trusted. | `Host` | Original request `Host` header | Preserved as-is | | `X-Real-IP` | `ConnectInfo` remote IP | Set to client's IP address | | `X-Forwarded-For` | `ConnectInfo` remote IP | **Replaced**, not appended. The proxy is the edge proxy — there are no trusted proxies upstream, so existing `X-Forwarded-For` values from the client cannot be trusted. | -| `X-Forwarded-Proto` | Determined by which listener port received the request | `https` for requests on the listener's `https_port`, `http` for requests on the listener's `http_port` | +| `X-Forwarded-Proto` | Determined by which listener port received the request | `https` for requests on the listener's `https_port`, `http` for requests on the listener's `http_port`. Note: since the TLS-terminating listener only receives HTTPS connections, this is always `"https"` in practice. The HTTP redirect listener sends a 301 redirect rather than proxying, so `X-Forwarded-Proto` is not set there. See OQ-11. | **ConnectInfo propagation**: `ConnectInfo` is populated by extracting `TcpStream::peer_addr()` before wrapping the connection in diff --git a/docs/architecture/tls.md b/docs/architecture/tls.md index 221f847..04f3570 100644 --- a/docs/architecture/tls.md +++ b/docs/architecture/tls.md @@ -77,16 +77,19 @@ no deploy hooks. listed in that listener's `acme_domains`. Let's Encrypt will issue a certificate covering those domains (a single SAN certificate or a single-domain certificate, depending on how many domains are listed). -3. The ACME state machine runs as a background tokio task per listener, +3. The `acme_contact` field provides a contact email address (as a `mailto:` + URI) required by Let's Encrypt for production certificate requests. Without + a contact email, Let's Encrypt production API returns a 400-level error. +4. The ACME state machine runs as a background tokio task per listener, handling: - Account registration with Let's Encrypt - Certificate ordering - TLS-ALPN-01 challenge (or HTTP-01 challenge) - Certificate issuance - Certificate renewal (automatic, ~30 days before expiry) -4. `ResolvesServerCertAcme` is a rustls `ResolvesServerCert` implementation +5. `ResolvesServerCertAcme` is a rustls `ResolvesServerCert` implementation that automatically serves the ACME-provisioned certificate. -5. When a new certificate is issued, the resolver updates atomically — no +6. When a new certificate is issued, the resolver updates atomically — no restart or signal handling needed. **Configuration (within a `[[listeners]]` entry):** @@ -100,6 +103,7 @@ mode = "acme" acme_domains = ["git.alk.dev", "alk.dev"] acme_cache_dir = "/var/lib/reverse-proxy/acme-cache" acme_directory = "production" # or "staging" for testing +acme_contact = "mailto:admin@alk.dev" # Required for Let's Encrypt production ``` **Cache directory:** The `DirCache` from rustls-acme persists ACME account data,