Triage implementation review findings and update architecture specs

Analyzed 29 findings from the implementation review
(002-implementation-review.md) and identified 8 architecture-level concerns
requiring spec changes.

Architecture gaps addressed:
- C2: Added acme_contact field to config.md, tls.md, and operations.md.
  Let's Encrypt requires a contact email for production; the spec was
  missing this required field.
- C4: Added StaticConfig drift tracking requirement to config.md reload
  section. ConfigReloadHandle must update its stored StaticConfig after
  each successful reload to prevent stale warnings.
- W1: Updated shutdown sequence in operations.md to specify that server
  tasks should be joined (not aborted) during the drain window.
- W5: Added health check path collision note to proxy.md.
- W13: Clarified that access logging is always-on in operations.md.
- W14: Updated X-Forwarded-Proto description in proxy.md to clarify that it
  is always 'https' since the HTTP listener redirects rather than proxies.

New open questions added:
- OQ-08: Should /health use a less common path to avoid upstream collision?
- OQ-09: How should upstream_connect_timeout_secs be enforced?
- OQ-10: Should ACME contact email be a required config field?
- OQ-11: How should X-Forwarded-Proto be derived per-listener?
- OQ-12: Should request access logging be mandatory or optional?

The remaining 21 findings are implementation-level bugs, code quality
issues, or Phase 2 improvements that don't require architecture spec changes.
2026-06-11 15:04:09 +00:00
parent 5478df7ab7
commit 68d27c4789
6 changed files with 135 additions and 10 deletions
--- a/docs/architecture/open-questions.md
+++ b/docs/architecture/open-questions.md
@@ -87,7 +87,97 @@ last_updated: 2026-06-11
 - **Origin**: [proxy.md](proxy.md)
 - **Status**: resolved
 - **Priority**: low
- **Resolution**: Yes. Per-site upstream timeouts with sensible defaults (5s
-  connect, 60s request). Optional fields in SiteConfig that override global
-  defaults when specified. See ADR-015.
- **Cross-references**: ADR-015, ADR-017
+- **Resolution**: Resolved by ADR-015. Per-site upstream timeout overrides with
+  sensible defaults (5s connect, 60s request). Optional fields in SiteConfig
+  that override global defaults when specified.
+- **Cross-references**: ADR-015, ADR-017
+
+### OQ-08: Should the `/health` path use a less common endpoint to avoid upstream collision?
+
+- **Origin**: Implementation review finding W5, [proxy.md](proxy.md)
+- **Status**: open
+- **Priority**: medium
+- **Resolution**: None yet. The proxy currently intercepts `GET /health` on all
+  hosts before host-based routing, which means any upstream application that
+  uses `/health` for its own health checks will have those requests silently
+  intercepted. Options: (1) Use a less common path like `/__health` or
+  `/healthz`; (2) Only intercept `/health` when the Host header doesn't match
+  any known site (fallthrough); (3) Make the health check path configurable
+  via `StaticConfig`. Option 1 is simplest for Phase 1. Option 3 is most
+  flexible long-term. The architecture spec (proxy.md, ADR-013) currently
+  specifies `/health` as a top-level route regardless of Host.
+- **Cross-references**: ADR-013
+
+### OQ-09: How should `upstream_connect_timeout_secs` be enforced?
+
+- **Origin**: Implementation review finding W4, ADR-015, ADR-017
+- **Status**: open
+- **Priority**: medium
+- **Resolution**: None yet. The architecture (ADR-015, ADR-017) specifies a
+  5-second default connect timeout separate from the request timeout, and
+  `SiteConfig` includes `upstream_connect_timeout_secs`. However, the
+  implementation only applies `upstream_request_timeout_secs` as a blanket
+  timeout covering the entire exchange. The hyper client handles TCP connect
+  internally, making a two-phase timeout harder to implement without custom
+  connect logic. Need to decide: (1) implement a two-phase timeout using
+  `tokio::time::timeout` for connect phase then request phase; (2) configure
+  the hyper client's `connect_timeout` parameter; or (3) accept the current
+  behavior for Phase 1 and add connect timeout enforcement in Phase 2.
+- **Cross-references**: ADR-015, ADR-017
+
+## Configuration
+
+### OQ-10: Should ACME contact email be a required config field?
+
+- **Origin**: Implementation review finding C2, [tls.md](tls.md), [config.md](config.md)
+- **Status**: open
+- **Priority**: high
+- **Resolution**: None yet. Let's Encrypt requires a contact email for production
+  certificate requests. The current architecture spec does not include an
+  `acme_contact` field in `TlsConfig` or `ListenerConfig`. Without it, ACME
+  registration with Let's Encrypt production will fail. Options: (1) Add a
+  required `acme_contact` field to the TLS config within each `[[listeners]]`
+  entry that uses ACME mode; (2) Add a global `acme_contact` field shared
+  across all ACME listeners. Per-listener is more flexible but adds config
+  noise. Global is simpler for typical deployments. Need to update config.md
+  and tls.md.
+- **Cross-references**: ADR-004
+
+### OQ-11: How should `X-Forwarded-Proto` be derived per-listener?
+
+- **Origin**: Implementation review finding W14, [proxy.md](proxy.md)
+- **Status**: open
+- **Priority**: medium
+- **Resolution**: None yet. The architecture spec (proxy.md) states
+  `X-Forwarded-Proto` should be "determined by which listener port received the
+  request" — `https` for requests on the listener's `https_port`, `http` for
+  requests on the listener's `http_port`. The implementation hardcodes
+  `is_https: true` in `ProxyState`. For a TLS-terminating reverse proxy this
+  is correct: every proxied request arrives on the HTTPS port, and the HTTP
+  listener only redirects to HTTPS, so it never forwards a request upstream.
+  Need to clarify: (1) The HTTPS listener always sets `X-Forwarded-Proto:
+  https` (correct, since it terminates TLS); (2) The HTTP redirect listener
+  sends a 301 redirect and does NOT proxy, so `X-Forwarded-Proto` on the
+  redirect response is not applicable. The hardcoded behavior is correct but
+  should be documented.
+- **Cross-references**: ADR-021
+
+## Operations
+
+### OQ-12: Should request access logging be mandatory or optional?
+
+- **Origin**: Implementation review finding W13, [operations.md](operations.md)
+- **Status**: open
+- **Priority**: high
+- **Resolution**: None yet. The architecture spec (operations.md) defines an
+  access log format (`REQUEST client_ip=... host=... method=... path=...
+  status=... upstream=... duration_ms=...`) and a `log_request!` macro, but
+  the implementation does not emit access logs. Without request-level logging,
+  the proxy is operationally blind — there is no observability into traffic,
+  response codes, or upstream latency. This also blocks fail2ban integration
+  for access-log-based jails. The question is whether to: (1) Make access
+  logging mandatory (always-on at `info` level); (2) Make it configurable
+  (e.g., `access_log` boolean in `LoggingConfig`); or (3) Tie it to the
+  existing `log_file_path` setting. The architecture spec implies it's always
+  on.
+- **Cross-references**: ADR-007
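
Reviewer sketch for OQ-12: the access log line quoted above could be produced roughly as follows. This is a minimal illustration, not the project's `log_request!` macro. The `AccessLogEntry` struct, `format_access_log`, and `access_log_line` are hypothetical names; only the `REQUEST key=value` field order and the idea of an `access_log` boolean come from the spec text.

```rust
use std::net::IpAddr;
use std::time::Duration;

/// Hypothetical record of one proxied request. Field names mirror the keys
/// in the access log format quoted in OQ-12.
pub struct AccessLogEntry<'a> {
    pub client_ip: IpAddr,
    pub host: &'a str,
    pub method: &'a str,
    pub path: &'a str,
    pub status: u16,
    pub upstream: &'a str,
    pub duration: Duration,
}

/// Render the entry as a single `REQUEST key=value ...` line.
pub fn format_access_log(e: &AccessLogEntry<'_>) -> String {
    format!(
        "REQUEST client_ip={} host={} method={} path={} status={} upstream={} duration_ms={}",
        e.client_ip,
        e.host,
        e.method,
        e.path,
        e.status,
        e.upstream,
        e.duration.as_millis()
    )
}

/// Option (2) from OQ-12: gate emission on a boolean such as a hypothetical
/// `access_log` setting. Option (1), always-on, is the `enabled = true` case.
pub fn access_log_line(enabled: bool, e: &AccessLogEntry<'_>) -> Option<String> {
    if enabled {
        Some(format_access_log(e))
    } else {
        None
    }
}
```

Under option (1) the caller would pass `true` unconditionally and hand the resulting line to the logger at `info` level. Under option (2) the flag would be read from config, and a `None` result means nothing is emitted.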