Triage implementation review findings and update architecture specs

Analyzed 29 findings from the implementation review (002-implementation-review.md) and identified 8 architecture-level concerns requiring spec changes: Architecture gaps addressed: - C2: Added acme_contact field to config.md, tls.md, and operations.md. Let's Encrypt requires a contact email for production; the spec was missing this required field. - C4: Added StaticConfig drift tracking requirement to config.md reload section. ConfigReloadHandle must update its stored StaticConfig after each successful reload to prevent stale warnings. - W1: Updated shutdown sequence in operations.md to specify that server tasks should be joined (not aborted) during the drain window. - W5: Added health check path collision note to proxy.md. - W13: Clarified that access logging is always-on in operations.md. - W14: Updated X-Forwarded-Proto description in proxy.md to clarify that it is always 'https' since the HTTP listener redirects rather than proxies. New open questions added: - OQ-08: Should /health use a less common path to avoid upstream collision? - OQ-09: How should upstream_connect_timeout_secs be enforced? - OQ-10: Should ACME contact email be a required config field? - OQ-11: How should X-Forwarded-Proto be derived per-listener? - OQ-12: Should request access logging be mandatory or optional? The remaining 21 findings are implementation-level bugs, code quality issues, or Phase 2 improvements that don't require architecture spec changes.
2026-06-11 15:04:09 +00:00
parent 5478df7ab7
commit 68d27c4789
6 changed files with 135 additions and 10 deletions
--- a/docs/architecture/README.md
+++ b/docs/architecture/README.md
@@ -67,6 +67,11 @@ See [open-questions.md](open-questions.md) for the full tracker.
 | ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) |
 | ~~OQ-06~~ | ~~Should upstream timeouts be configurable per-site?~~ | ~~low~~ | **resolved** (ADR-015) |
 | ~~OQ-07~~ | ~~Should per-site TLS overrides be supported for mixed ACME/manual domains?~~ | ~~low~~ | **resolved** (ADR-019) |
 | OQ-08 | Should the `/health` path use a less common endpoint to avoid upstream collision? | medium | open |
 | OQ-09 | How should `upstream_connect_timeout_secs` be enforced? | medium | open |
 | OQ-10 | Should ACME contact email be a required config field? | high | open |
 | OQ-11 | How should `X-Forwarded-Proto` be derived per-listener? | medium | open |
 | OQ-12 | Should request access logging be mandatory or optional? | high | open |
 ## Document Lifecycle
--- a/docs/architecture/config.md
+++ b/docs/architecture/config.md
@@ -116,6 +116,7 @@ will be handled via signal-based or built-in rotation.
 | `tls.acme_domains` | `Vec<String>` | Domains for ACME SAN certificate (ACME mode only) |
 | `tls.acme_cache_dir` | `String` | ACME state cache directory |
 | `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory |
 | `tls.acme_contact` | `String` | Contact email for ACME registration (e.g., `"mailto:admin@example.com"`). Required for production; Let's Encrypt rejects registrations without a contact email. See OQ-10. |
 | `tls.cert_path` | `String` | Certificate file path (manual mode only) |
 | `tls.key_path` | `String` | Private key file path (manual mode only) |
@@ -174,6 +175,7 @@ Phase 2.
 | `listeners[].http_port` | `u16` | `80` | No |
 | `listeners[].https_port` | `u16` | `443` | No |
 | `listeners[].tls.acme_directory` | `String` | `"production"` | No |
 | `listeners[].tls.acme_contact` | `String` | — | Yes (ACME mode only) |
 | `sites[].upstream_scheme` | `String` | `"http"` | No |
 | `sites[].upstream_connect_timeout_secs` | `u64` | `5` | No |
 | `sites[].upstream_request_timeout_secs` | `u64` | `60` | No |
@@ -236,6 +238,13 @@ effect. This gives operators early feedback about config drift.
 Only the DynamicConfig portion is swapped via ArcSwap. StaticConfig changes
 require a process restart to take effect.
 **Important**: The `ConfigReloadHandle` must track the last-known StaticConfig
 so that it can correctly detect changes on subsequent reloads. After each
 successful reload, the stored StaticConfig is updated with the new value (via
 `ArcSwap<StaticConfig>` or similar interior mutability). This prevents stale
 warnings: if the same static config change is present on two consecutive
 reloads, the operator should see the warning only once, not on every reload.
 ### Reload Serialization
 Reload operations are serialized using a `tokio::sync::Mutex` on the reload
@@ -290,6 +299,7 @@ mode = "acme"
 acme_domains = ["git.alk.dev"]
 acme_cache_dir = "/var/lib/reverse-proxy/acme-cache-git"
 acme_directory = "production"
 acme_contact = "mailto:admin@alk.dev"
 [[listeners.sites]]
 host = "git.alk.dev"
@@ -347,6 +357,7 @@ mode = "acme"
 acme_domains = ["git.alk.dev", "alk.dev"]
 acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
 acme_directory = "production"
 acme_contact = "mailto:admin@alk.dev"
 [[listeners.sites]]
 host = "git.alk.dev"
@@ -391,6 +402,9 @@ On startup, the config is validated:
    `10.0.0.5` (missing port). The `upstream_scheme` field handles the protocol.
 18. `upstream_scheme` values are case-sensitive: only `"http"` or `"https"`
    (lowercase). Default is `"http"`.
 19. In ACME mode, `tls.acme_contact` must be a valid `mailto:` URI
    (e.g., `"mailto:admin@example.com"`). Let's Encrypt requires a contact
    email for production certificate requests.
 On SIGHUP reload, the same validation applies. If the new config fails
 validation, the reload is rejected and the old config remains active. An error
--- a/docs/architecture/open-questions.md
+++ b/docs/architecture/open-questions.md
@@ -87,7 +87,97 @@ last_updated: 2026-06-11
 - **Origin**: [proxy.md](proxy.md)
 - **Status**: resolved
 - **Priority**: low
- **Resolution**: Yes. Per-site upstream timeouts with sensible defaults (5s
+- **Resolution**: Resolved by ADR-015. Per-site upstream timeout overrides with
-  connect, 60s request). Optional fields in SiteConfig that override global
+  sensible defaults (5s connect, 60s request). Optional fields in SiteConfig
-  defaults when specified. See ADR-015.
+  that override global defaults when specified.
- **Cross-references**: ADR-015, ADR-017
+- **Cross-references**: ADR-015, ADR-017
 ### OQ-08: Should the `/health` path use a less common endpoint to avoid upstream collision?
 - **Origin**: Implementation review finding W5, [proxy.md](proxy.md)
 - **Status**: open
 - **Priority**: medium
 - **Resolution**: None yet. The proxy currently intercepts `GET /health` on all
  hosts before host-based routing, which means any upstream application that
  uses `/health` for its own health checks will have those requests silently
  intercepted. Options: (1) Use a less common path like `/__health` or
  `/healthz`; (2) Only intercept `/health` when the Host header doesn't match
  any known site (fallthrough); (3) Make the health check path configurable
  via `StaticConfig`. Option 1 is simplest for Phase 1. Option 3 is most
  flexible long-term. The architecture spec (proxy.md, ADR-013) currently
  specifies `/health` as a top-level route regardless of Host.
 - **Cross-references**: ADR-013
 ### OQ-09: How should `upstream_connect_timeout_secs` be enforced?
 - **Origin**: Implementation review finding W4, ADR-015, ADR-017
 - **Status**: open
 - **Priority**: medium
 - **Resolution**: None yet. The architecture (ADR-015, ADR-017) specifies a
  5-second default connect timeout separate from the request timeout, and
  `SiteConfig` includes `upstream_connect_timeout_secs`. However, the
  implementation only applies `upstream_request_timeout_secs` as a blanket
  timeout covering the entire exchange. The hyper client handles TCP connect
  internally, making a two-phase timeout harder to implement without custom
  connect logic. Need to decide: (1) implement a two-phase timeout using
  `tokio::time::timeout` for connect phase then request phase; (2) configure
  the hyper client's `connect_timeout` parameter; or (3) accept the current
  behavior for Phase 1 and add connect timeout enforcement in Phase 2.
 - **Cross-references**: ADR-015, ADR-017
 ## Configuration
 ### OQ-10: Should ACME contact email be a required config field?
 - **Origin**: Implementation review finding C2, [tls.md](tls.md), [config.md](config.md)
 - **Status**: open
 - **Priority**: high
 - **Resolution**: None yet. Let's Encrypt requires a contact email for production
  certificate requests. The current architecture spec does not include an
  `acme_contact` field in `TlsConfig` or `ListenerConfig`. Without it, ACME
  registration with Let's Encrypt production will fail. Options: (1) Add a
  required `acme_contact` field to the TLS config within each `[[listeners]]`
  entry that uses ACME mode; (2) Add a global `acme_contact` field shared
  across all ACME listeners. Per-listener is more flexible but adds config
  noise. Global is simpler for typical deployments. Need to update config.md
  and tls.md.
 - **Cross-references**: ADR-004
 ### OQ-11: How should `X-Forwarded-Proto` be derived per-listener?
 - **Origin**: Implementation review finding W14, [proxy.md](proxy.md)
 - **Status**: open
 - **Priority**: medium
 - **Resolution**: None yet. The architecture spec (proxy.md) states
  `X-Forwarded-Proto` should be "determined by which listener port received the
  request" — `https` for requests on the listener's `https_port`, `http` for
  requests on the listener's `http_port`. The implementation hardcodes
  `is_https: true` in `ProxyState`. For a TLS-terminating reverse proxy this
  is correct (all TLS connections arrive on the HTTPS port), but the HTTP
  redirect listener should set `X-Forwarded-Proto: https` since it redirects to
  HTTPS. Need to clarify: (1) The HTTPS listener always sets `X-Forwarded-Proto:
  https` (correct, since it terminates TLS); (2) The HTTP redirect listener
  sends a 301 redirect and does NOT proxy, so `X-Forwarded-Proto` on the
  redirect response is not applicable. The hardcoded behavior is correct but
  should be documented.
 - **Cross-references**: ADR-021
 ## Operations
 ### OQ-12: Should request access logging be mandatory or optional?
 - **Origin**: Implementation review finding W13, [operations.md](operations.md)
 - **Status**: open
 - **Priority**: high
 - **Resolution**: None yet. The architecture spec (operations.md) defines an
  access log format (`REQUEST client_ip=... host=... method=... path=...
  status=... upstream=... duration_ms=...`) and a `log_request!` macro, but
  the implementation does not emit access logs. Without request-level logging,
  the proxy is operationally blind — there is no observability into traffic,
  response codes, or upstream latency. This also blocks fail2ban integration
  for access-log-based jails. The question is whether to: (1) Make access
  logging mandatory (always-on at `info` level); (2) Make it configurable
  (e.g., `access_log` boolean in `LoggingConfig`); or (3) Tie it to the
  existing `log_file_path` setting. The architecture spec implies it's always
  on.
 - **Cross-references**: ADR-007
--- a/docs/architecture/operations.md
+++ b/docs/architecture/operations.md
@@ -107,7 +107,10 @@ All logs use `tracing` with structured fields. The proxy outputs two types of
 log entries:
 1. **Access logs**: Every proxied request is logged at `info` level with
-   structured fields.
+   structured fields. Access logging is **always-on** — it is the primary
   observability mechanism for the proxy and is required for fail2ban
   integration. There is no configuration option to disable access logging
   (see OQ-12).
 ```
 REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
@@ -293,7 +296,9 @@ On SIGTERM or SIGINT, the proxy performs a graceful shutdown:
 2. **Close idle keep-alive connections** — Send `Connection: close` on any idle
   connections in the keep-alive pool.
 3. **Wait for in-flight requests** — Up to `shutdown_timeout_secs` (default: 30)
-   for active requests to complete.
+   for active requests to complete. Server tasks are joined (not aborted) so
   that in-flight requests can drain normally. Only after the timeout expires
   are remaining tasks aborted.
 4. **Force-close remaining connections** — After the timeout, any remaining
   connections are forcefully closed via TCP RST.
 5. **Cancel background tasks** — ACME renewal tasks, rate limiter eviction task,
@@ -477,6 +482,7 @@ mode = "acme"
 acme_domains = ["git.example.com"]
 acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
 acme_directory = "production"
 acme_contact = "mailto:admin@example.com"
 [[listeners.sites]]
 host = "git.example.com"
--- a/docs/architecture/proxy.md
+++ b/docs/architecture/proxy.md
@@ -95,6 +95,12 @@ The `/health` path is a special case: it matches regardless of the `Host`
 header and is evaluated before host-based routing. A `GET /health` request on
 any hostname returns `200 OK` with an empty body.
 **Note**: This means any upstream application that uses `/health` for its own
 health checks will have those requests silently intercepted by the proxy and
 will never reach the upstream. If this is a concern, the health check path
 should be changed to a less common path (e.g., `/__health` or `/healthz`) or
 made configurable. See OQ-08.
 ### 2. Proxy Header Injection
 Headers are injected before forwarding. The proxy is an **edge proxy** — it
@@ -107,7 +113,7 @@ existing `X-Forwarded-For` headers from the client cannot be trusted.
 | `Host` | Original request `Host` header | Preserved as-is |
 | `X-Real-IP` | `ConnectInfo<SocketAddr>` remote IP | Set to client's IP address |
 | `X-Forwarded-For` | `ConnectInfo<SocketAddr>` remote IP | **Replaced**, not appended. The proxy is the edge proxy — there are no trusted proxies upstream, so existing `X-Forwarded-For` values from the client cannot be trusted. |
-| `X-Forwarded-Proto` | Determined by which listener port received the request | `https` for requests on the listener's `https_port`, `http` for requests on the listener's `http_port` |
+| `X-Forwarded-Proto` | Determined by which listener port received the request | `https` for requests on the listener's `https_port`, `http` for requests on the listener's `http_port`. Note: since the TLS-terminating listener only receives HTTPS connections, this is always `"https"` in practice. The HTTP redirect listener sends a 301 redirect rather than proxying, so `X-Forwarded-Proto` is not set there. See OQ-11. |
 **ConnectInfo propagation**: `ConnectInfo<SocketAddr>` is populated by
 extracting `TcpStream::peer_addr()` before wrapping the connection in
--- a/docs/architecture/tls.md
+++ b/docs/architecture/tls.md
@@ -77,16 +77,19 @@ no deploy hooks.
   listed in that listener's `acme_domains`. Let's Encrypt will issue a
   certificate covering those domains (a single SAN certificate or a
   single-domain certificate, depending on how many domains are listed).
-3. The ACME state machine runs as a background tokio task per listener,
+3. The `acme_contact` field provides a contact email address (as a `mailto:`
   URI) required by Let's Encrypt for production certificate requests. Without
   a contact email, Let's Encrypt production API returns a 400-level error.
 4. The ACME state machine runs as a background tokio task per listener,
   handling:
   - Account registration with Let's Encrypt
   - Certificate ordering
   - TLS-ALPN-01 challenge (or HTTP-01 challenge)
   - Certificate issuance
   - Certificate renewal (automatic, ~30 days before expiry)
-4. `ResolvesServerCertAcme` is a rustls `ResolvesServerCert` implementation
+5. `ResolvesServerCertAcme` is a rustls `ResolvesServerCert` implementation
   that automatically serves the ACME-provisioned certificate.
-5. When a new certificate is issued, the resolver updates atomically — no
+6. When a new certificate is issued, the resolver updates atomically — no
   restart or signal handling needed.
 **Configuration (within a `[[listeners]]` entry):**
@@ -100,6 +103,7 @@ mode = "acme"
 acme_domains = ["git.alk.dev", "alk.dev"]
 acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
 acme_directory = "production"  # or "staging" for testing
 acme_contact = "mailto:admin@alk.dev"  # Required for Let's Encrypt production
 ```
 **Cache directory:** The `DirCache` from rustls-acme persists ACME account data,