Decompose implementation review fixes into 14 atomic tasks with post-fix review

Break down findings from review #002 into dependency-ordered fix tasks: Critical/High: - fix/acme-contact-and-challenge (C1+C2): Add acme_contact field, wire to ACME, remove unused challenge_config, add validation rule 19 - fix/remove-health-and-hardcode-https (W5+W14+ADR-022): Remove /health from main listener, hardcode X-Forwarded-Proto to https - fix/config-reload-static-drift (C4): Use ArcSwap<StaticConfig> so reload diffs against last config, not startup config - fix/access-logging (W13): Wire up log_request! macro for every proxied request with client_ip, host, method, path, status, upstream, duration_ms Medium: - fix/graceful-shutdown (W1+W7): Join HTTPS tasks with timeout instead of abort, add shutdown signal to admin socket and eviction task - fix/connect-timeout (W4): Wire upstream_connect_timeout_secs to enforce separate connect timeout Low/Independent: - fix/token-bucket-nanosecond (W6): Use as_nanos() instead of as_millis() - fix/normalize-host-ipv6 (S3): Handle IPv6 bracket notation in normalize_host - fix/http-port-validation (S1): Validate http_port in range 0 or 1-65535 - fix/integration-test-toml (S10): Fix double-nested listeners.listeners.sites - fix/logging-test-global-subscriber (W9): Use try_init() to avoid test conflicts - fix/fragile-error-detection (W3): Add typed error matching or documented string match - fix/add-code-comments (C3,W8,W10,W11,S9): Document correct-but-non-obvious behaviors - fix/request-timeout-scope (S8): Document full-request timeout scope - fix/clean-dead-code (S4+S2): Remove dead_code annotations, add #[non_exhaustive] Review gate: - review/post-fix-review: Verify all fixes against architecture spec
2026-06-12 04:08:45 +00:00
parent fe1ae6c05e
commit f9d7b8112b
16 changed files with 1074 additions and 0 deletions
--- a/tasks/fix/access-logging.md
+++ b/tasks/fix/access-logging.md
@@ -0,0 +1,77 @@
 ---
 id: fix/access-logging
 name: Wire up request access logging in the proxy handler
 status: pending
 depends_on: []
 scope: moderate
 risk: medium
 impact: component
 level: implementation
 review_findings: [W13]
 ---
 ## Description
 The `log_request!` macro exists in `src/logging/format.rs` but is never called anywhere in the codebase. The architecture spec (operations.md) states: "Access logging is **always-on** — it is the primary observability mechanism for the proxy and is required for fail2ban integration. There is no configuration option to disable access logging."
 Every proxied request must produce an access log line in the format:
 ```
 REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
 ```
 Additionally, upstream errors must produce `UPSTREAM_ERROR` log lines and rate-limited requests already produce `RATE_LIMIT` lines (those work correctly).
 ### Changes Required
 **`src/proxy/handler.rs`**:
 - Add `std::time::Instant` tracking at the start of `proxy_handler`
 - Call `log_request!` macro on every proxied request (success path)
 - Include: `client_ip`, `host`, `method`, `path`, `status`, `upstream`, `duration_ms`
 - Call `log_upstream_error!` on upstream connection failures and bad gateway errors
 - The `duration_ms` should measure from request entry to response sent
 **`src/proxy/mod.rs`**:
 - Ensure `log_request!` and `log_upstream_error!` macros are accessible (they should be via `crate::log_request!`)
 ### Log Format Details
 From operations.md and the existing macro definitions:
 **Access log** (every proxied request):
 ```
 REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
 ```
 **Upstream error** (connection refused, timeout, etc.):
 ```
 UPSTREAM_ERROR host=git.alk.dev upstream=127.0.0.1:3000 error="connection refused"
 ```
 The `log_upstream_error!` macro should be called in the error branches of `proxy_handler` where upstream connections fail or time out.
 ## Acceptance Criteria
 - [ ] `log_request!` is called for every successfully proxied request
 - [ ] `log_request!` is called for proxied requests that receive non-2xx upstream responses (4xx/5xx from upstream are still logged as access logs with the upstream status code)
 - [ ] `log_upstream_error!` is called when upstream is unreachable (502)
 - [ ] `log_upstream_error!` is called when upstream times out (504)
 - [ ] `duration_ms` accurately measures request-to-response time
 - [ ] Log format matches the `REQUEST` prefix format with `key=value` pairs
 - [ ] Access logging is always-on — no configuration to disable it
 - [ ] Existing tests pass
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/architecture/operations.md — logging section, access log format
 - docs/reviews/002-implementation-review.md — W13 finding
 - src/logging/format.rs — log_request!, log_upstream_error! macros
 - docs/architecture/decisions/007-custom-log-format.md — log format rationale
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/fix/acme-contact-and-challenge.md
+++ b/tasks/fix/acme-contact-and-challenge.md
@@ -0,0 +1,78 @@
 ---
 id: fix/acme-contact-and-challenge
 name: Fix ACME contact email wiring and remove unused challenge config
 status: pending
 depends_on: []
 scope: moderate
 risk: high
 impact: component
 level: implementation
 review_findings: [C1, C2]
 ---
 ## Description
 Fix two related ACME bugs that prevent Let's Encrypt certificate provisioning from working:
 1. **C2**: The `acme_contact` field is defined in the architecture (`TlsConfig.acme_contact` in config.md) but missing from the `TlsConfig` struct in `static_config.rs`. The ACME setup in `acceptor.rs` always passes `contact: vec![]`, which will cause production Let's Encrypt registration to fail with a 400-level error.
 2. **C1**: When `TlsMode::Acme` is matched in `main.rs`, the `challenge_config` and `resolver` fields are destructured with `_` (discarded). The separate `challenge_config` is unnecessary — the `rustls-acme` crate's `ResolvesServerCertAcme` resolver handles TLS-ALPN-01 challenges automatically on the main listener when `acme-tls/1` is included in the ALPN list (which `build_acme_server_config` already does at line 28). The separate `challenge_config` and `build_acme_challenge_config` function should be removed.
 ### Changes Required
 **`src/config/static_config.rs`**:
 - Add `acme_contact: String` field to `TlsConfig` struct (with `#[serde(default)]`)
 - Update TOML deserialization tests to include `acme_contact`
 **`src/config/validation.rs`**:
 - Add validation rule 19: In ACME mode, `tls.acme_contact` must be a non-empty string starting with `mailto:`
 - Add `ValidationError::AcmeContactInvalid` variant
 **`src/tls/acceptor.rs`**:
 - Pass `tls_config.acme_contact` to `AcmeTlsConfig.contact` instead of `vec![]`
 - Remove `build_acme_challenge_config` function (dead code after C1 fix)
 - Remove `challenge_config` field from `TlsMode::Acme` variant
 - Update `setup_tls` to only return `default_config` and `resolver` (no `challenge_config`)
 **`src/main.rs`**:
 - Update `TlsMode::Acme` destructuring to remove `challenge_config: _`
 - Remove any references to the discarded fields
 **`src/tls/acme.rs`**:
 - Ensure `AcmeTlsConfig.contact` is used properly (it already has `contact: Vec<String>` field and `self.contact.iter().map(|c| c.as_str())` in `setup()` — the struct is fine, it just needs to receive the actual contact from config)
 **Tests**:
 - Update `static_config.rs` tests for new `acme_contact` field
 - Update `validation.rs` tests for new ACME contact validation
 - Update `acceptor.rs` tests for removed `challenge_config`
 - Verify ACME setup still creates resolver correctly
 ## Acceptance Criteria
 - [ ] `acme_contact` field exists on `TlsConfig` and deserializes from TOML
 - [ ] Validation rejects ACME mode without `acme_contact` (or with non-`mailto:` value)
 - [ ] `AcmeTlsConfig.contact` receives the configured `acme_contact` value (not empty vec)
 - [ ] `build_acme_challenge_config` function is removed
 - [ ] `TlsMode::Acme` variant no longer has `challenge_config` field
 - [ ] `main.rs` no longer discards ACME resolver or challenge config
 - [ ] ACME TLS-ALPN-01 challenge handling still works via main listener's ALPN protocol list
 - [ ] All existing tests pass
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/architecture/config.md — `acme_contact` field definition, validation rule 19
 - docs/architecture/tls.md — ACME mode, contact email
 - docs/reviews/002-implementation-review.md — C1, C2 findings
 - src/tls/acceptor.rs — current ACME setup code
 - src/tls/acme.rs — AcmeTlsConfig struct
 - src/config/static_config.rs — TlsConfig struct
 - src/config/validation.rs — validation rules
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/fix/add-code-comments.md
+++ b/tasks/fix/add-code-comments.md
@@ -0,0 +1,87 @@
 ---
 id: fix/add-code-comments
 name: Add clarifying code comments for correct-but-non-obvious behaviors
 status: pending
 depends_on: [fix/remove-health-and-hardcode-https]
 scope: narrow
 risk: trivial
 impact: component
 level: implementation
 review_findings: [C3, W8, W10, W11, S9]
 ---
 ## Description
 Several review findings identified behaviors that are correct but need clarifying comments to prevent future "fixes" that would change correct behavior. This task adds documentation comments to these code locations.
 ### Comments to Add
 **C3 — X-Forwarded-For replace semantics** (`src/proxy/headers.rs`):
 Add a comment above `headers.insert(HeaderName::from_static("x-forwarded-for"), ip_value)`:
 ```rust
 // X-Forwarded-For is SET (not appended) because this proxy is the outermost
 // edge proxy. Any existing X-Forwarded-For from the client is untrusted and
 // must be replaced with the actual client IP from ConnectInfo.
 // See ADR-021 for the edge proxy model rationale.
 ```
 **W8 — Server header removal** (`src/proxy/handler.rs`):
 Add a comment above `parts.headers.remove("server")`:
 ```rust
 // The upstream Server header is intentionally removed. As a security-focused
 // reverse proxy, we hide upstream server identity as a defense-in-depth measure.
 // The proxy does not add its own Server header either. See W8 in review #002.
 ```
 **W10 — Body limit two-layer defense** (`src/proxy/body_limit.rs`):
 Add a comment above the Content-Length check:
 ```rust
 // Early rejection: if Content-Length is present and exceeds the limit, reject
 // immediately without reading the body. For requests without Content-Length
 // (chunked, HTTP/2), the Limited body wrapper below enforces the limit during
 // streaming. This is a two-layer defense.
 ```
 **W11 — Health check port conflict conservatism** (`src/config/validation.rs`):
 Add a comment above the health check port conflict check:
 ```rust
 // Health check always binds to 127.0.0.1 (hardcoded in src/health.rs), so this
 // conflict check is conservative — it warns even when the health check port
 // wouldn't actually conflict (e.g., health check on 127.0.0.1:80 vs listener
 // on 203.0.113.10:80). This is acceptable for Phase 1.
 ```
 **S9 — Rate limit bypass when no IP** (`src/rate_limit/mod.rs`):
 Add a comment above the `return next.run(req).await` in the `None` branch:
 ```rust
 // If no client IP can be identified, the request passes through without rate
 // limiting. In practice, ConnectInfo is always set by the server's
 // ConnectInfoService, so this branch is unreachable. If the proxy were ever
 // deployed without ConnectInfo propagation, rate limiting would silently become
 // a no-op. Consider adding a warning log or returning 429 in a future phase.
 ```
 ## Acceptance Criteria
 - [ ] All five comments are added at the specified locations
 - [ ] Comments reference the relevant ADRs or review findings where applicable
 - [ ] Code behavior is unchanged (comments only)
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/reviews/002-implementation-review.md — C3, W8, W10, W11, S9 findings
 - docs/architecture/decisions/021-x-forwarded-for-edge-proxy.md — ADR-021
 - src/proxy/headers.rs — X-Forwarded-For insert
 - src/proxy/handler.rs — Server header removal
 - src/proxy/body_limit.rs — Content-Length check
 - src/config/validation.rs — Health check port conflict
 - src/rate_limit/mod.rs — No IP fallback
 ## Notes
 > This task depends on fix/remove-health-and-hardcode-https because the X-Forwarded-Proto comment (W14) is being addressed in that task. The remaining comments here are independent.
 ## Summary
 > To be filled on completion
--- a/tasks/fix/clean-dead-code.md
+++ b/tasks/fix/clean-dead-code.md
@@ -0,0 +1,60 @@
 ---
 id: fix/clean-dead-code
 name: Remove dead_code annotations and add #[non_exhaustive] to public enums
 status: pending
 depends_on: [fix/acme-contact-and-challenge]
 scope: narrow
 risk: low
 impact: component
 level: implementation
 review_findings: [S4, S2]
 ---
 ## Description
 Many public items are annotated with `#[allow(dead_code)]` because they're defined but not yet used by the binary crate. After the ACME contact and challenge fix (which wires up previously unused ACME code), most of these annotations should be removable.
 Additionally, public enums (`TlsMode`, `ProxyError`, `AdminSocketError`, `ValidationError`) should have `#[non_exhaustive]` to allow future expansion without breaking changes.
 ### Changes Required
 **Remove `#[allow(dead_code)]` annotations**:
 - After `fix/acme-contact-and-challenge` is complete, `build_acme_challenge_config` and `TlsMode::Acme.challenge_config` will be removed entirely (not just un-annotated)
 - Run `cargo check` and remove `#[allow(dead_code)]` annotations from items that are now used
 - Items that are still genuinely unused after the ACME fix should keep the annotation but with a TODO comment explaining why
 **Add `#[non_exhaustive]` to public enums**:
 - `src/tls/acceptor.rs:49` — `TlsMode` enum (may gain modes like `"letsencrypt"` or `"auto"`)
 - `src/proxy/error.rs:5` — `ProxyError` enum (may gain `UpstreamTls` error handling)
 - `src/admin/socket.rs:15` — `AdminSocketError` enum (may gain new error variants)
 - `src/config/validation.rs:10` — `ValidationError` enum (new validation rules may be added)
 ### Files to Check
 - `src/tls/acceptor.rs` — lines 14, 33, 48, 58
 - `src/tls/acme.rs` — lines 9, 11, 15, 23, 55
 - `src/config/static_config.rs` — lines 4, 31, 44, 49, 54, 56, 70, 76, 86, 91
 ## Acceptance Criteria
 - [ ] `#[allow(dead_code)]` annotations removed from items that are now used
 - [ ] Remaining `#[allow(dead_code)]` annotations have TODO comments explaining why
 - [ ] `#[non_exhaustive]` added to `TlsMode`, `ProxyError`, `AdminSocketError`, `ValidationError`
 - [ ] `cargo check` passes (no unused code warnings for annotated items)
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/reviews/002-implementation-review.md — S4, S2 findings
 - src/tls/acceptor.rs — TlsMode enum
 - src/proxy/error.rs — ProxyError enum
 - src/admin/socket.rs — AdminSocketError enum
 - src/config/validation.rs — ValidationError enum
 ## Notes
 > This task depends on fix/acme-contact-and-challenge because the ACME fix removes some dead code (challenge_config) and wires up other code that was previously unused.
 ## Summary
 > To be filled on completion
--- a/tasks/fix/config-reload-static-drift.md
+++ b/tasks/fix/config-reload-static-drift.md
@@ -0,0 +1,57 @@
 ---
 id: fix/config-reload-static-drift
 name: Fix ConfigReloadHandle static config drift causing stale diff warnings
 status: pending
 depends_on: []
 scope: narrow
 risk: medium
 impact: component
 level: implementation
 review_findings: [C4]
 ---
 ## Description
 Fix the `ConfigReloadHandle::reload()` method which computes a diff of static config fields but never updates the stored `StaticConfig`. This means every reload compares against the original static config (from process start), not the last-reloaded one, causing repeated "static config fields changed" warnings for the same fields on every reload.
 The architecture spec in config.md explicitly states: "The `ConfigReloadHandle` must track the last-known StaticConfig so that it can correctly detect changes on subsequent reloads. After each successful reload, the stored StaticConfig is updated with the new value."
 ### Changes Required
 **`src/config/dynamic_config.rs`**:
 - Change `ConfigReloadHandle.static_config` from `StaticConfig` to `ArcSwap<StaticConfig>`
 - Update `ConfigReloadHandle::new()` to wrap the static config in `ArcSwap`
 - In `reload()`, after computing the diff, store the new static config: `self.static_config.store(Arc::new(new_static))`
 - Add `pub fn static_config(&self) -> Arc<StaticConfig>` accessor if needed by other code
 - Update `diff_static_config` to work with references (it already does — just need to use `self.static_config.load()` instead of `&self.static_config`)
 **Tests**:
 - Update `ConfigReloadHandle` tests to verify that a second reload only reports changes since the first reload, not changes since startup
 - Verify that the `static_config_diff_detects_changes` test still works with `ArcSwap<StaticConfig>`
 ### Implementation Note
 The `ArcSwap<StaticConfig>` approach is already used for `DynamicConfig` in the same struct. The pattern is well-established in the codebase. `ArcSwap` provides lock-free reads (important since reloads are rare but reads could be concurrent), and `Arc<StaticConfig>` is cheap to clone for comparison purposes.
 ## Acceptance Criteria
 - [ ] `ConfigReloadHandle.static_config` is `ArcSwap<StaticConfig>` (not plain `StaticConfig`)
 - [ ] After a successful reload, `self.static_config` is updated with the new value
 - [ ] A second reload with no further static config changes reports an empty diff
 - [ ] A second reload with different static config changes reports only the new changes
 - [ ] Existing reload tests pass
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/architecture/config.md — Static config changes during reload, ArcSwap pattern
 - docs/reviews/002-implementation-review.md — C4 finding
 - src/config/dynamic_config.rs — ConfigReloadHandle implementation
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/fix/connect-timeout.md
+++ b/tasks/fix/connect-timeout.md
@@ -0,0 +1,88 @@
 ---
 id: fix/connect-timeout
 name: Wire upstream_connect_timeout_secs to enforce separate connect timeout
 status: pending
 depends_on: []
 scope: narrow
 risk: medium
 impact: component
 level: implementation
 review_findings: [W4]
 ---
 ## Description
 The proxy uses `tokio::time::timeout` with `upstream_request_timeout_secs` for the entire request, but there is no separate connect timeout. A slow DNS resolution or TCP handshake consumes the full request timeout budget, leaving no time for the actual request/response cycle.
 The architecture already specifies a 5-second default connect timeout separate from the request timeout (ADR-015, ADR-017), and `SiteConfig` already includes `upstream_connect_timeout_secs` with a default of 5. The field just isn't wired up in the proxy handler.
 ### Changes Required
 **`src/proxy/handler.rs`**:
 - Read `site.upstream_connect_timeout_secs` alongside `site.upstream_request_timeout_secs`
 - Implement a two-phase timeout for the upstream request:
  ```rust
  let connect_timeout = Duration::from_secs(site.upstream_connect_timeout_secs);
  let request_timeout = Duration::from_secs(site.upstream_request_timeout_secs);
  let result = tokio::time::timeout(request_timeout, async {
      // Phase 1: connect timeout
      // The hyper client handles connect internally, so we wrap the
      // entire request in the request timeout. The connect timeout
      // can be enforced by setting it on the hyper client's connect
      // configuration, or by using a separate timeout wrapper.
      client.request(upstream_req).await
  }).await;
  ```
 - For Phase 1, the simplest correct approach is to set the connect timeout on the hyper client builder. However, hyper's `Client` doesn't directly expose a connect timeout per-request. The alternative is to use `tokio::time::timeout(connect_timeout, ...)` wrapping just the connection phase, but this requires restructuring to separate the connect from the request.
  **Recommended approach**: Since `SiteConfig` already has `upstream_connect_timeout_secs`, and the hyper client is shared across all sites, the cleanest Phase 1 approach is to document that the `upstream_request_timeout_secs` covers the entire exchange and note that a per-site connect timeout requires a per-site client or a two-phase timeout approach. For now, log a warning if the connect timeout differs from the request timeout, and use the request timeout as the overall timeout. A future Phase 2 task can implement true per-site connect timeouts with separate client builders.
  **Alternative approach**: Use `connect_timeout` on the `HttpConnector` or `HttpsConnector` when building clients. The `hyper_util::client::legacy::Client` builder supports `.pool_idle_timeout()` but not a direct connect timeout. However, `hyper_rustls::HttpsConnector` wraps an `HttpConnector` which does support `set_connect_timeout()`. This would require creating clients per-site or passing the timeout through at connection time.
  **Simplest correct approach for Phase 1**: Set `connect_timeout` on the `HttpConnector` used by both the HTTP and HTTPS clients. Create a helper that sets `connect_timeout` on the `HttpConnector`. This provides a global connect timeout that applies to all upstream connections. Per-site override can be a Phase 2 enhancement.
 - Actually, re-reading the code: the `create_http_client` and `create_https_client` functions create shared clients. The simplest approach is to add a `connect_timeout` method to the `HttpConnector` via `set_connect_timeout()` and the `HttpsConnector` builder. This would set a default connect timeout on all upstream connections.
 - For per-site connect timeout, we'd need to either:
  a. Create per-site clients (heavy approach)
  b. Use a two-phase timeout with `tokio::time::timeout` (wraps the connect phase)
 - **Decision**: Wire `upstream_connect_timeout_secs` as the connect timeout on the HTTP/HTTPS connectors. This provides site-independent connect timeout enforcement. Per-site connect timeout variation requires client-per-site which is out of scope for Phase 1. The config field exists and the value is available; the implementation should at minimum use the per-site value from `SiteConfig` as the request timeout's "connect portion" when building the outer timeout.
  **Final recommended approach**: Use a two-phase timeout in `proxy_handler`:
  ```rust
  let connect_timeout = Duration::from_secs(site.upstream_connect_timeout_secs);
  let request_timeout = Duration::from_secs(site.upstream_request_timeout_secs);
  let result = tokio::time::timeout(request_timeout, async {
      let response = client.request(upstream_req).await?;
      Ok(response)
  }).await;
  ```
  Since we can't easily separate connect from request phases with hyper's shared client, the simplest improvement is to at minimum set `set_connect_timeout` on the `HttpConnector` used to build the clients, using the default 5s value. Then per-site `upstream_request_timeout_secs` remains the overall timeout.
 ## Acceptance Criteria
 - [ ] `upstream_connect_timeout_secs` from `SiteConfig` is read and used in `proxy_handler`
 - [ ] A connect timeout is enforced on upstream connections (default 5s)
 - [ ] A request timeout remains the overall timeout (default 60s)
 - [ ] Slow upstream connections time out appropriately
 - [ ] All existing tests pass
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/architecture/decisions/015-per-site-timeouts.md — ADR-015
 - docs/architecture/decisions/017-upstream-connection-defaults.md — ADR-017
 - docs/reviews/002-implementation-review.md — W4 finding
 - src/proxy/handler.rs — current timeout implementation
 ## Notes
 > The exact implementation approach depends on what hyper's API supports for connect timeouts. If `set_connect_timeout` on `HttpConnector` works cleanly, use it. Otherwise, document the limitation and use the two-phase `tokio::time::timeout` approach.
 ## Summary
 > To be filled on completion
--- a/tasks/fix/fragile-error-detection.md
+++ b/tasks/fix/fragile-error-detection.md
@@ -0,0 +1,51 @@
 ---
 id: fix/fragile-error-detection
 name: Replace fragile string matching for incomplete message error detection
 status: pending
 depends_on: []
 scope: single
 risk: low
 impact: isolated
 level: implementation
 review_findings: [W3]
 ---
 ## Description
 In `src/server.rs:95-97`, the check `e.to_string().contains("incomplete message")` silently suppresses connection errors by matching on the error description string. This is fragile — it can break across hyper versions, locale changes, or error message reformatting.
 ### Changes Required
 **`src/server.rs`**:
 - Check if hyper exposes a typed error variant for client-disconnect errors. If `hyper::Error::is_incomplete_message()` exists or a similar method is available, use that instead of string matching.
 - If no typed variant exists, add a clear comment explaining why string matching is used and which version(s) of hyper produce this message:
  ```rust
  // Client disconnected before completing the request. hyper returns
  // "incomplete message" errors for this case, which we suppress since
  // it's not an error condition from the proxy's perspective. This is
  // checked via string matching because hyper doesn't expose a typed
  // variant for this error. Verified with hyper v1.x.
  if e.to_string().contains("incomplete message") {
      return;
  }
  ```
 ## Acceptance Criteria
 - [ ] Error detection uses typed matching if available in hyper's API
 - [ ] If string matching is kept, a comment documents why and which hyper version produces the message
 - [ ] Existing connection error suppression still works
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/reviews/002-implementation-review.md — W3 finding
 - src/server.rs — connection error handling
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/fix/graceful-shutdown.md
+++ b/tasks/fix/graceful-shutdown.md
@@ -0,0 +1,97 @@
 ---
 id: fix/graceful-shutdown
 name: Fix shutdown to drain listeners and stop background tasks cleanly
 status: pending
 depends_on: []
 scope: moderate
 risk: medium
 impact: component
 level: implementation
 review_findings: [W1, W7]
 ---
 ## Description
 Two related shutdown issues:
 1. **W1**: On shutdown, `handle.abort()` is called on each HTTPS server task in `main.rs`. This immediately kills the tokio task, interrupting in-flight request processing. The `InFlightGuard` RAII type ensures `decrement` is called on drop, but `abort()` prevents normal drops. The architecture spec says tasks should be joined with a timeout, not aborted — only aborting after the shutdown timeout expires.
   The good news: `serve_https_listener` already has a `shutdown_rx` that breaks the accept loop on shutdown signal. So tasks will stop accepting new connections. We just need to wait for them to drain in-flight requests instead of aborting them.
 2. **W7**: `start_admin_socket` runs an infinite `loop` accepting connections with no way to break out. It doesn't accept a shutdown signal, so it can't be gracefully stopped. Similarly, the rate limiter eviction task runs an infinite loop with no cancellation mechanism.
 ### Changes Required
 **`src/main.rs`**:
 - Replace `handle.abort()` loop with timeout-based join:
  ```rust
  let shutdown_timeout = shutdown.shutdown_timeout();
  for handle in https_server_handles {
      match tokio::time::timeout(shutdown_timeout, handle).await {
          Ok(_) => {}
          Err(_) => {
              warn!("shutdown timeout expired, aborting listener task");
              handle.abort();
          }
      }
  }
  ```
 - After draining, signal cancellation to admin socket and eviction task
 **`src/admin/socket.rs`**:
 - Add a `shutdown_rx: tokio::sync::watch::Receiver<bool>` parameter to `start_admin_socket`
 - Replace the infinite `loop { listener.accept().await }` with `tokio::select!`:
  ```rust
  tokio::select! {
      result = listener.accept() => { /* handle connection */ },
      _ = shutdown_rx.changed() => {
          info!("admin socket shutting down");
          break;
      }
  }
  ```
 - Clean up the socket file on exit (remove the Unix domain socket file)
 - Update callers in `main.rs` to pass the shutdown channel
 **`src/rate_limit/mod.rs`**:
 - Add a `shutdown_rx: tokio::sync::watch::Receiver<bool>` parameter to `start_eviction_task`
 - Replace infinite loop with `tokio::select!`:
  ```rust
  tokio::select! {
      _ = interval_timer.tick() => { limiter.evict_stale(max_age); },
      _ = shutdown_rx.changed() => {
          info!("rate limiter eviction task shutting down");
          break;
      }
  }
  ```
 - Update caller in `main.rs`
 ## Acceptance Criteria
 - [ ] HTTPS server tasks are joined with a timeout, not immediately aborted
 - [ ] Tasks are only aborted if the shutdown timeout expires before they finish
 - [ ] Admin socket listener breaks its accept loop on shutdown signal
 - [ ] Admin socket file is cleaned up on shutdown
 - [ ] Rate limiter eviction task breaks its loop on shutdown signal
 - [ ] ACME state machine task is cancellable (it already exits on `None` from stream, but should also respond to cancellation)
 - [ ] In-flight requests are allowed to drain before forceful shutdown
 - [ ] All existing tests pass
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/architecture/operations.md — shutdown sequence
 - docs/reviews/002-implementation-review.md — W1, W7 findings
 - src/main.rs — current shutdown sequence
 - src/admin/socket.rs — current infinite loop
 - src/rate_limit/mod.rs — current infinite eviction loop
 - src/server.rs — InFlightCounter and drain_in_flight
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/fix/http-port-validation.md
+++ b/tasks/fix/http-port-validation.md
@@ -0,0 +1,55 @@
 ---
 id: fix/http-port-validation
 name: Add http_port range validation (0 or 1-65535)
 status: pending
 depends_on: []
 scope: single
 risk: trivial
 impact: isolated
 level: implementation
 review_findings: [S1]
 ---
 ## Description
 The `http_port` validation only checks for port conflicts and whether `http_port > 0` (for the conflict check). It doesn't validate that `http_port` is in the valid range: 0 (disabled) or 1-65535. A value like `65536` or `-1` would pass validation incorrectly. There's already a `HttpsPortInvalid` error for https_port, but no equivalent for http_port.
 ### Changes Required
 **`src/config/validation.rs`**:
 - Add a validation check after the existing `http_port` conflict logic:
  ```rust
  if listener.http_port > 65535 {
      errors.push(ValidationError::HttpPortInvalid {
          bind_addr: listener.bind_addr.clone(),
          http_port: listener.http_port,
      });
  }
  ```
 - Note: `http_port = 0` is already treated as "disabled" and should be allowed, so only check `> 65535`
 **Tests**:
 - Add a test for `http_port = 65536` producing `HttpPortInvalid`
 - Add a test for `http_port = 0` still being valid (disabled)
 ## Acceptance Criteria
 - [ ] `http_port > 65535` produces a validation error
 - [ ] `http_port = 0` (disabled) remains valid
 - [ ] `http_port` in 1-65535 remains valid
 - [ ] New `HttpPortInvalid` error variant with descriptive message
 - [ ] Existing validation tests pass
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/reviews/002-implementation-review.md — S1 finding
 - src/config/validation.rs — existing validation logic
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/fix/integration-test-toml.md
+++ b/tasks/fix/integration-test-toml.md
@@ -0,0 +1,44 @@
 ---
 id: fix/integration-test-toml
 name: Fix double-nested listeners.listeners.sites in integration test TOML
 status: pending
 depends_on: []
 scope: single
 risk: trivial
 impact: isolated
 level: implementation
 review_findings: [S10]
 ---
 ## Description
 The `write_valid_config` helper in `tests/integration_test.rs:514` contains `[[listeners.listeners.sites]]` which is a double-nested TOML path that doesn't match the config schema. The correct key is `[[listeners.sites]]`.
 The test passes because it's used as an invalid config fixture (the binary validates and exits with error), but the function name `write_valid_config` suggests it was intended to produce valid config.
 ### Changes Required
 **`tests/integration_test.rs`**:
 - Fix the TOML key from `[[listeners.listeners.sites]]` to `[[listeners.sites]]`
 - If the function was intended to produce valid config, verify the test still passes with the corrected TOML
 - If the function is deliberately producing invalid config for a validation test, rename it to `write_invalid_config` or add a comment explaining the intentional invalidity
 ## Acceptance Criteria
 - [ ] TOML nesting is corrected to `[[listeners.sites]]`
 - [ ] Test function name matches its purpose (valid or invalid config)
 - [ ] All integration tests pass
 - [ ] `cargo test` passes
 ## References
 - docs/reviews/002-implementation-review.md — S10 finding
 - tests/integration_test.rs — `write_valid_config` helper
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/fix/logging-test-global-subscriber.md
+++ b/tasks/fix/logging-test-global-subscriber.md
@@ -0,0 +1,50 @@
 ---
 id: fix/logging-test-global-subscriber
 name: Fix logging test that conflicts with global tracing subscriber
 status: pending
 depends_on: []
 scope: single
 risk: low
 impact: isolated
 level: implementation
 review_findings: [W9]
 ---
 ## Description
 The test `init_creates_log_directory_and_file` in `src/logging/mod.rs` calls `init()` which sets a global default tracing subscriber. When tests run in parallel, this conflicts with other tests that may also set a subscriber, causing the test to fail with "a global default trace dispatcher has already been set."
 ### Changes Required
 **`src/logging/mod.rs`**:
 - Change `init()` or the test to use `tracing_subscriber::util::SubscriberInitExt::try_init()` which returns an error if already set rather than panicking
 - Alternatively, use `std::sync::OnceLock` to guard against double-initialization in tests
 - The production code should still use `init()` (which panics on double-init — that's correct behavior for a production binary), but the test should handle the case where a subscriber is already set
 ### Approach
 The cleanest approach is to make `init()` return `Result<()>` and use `try_init()` internally. If initialization fails because a subscriber is already set, return an error rather than panicking. This way:
 - Production code: `init()` is called once at startup; if it fails, that's a real error
 - Tests: Can call `init()` in a test helper and gracefully handle the "already initialized" case
 Alternatively, keep `init()` as-is and only change the test to use `try_init()` directly.
 ## Acceptance Criteria
 - [ ] Logging test no longer panics when run in parallel with other tests
 - [ ] Production `init()` behavior is unchanged (sets global subscriber)
 - [ ] All tests pass (including parallel `cargo test`)
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/reviews/002-implementation-review.md — W9 finding
 - src/logging/mod.rs — `init()` function and `init_creates_log_directory_and_file` test
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/fix/normalize-host-ipv6.md
+++ b/tasks/fix/normalize-host-ipv6.md
@@ -0,0 +1,58 @@
 ---
 id: fix/normalize-host-ipv6
 name: Fix normalize_host to handle IPv6 bracket notation
 status: pending
 depends_on: []
 scope: single
 risk: low
 impact: isolated
 level: implementation
 review_findings: [S3]
 ---
 ## Description
 `normalize_host` in `src/config/dynamic_config.rs` uses `split(':').next()` to strip ports, but this fails for IPv6 addresses in brackets (e.g., `[::1]:443` would normalize to `[::1]` instead of `::1`). The `strip_port_from_host` function in `src/tls/redirect.rs` correctly handles this case.
 ### Changes Required
 **`src/config/dynamic_config.rs`**:
 - Replace the `normalize_host` implementation with a port-stripping function that handles IPv6 bracket notation
 - Either extract a shared utility function from `strip_port_from_host` in `redirect.rs`, or make `normalize_host` use the same logic
 - The function should:
  - If host starts with `[`, find the closing `]` and strip the port after it
  - Otherwise, use the existing `split(':').next()` logic
  - Convert to lowercase
 **`src/tls/redirect.rs`**:
 - Consider extracting `strip_port_from_host` into a shared module (e.g., `src/utils.rs` or inline in both places) to avoid code duplication
 ### Tests
 - Add test cases for `normalize_host` with IPv6:
  - `[::1]:443` → `::1` (or `[::1]` depending on convention — check what the routing table expects)
  - `[2001:db8::1]:8080` → `2001:db8::1`
  - Plain hostname with port: `example.com:443` → `example.com`
  - Plain hostname without port: `example.com` → `example.com`
 ## Acceptance Criteria
 - [ ] `normalize_host` correctly strips ports from IPv6 bracket notation
 - [ ] Either a shared utility function is extracted, or both implementations use the same logic
 - [ ] New test cases for IPv6 host normalization
 - [ ] Existing routing tests pass
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/reviews/002-implementation-review.md — S3 finding
 - src/config/dynamic_config.rs — `normalize_host` function
 - src/tls/redirect.rs — `strip_port_from_host` function (correct implementation)
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/fix/remove-health-and-hardcode-https.md
+++ b/tasks/fix/remove-health-and-hardcode-https.md
@@ -0,0 +1,79 @@
 ---
 id: fix/remove-health-and-hardcode-https
 name: Remove /health from main listener and hardcode X-Forwarded-Proto to https
 status: pending
 depends_on: []
 scope: narrow
 risk: low
 impact: component
 level: implementation
 review_findings: [W5, W14]
 adr: [022]
 ---
 ## Description
 Two related changes that simplify the proxy handler:
 1. **W5 / ADR-022**: Remove the `/health` route from the main HTTPS listener. Health checking is an operational concern served exclusively by the dedicated local port (9900) and the admin socket's `status` command. Serving `/health` on the main listener creates collision with upstream applications and requires special-case routing before host matching. The route must be removed entirely — no configurable path replacement.
 2. **W14**: Remove the `is_https` field from `ProxyState` and hardcode `X-Forwarded-Proto: https` in `inject_proxy_headers`. The proxy only proxies requests on the HTTPS listener (which always uses TLS), and the HTTP redirect listener sends 301 redirects rather than proxying. The `is_https` field was always `true` and was a latent bug for non-TLS contexts. Since there is no non-TLS proxying path, hardcoding `"https"` with a clear comment is correct.
 ### Changes Required
 **`src/proxy/handler.rs`**:
 - Remove `health_handler` function
 - Remove the `/health` early return in `proxy_handler` (lines 37-39)
 - Remove `is_https` field from `ProxyState` struct
 - Remove `is_https` parameter from `proxy_router()` — it's no longer needed
 - Remove the `/health` route from `proxy_router()` — `Router::new().fallback(proxy_handler).with_state(state)` instead of `Router::new().route("/health", get(health_handler)).fallback(proxy_handler).with_state(state)`
 **`src/proxy/headers.rs`**:
 - Change `inject_proxy_headers` signature to remove `is_https: bool` parameter
 - Hardcode `X-Forwarded-Proto` to `"https"` with a comment explaining why:
  ```
  // X-Forwarded-Proto is always "https" because this proxy only forwards requests
  // received on the TLS listener. The HTTP listener redirects to HTTPS and does not
  // proxy requests, so X-Forwarded-Proto is never set for HTTP connections.
  ```
 **`src/main.rs`**:
 - Remove `is_https: true` from `ProxyState` initialization
 - Update any calls to `proxy_router()` or `build_router()` that pass `is_https`
 **`src/proxy/mod.rs`**:
 - Update `build_router` signature if it references `is_https`
 **Tests**:
 - Remove `health_path_returns_200_regardless_of_host` test
 - Remove `health_with_unknown_host_returns_200` test
 - Update `make_proxy_state` helper — remove `is_https` field
 - Update `inject_proxy_headers` tests — remove `is_https` parameter, verify `X-Forwarded-Proto` is always `"https"`
 - The health check endpoint on port 9900 remains independently tested in `src/health.rs`
 ## Acceptance Criteria
 - [ ] No `/health` route on the main HTTPS listener
 - [ ] `ProxyState` struct no longer has `is_https` field
 - [ ] `inject_proxy_headers` always sets `X-Forwarded-Proto: https` (no parameter)
 - [ ] Code comment explains why `X-Forwarded-Proto` is always `"https"`
 - [ ] `proxy_handler` has no special-case path matching before Host lookup
 - [ ] Port 9900 health check (`src/health.rs`) is unchanged and working
 - [ ] All existing tests pass (minus removed health-on-main-listener tests)
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/architecture/decisions/022-health-check-scope.md — ADR-022
 - docs/architecture/proxy.md — updated request flow (no /health route)
 - docs/architecture/operations.md — health check is port 9900 only
 - docs/reviews/002-implementation-review.md — W5, W14 findings
 - docs/architecture/open-questions.md — OQ-08, OQ-11
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/fix/request-timeout-scope.md
+++ b/tasks/fix/request-timeout-scope.md
@@ -0,0 +1,50 @@
 ---
 id: fix/request-timeout-scope
 name: Document request timeout scope and add Server header removal comment
 status: pending
 depends_on: []
 scope: single
 risk: trivial
 impact: isolated
 level: implementation
 review_findings: [S8]
 ---
 ## Description
 Review finding S8 notes that the request timeout applies to the entire HTTP exchange, not just the connection or first-byte response. For large file downloads or slow upstreams, this means a 60-second timeout kills the response even if the upstream is actively sending data.
 The review marks this as "acceptable for Phase 1" and the timeout name (`upstream_request_timeout_secs`) is documented as applying to the full request. However, the code should clearly document this behavior so it's not surprising.
 ### Changes Required
 **`src/proxy/handler.rs`**:
 - Add a comment above the timeout wrapping in `proxy_handler`:
  ```rust
  // The timeout covers the entire HTTP round-trip including response body
  // streaming. For large file downloads or slow upstreams, this means the
  // timeout kills the response even if the upstream is actively sending data.
  // A more precise timeout would apply only to connect + first-byte, then
  // stream the body without a timeout. The `upstream_connect_timeout_secs`
  // field in SiteConfig exists for a separate connect timeout (see W4).
  // For Phase 1, this full-request timeout is acceptable.
  ```
 ## Acceptance Criteria
 - [ ] Comment documents the timeout scope
 - [ ] No code behavior change
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/reviews/002-implementation-review.md — S8 finding
 - src/proxy/handler.rs — timeout wrapping
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/fix/token-bucket-nanosecond.md
+++ b/tasks/fix/token-bucket-nanosecond.md
@@ -0,0 +1,56 @@
 ---
 id: fix/token-bucket-nanosecond
 name: Fix token bucket refill to use nanosecond precision
 status: pending
 depends_on: []
 scope: single
 risk: trivial
 impact: isolated
 level: implementation
 review_findings: [W6]
 ---
 ## Description
 The token bucket refill calculation in `src/rate_limit/bucket.rs` uses `as_millis()` which truncates sub-millisecond time. Two requests arriving 500µs apart both see `0ms` elapsed and don't refill tokens. This can lead to token refill inaccuracies under high-frequency request bursts.
 ### Changes Required
 **`src/rate_limit/bucket.rs`**:
 - Change line 37 from:
  ```rust
  let elapsed = now.duration_since(self.last_refill).as_millis() as f64;
  let tokens_to_add = (elapsed * rate) / 1000.0;
  ```
  to:
  ```rust
  let elapsed = now.duration_since(self.last_refill).as_nanos() as f64;
  let tokens_to_add = (elapsed / 1_000_000_000.0) * rate;
  ```
 This provides nanosecond-precision refill while keeping the math in floating point.
 ### Tests
 - Verify existing token bucket tests still pass
 - The `token_bucket_refills_over_time` test already validates refill behavior; it may need a longer sleep to notice the difference with nanosecond precision
 ## Acceptance Criteria
 - [ ] Token refill uses nanosecond precision (`as_nanos()`)
 - [ ] Math is `(elapsed_nanos / 1_000_000_000.0) * rate`
 - [ ] Existing token bucket tests pass
 - [ ] `cargo clippy` passes with no warnings
 ## References
 - docs/reviews/002-implementation-review.md — W6 finding
 - src/rate_limit/bucket.rs — current millisecond-based refill
 ## Notes
 > To be filled by implementation agent
 ## Summary
 > To be filled on completion
--- a/tasks/review/post-fix-review.md
+++ b/tasks/review/post-fix-review.md
@@ -0,0 +1,87 @@
 ---
 id: review/post-fix-review
 name: Review all post-fix implementations for correctness and spec conformance
 status: pending
 depends_on: [fix/acme-contact-and-challenge, fix/remove-health-and-hardcode-https, fix/config-reload-static-drift, fix/access-logging, fix/graceful-shutdown, fix/connect-timeout, fix/token-bucket-nanosecond, fix/normalize-host-ipv6, fix/http-port-validation, fix/integration-test-toml, fix/logging-test-global-subscriber, fix/fragile-error-detection, fix/add-code-comments, fix/clean-dead-code, fix/request-timeout-scope]
 scope: broad
 risk: low
 impact: project
 level: review
 ---
 ## Description
 Post-implementation review of all fix tasks from review #002. Verify that each fix correctly addresses the original finding, matches the architecture spec, and doesn't introduce regressions.
 ### Review Checklist
 1. **ACME contact and challenge (C1, C2)**:
   - `acme_contact` field exists on `TlsConfig` and deserializes from TOML
   - Validation rule 19 enforces `mailto:` format for `acme_contact` in ACME mode
   - `AcmeTlsConfig.contact` receives configured contact (not empty vec)
   - `build_acme_challenge_config` removed, `TlsMode::Acme` no longer has `challenge_config`
   - TLS-ALPN-01 challenge still works via main listener ALPN protocol
 2. **Health route removal / HTTPS hardcode (W5, W14, ADR-022)**:
   - No `/health` route on main listener
   - `is_https` removed from `ProxyState`
   - `X-Forwarded-Proto` always `"https"` with explanatory comment
   - Port 9900 health check still works independently
 3. **Config reload drift (C4)**:
   - `ConfigReloadHandle.static_config` uses `ArcSwap<StaticConfig>`
   - Second reload only reports new changes, not repeated old changes
 4. **Access logging (W13)**:
   - `log_request!` called for every proxied request
   - `log_upstream_error!` called for upstream failures
   - Log format matches spec: `REQUEST client_ip=... host=... method=... path=... status=... upstream=... duration_ms=...`
   - `duration_ms` measures wall time accurately
 5. **Graceful shutdown (W1, W7)**:
   - HTTPS tasks joined with timeout, not immediately aborted
   - Admin socket and eviction task respond to shutdown signal
   - Socket file cleaned up on exit
 6. **Connect timeout (W4)**:
   - `upstream_connect_timeout_secs` wired up in proxy handler
   - Separate connect timeout enforced
 7. **All other fixes**:
   - Token bucket nanosecond precision
   - IPv6 normalize_host
   - http_port validation
   - Integration test TOML
   - Logging test global subscriber
   - Fragile error detection
   - Code comments added
   - Dead code cleaned, `#[non_exhaustive]` added
 8. **Quality gates**:
   - `cargo clippy` passes with no warnings
   - `cargo fmt --check` passes
   - `cargo test` passes (all unit and integration tests)
   - No regressions in existing functionality
 ## Acceptance Criteria
 - [ ] All fix tasks completed and verified
 - [ ] `cargo clippy` passes with no warnings
 - [ ] `cargo fmt --check` passes
 - [ ] `cargo test` passes
 - [ ] Architecture spec conformance verified for all changes
 - [ ] No regressions in existing tests
 ## References
 - docs/reviews/002-implementation-review.md — all findings
 - docs/architecture/ — updated architecture docs
 - docs/architecture/decisions/022-health-check-scope.md — ADR-022
 ## Notes
 > This review should run after all fix tasks are complete. Focus on correctness, spec conformance, and regression testing.
 ## Summary
 > To be filled on completion