Add post-implementation code review (4 critical, 12 warning, 8 suggestion findings)

2026-06-11 14:20:06 +00:00
parent 57cb071ff2
commit 39e1b82308
1 changed files with 586 additions and 0 deletions
--- a/docs/reviews/002-implementation-review.md
+++ b/docs/reviews/002-implementation-review.md
@@ -0,0 +1,586 @@
 ---
 status: draft
 last_updated: 2026-06-11
 reviewed_code:
  - src/main.rs
  - src/server.rs
  - src/tls/acceptor.rs
  - src/tls/acme.rs
  - src/tls/config.rs
  - src/tls/redirect.rs
  - src/config/validation.rs
  - src/config/dynamic_config.rs
  - src/config/static_config.rs
  - src/config/mod.rs
  - src/proxy/handler.rs
  - src/proxy/headers.rs
  - src/proxy/error.rs
  - src/proxy/body_limit.rs
  - src/proxy/mod.rs
  - src/rate_limit/mod.rs
  - src/rate_limit/bucket.rs
  - src/admin/socket.rs
  - src/shutdown.rs
  - src/health.rs
  - src/logging/mod.rs
  - src/logging/format.rs
  - src/cli.rs
 reviewer: code-reviewer
 ---
 # Implementation Review #002
 ## Purpose
 Post-implementation review of all modules. Each finding is structured as
 **Problem** → **Solution** (or **Open Question** where no solution is yet known).
 ## Severity Definitions
 | Severity | Meaning |
 |----------|---------|
 | **Critical** | Will cause incorrect behavior or security issues in production |
 | **Warning** | Could cause issues under specific conditions or represents a missed edge case |
 | **Suggestion** | Code quality, style, or minor improvement opportunity |
 ---
 ## Critical Findings
 ### C1. ACME Challenge Listener Not Started
 **File**: `src/main.rs:168-181`
 **Problem**: When `TlsMode::Acme` is matched, the `challenge_config` and `resolver`
 fields are destructured with `_` (discarded), so no dedicated ACME TLS-ALPN-01
 challenge listener is ever started. If the main HTTPS listener does not answer
 challenges itself, Let's Encrypt cannot verify domain ownership and no
 certificates will ever be obtained.
 **Solution**: First determine whether a dedicated listener is needed at all.
 The `rustls-acme` crate's `ResolvesServerCertAcme` serves challenge certificates
 on the main HTTPS port when the ACME challenge ALPN protocol (`acme-tls/1`) is
 in the server's ALPN list, and `build_acme_server_config` already includes
 `ACME_TLS_ALPN_01` in its ALPN list (line 28). Then:
 1. Confirm that the resolver is actually installed on the default server config
   used by the main listener.
 2. If it is, the main listener should be sufficient for TLS-ALPN-01, and the
   separate `challenge_config` in `TlsMode::Acme` may be unnecessary.
 3. If it is not, wire `challenge_config` into a separate TLS listener, or
   install the resolver on the main config so that challenges are served on the
   same port.
 ---
 ### C2. ACME Contact Email Always Empty
 **File**: `src/tls/acme.rs:86`
 **Problem**: `AcmeTlsConfig` always sets `contact: vec![]`, and the configuration
 offers no way to supply a contact address. The `acme_directory` config field
 supports `"production"`, where an empty contact list may cause ACME registration
 to fail with a 400-level error from the Let's Encrypt API. RFC 8555 makes the
 account `contact` field optional, so confirm the actual failure against the
 Let's Encrypt directory before treating it as certain.
 **Solution**: Add an `acme_contact` field (e.g., `acme_contact = ["mailto:admin@example.com"]`)
 to the `TlsConfig` struct and wire it through to `AcmeTlsConfig.contact`. This
 requires changes to:
 1. `src/config/static_config.rs`: add `acme_contact: Vec<String>` to `TlsConfig`
 2. `src/tls/acme.rs`: use the configured contact list instead of `vec![]` in `AcmeTlsConfig::setup`
 3. `src/tls/acceptor.rs`: pass the contact list from `TlsConfig`
 ---
 ### C3. X-Forwarded-For Replaces Instead of Appending
 **File**: `src/proxy/headers.rs:28`
 **Problem**: `inject_proxy_headers` uses `headers.insert()` for `X-Forwarded-For`,
 which **replaces** any existing value with the client IP. This matches the spec:
 review #001, finding C2 decided that, as an edge proxy with no trusted proxies
 in front, X-Forwarded-For should be **set** (not appended). The issue is that
 the code carries no comment explaining the rationale, so the silent discard of
 a client-supplied header looks like a bug and invites a well-meaning change to
 `append`.
 **Solution**: The current behavior is actually **correct** for an edge proxy
 (per review #001/C2). The fix is documentation, not code change. Add a comment
 to `inject_proxy_headers` explaining:
 ```rust
 // X-Forwarded-For is SET (not appended) because this proxy is the outermost
 // edge proxy. Any existing X-Forwarded-For from the client is untrusted and
 // must be replaced with the actual client IP from ConnectInfo.
 ```
 This ensures future maintainers don't "fix" this by changing `insert` to `append`.
 ---
 ### C4. ConfigReloadHandle Never Updates Stored StaticConfig
 **File**: `src/config/dynamic_config.rs:124-148`
 **Problem**: The `reload()` method computes `diff_static_config(&self.static_config, &new_static)`
 and returns the changed fields, but **never updates `self.static_config`** with
 the new static config. This means:
 - The first reload correctly reports changed fields
 - The second reload compares against the **original** static config, not the
  last-reloaded one, and reports the same changes again
 - Operators get repeated "static config fields changed" warnings for the same
  fields on every reload
 **Solution**: After validation succeeds and the diff is computed, update
 `self.static_config` with the new value. Since `ConfigReloadHandle` is accessed
 concurrently, the static config field needs interior mutability. Use
 `ArcSwap<StaticConfig>` or `std::sync::RwLock<StaticConfig>`:
 ```rust
 pub struct ConfigReloadHandle {
    config: Arc<ArcSwap<DynamicConfig>>,
    static_config: ArcSwap<StaticConfig>,  // Changed from StaticConfig
    reload_mutex: Mutex<()>,
 }
 ```
 Then in `reload()`:
 ```rust
 let changed_fields = diff_static_config(&self.static_config.load(), &new_static);
 self.static_config.store(Arc::new(new_static));
 self.config.store(Arc::new(new_dynamic));
 ```
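 The drift described above can be reproduced and fixed in a small std-only
 sketch. It uses the `std::sync::RwLock` alternative rather than `ArcSwap`, and
 the two-field `StaticConfig`, the `ReloadHandle` type, and the field names are
 hypothetical stand-ins, not the project's real definitions:
 ```rust
 use std::sync::RwLock;

 // Hypothetical, reduced stand-in for the real StaticConfig.
 #[derive(Clone, Debug, PartialEq)]
 pub struct StaticConfig {
     pub log_dir: String,
     pub health_check_port: u16,
 }

 // Hypothetical stand-in for ConfigReloadHandle, static part only.
 pub struct ReloadHandle {
     static_config: RwLock<StaticConfig>,
 }

 impl ReloadHandle {
     pub fn new(initial: StaticConfig) -> Self {
         Self { static_config: RwLock::new(initial) }
     }

     /// Returns the names of static fields that differ from the last accepted
     /// config, then records `new_static` as the new baseline.
     pub fn reload(&self, new_static: StaticConfig) -> Vec<&'static str> {
         let mut current = self.static_config.write().unwrap();
         let mut changed = Vec::new();
         if current.log_dir != new_static.log_dir {
             changed.push("log_dir");
         }
         if current.health_check_port != new_static.health_check_port {
             changed.push("health_check_port");
         }
         // Without this assignment every later reload repeats the same diff.
         *current = new_static;
         changed
     }
 }
 ```
 With the baseline updated, a second reload of an unchanged file reports no
 changed fields, which is the behavior operators would expect.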
 ---
 ## Warning Findings
 ### W1. Shutdown Aborts Listeners Without Draining
 **File**: `src/main.rs:238-240`
 **Problem**: On shutdown, `handle.abort()` is called on each HTTPS server task.
 This immediately kills the tokio task, which can interrupt in-flight request
 processing. The `drain_in_flight` counter is only decremented in
 `InFlightGuard::drop`, but `abort()` prevents the guard from being dropped
 normally for connections still being processed by the task.
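 **Solution**: Instead of calling `abort()` immediately, await each handle under
 `tokio::time::timeout` with the shutdown timeout and abort only on expiry. The
 handle must be awaited by mutable reference so it is still usable for `abort()`.
 The timeout applies per handle, so the worst case is the timeout multiplied by
 the number of listeners:
 ```rust
 let shutdown_timeout = shutdown.shutdown_timeout();
 for mut handle in https_server_handles {
    // Await `&mut handle`: passing `handle` by value would move it and make
    // the `abort()` call below a use-after-move compile error.
    match tokio::time::timeout(shutdown_timeout, &mut handle).await {
        Ok(_) => {}
        Err(_) => {
            warn!("shutdown timeout expired, aborting listener task");
            handle.abort();
        }
    }
 }
 ```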
 Alternatively, keep the `shutdown_rx` pattern already in `serve_https_listener`
 (which breaks the accept loop on shutdown signal) and remove the `abort()` calls
 — the tasks will naturally exit when they stop accepting new connections and
 existing connections drain.
 ---
 ### W2. Shutdown Doesn't Stop Admin Socket or Background Tasks
 **File**: `src/main.rs:238-250`
 **Problem**: The shutdown sequence aborts HTTPS listeners but doesn't stop:
 - The admin socket listener (runs in infinite loop at `src/admin/socket.rs:99-111`)
 - The rate limiter eviction task (runs in infinite loop at `src/rate_limit/mod.rs:106-112`)
 - The ACME state machine task
 These tasks will continue running until the process exits, which is fine for
 process termination but means they can't be gracefully stopped in tests or
 during a clean shutdown.
 **Solution**: For Phase 1, this is acceptable — process termination will clean
 up these tasks. For future improvements:
 - Pass a `CancellationToken` or `watch::Receiver<bool>` to `start_admin_socket`
  and the eviction task
 - On shutdown, signal cancellation before waiting for drain
 - The ACME state machine task already exits when the stream ends (`None` branch
  at `src/tls/acme.rs:152-158`), but it should also be cancellable
 ---
 ### W3. Fragile Error Detection in Connection Error Handler
 **File**: `src/server.rs:95-97`
 **Problem**: The check `e.to_string().contains("incomplete message")` to silently
 suppress connection errors is fragile. String matching on error descriptions can
 break across hyper versions or whenever the error message is reworded.
 **Solution**: Match on the error type instead. `hyper::Error` exposes
 `is_incomplete_message()`, so if the error at this call site is (or can be
 downcast to) a `hyper::Error`, use that method. If the error arrives type-erased
 and cannot be downcast, add a comment explaining why string matching is used and
 which version(s) of hyper produce this message.
 ---
 ### W4. No Separate Connect Timeout for Upstream Requests
 **File**: `src/proxy/handler.rs:73-79`
 **Problem**: The proxy uses `tokio::time::timeout` with
 `upstream_request_timeout_secs` for the entire request, but there's no separate
 connect timeout. A slow DNS resolution or TCP handshake will consume the full
 request timeout budget, leaving no time for the actual request/response cycle.
 **Solution**: Add a `connect_timeout` (either a fixed default or from
 `upstream_connect_timeout_secs` in `SiteConfig`). `SiteConfig` already has
 `upstream_connect_timeout_secs`, but `proxy_handler` never reads it. The
 snippet below only computes `connect_timeout`; it still has to be applied,
 either on the client's connector or through a two-phase timeout:
 ```rust
 let connect_timeout = Duration::from_secs(site.upstream_connect_timeout_secs);
 let result = tokio::time::timeout(
    request_timeout,
    async {
        let upstream_req = build_upstream_request(req, &upstream_uri)?;
        // TODO: `connect_timeout` is not applied yet. The hyper client handles
        // connect internally, so set it on the connector when the client is
        // built, or wrap the connect phase in its own timeout.
        client.request(upstream_req).await
    }
 ).await;
 ```
 ---
 ### W5. Hardcoded `/health` Path Intercepted on All Hosts
 **File**: `src/proxy/handler.rs:37-39`
 **Problem**: The proxy handler returns 200 OK for `GET /health` regardless of the
 Host header. This means any site's `/health` path will be intercepted by the
 proxy and never reach the upstream. If an upstream application uses `/health`
 for its own health checks, those requests will never reach it.
 **Solution**: Either:
 1. Use a less common path like `/__health` or `/healthz` that won't collide
   with upstream applications, OR
 2. Only intercept `/health` when the Host header doesn't match any known site
   (fallthrough), OR
 3. Make the health check path configurable via `StaticConfig`
 Option 1 is simplest for Phase 1. Option 3 is most flexible long-term.
 ---
 ### W6. Token Bucket Refill Uses Millisecond Precision
 **File**: `src/rate_limit/bucket.rs:37`
 **Problem**: `elapsed.as_millis()` truncates sub-millisecond time, which can lead
 to token refill inaccuracies under high-frequency request bursts. For example,
 two requests arriving 500µs apart both see `0ms` elapsed and don't refill tokens.
 **Solution**: Use `as_nanos()` for the refill calculation:
 ```rust
 let elapsed = now.duration_since(self.last_refill).as_nanos() as f64;
 let tokens_to_add = (elapsed / 1_000_000_000.0) * rate;
 ```
 This provides nanosecond-precision refill while keeping the math in floating point.
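 As an alternative spelling of the same idea, `Duration::as_secs_f64()` yields
 fractional seconds directly and avoids the manual division. The sketch below is
 a simplified bucket for illustration only; the `TokenBucket` layout, field
 names, and the injected `now` parameter are assumptions, not the code in
 `src/rate_limit/bucket.rs`:
 ```rust
 use std::time::Instant;

 // Hypothetical, simplified bucket; field names are illustrative.
 pub struct TokenBucket {
     tokens: f64,
     capacity: f64,
     rate_per_sec: f64,
     last_refill: Instant,
 }

 impl TokenBucket {
     pub fn new(capacity: f64, rate_per_sec: f64, now: Instant) -> Self {
         Self { tokens: capacity, capacity, rate_per_sec, last_refill: now }
     }

     /// Adds tokens for the time elapsed since the last refill, capped at capacity.
     pub fn refill(&mut self, now: Instant) {
         // Returns zero instead of panicking if `now` is earlier than last_refill.
         let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
         self.tokens = (self.tokens + elapsed * self.rate_per_sec).min(self.capacity);
         self.last_refill = now;
     }

     /// Refills, then takes one token if at least one is available.
     pub fn try_acquire(&mut self, now: Instant) -> bool {
         self.refill(now);
         if self.tokens >= 1.0 {
             self.tokens -= 1.0;
             true
         } else {
             false
         }
     }
 }
 ```
 Passing `now` in as a parameter keeps the refill math deterministic in unit
 tests, since sub-millisecond gaps can be constructed with `Instant + Duration`.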
 ---
 ### W7. Admin Socket Has No Shutdown Mechanism
 **File**: `src/admin/socket.rs:99-111`
 **Problem**: `start_admin_socket` runs an infinite `loop` accepting connections
 with no way to break out. It doesn't accept a shutdown signal, so it can't be
 gracefully stopped during process shutdown or in tests.
 **Solution**: Accept a `watch::Receiver<bool>` parameter and use `tokio::select!`
 to check for shutdown:
 ```rust
 tokio::select! {
    result = listener.accept() => { /* handle connection */ },
    _ = shutdown_rx.changed() => {
        info!("admin socket shutting down");
        break;
    }
 }
 ```
 This also requires cleaning up the socket file on exit.
 ---
 ### W8. Server Header Unconditionally Stripped from Upstream Response
 **File**: `src/proxy/handler.rs:85`
 **Problem**: `parts.headers.remove("server")` unconditionally removes the upstream
 `Server` header. This is a design choice, not necessarily a bug, but it means
 downstream clients can't see what software the upstream is running, which may be
 undesirable for debugging.
 **Solution**: This is acceptable behavior for a security-focused reverse proxy
 (hiding upstream identity is a defense-in-depth measure). Document this decision
 with a comment in the code explaining the rationale.
 ---
 ### W9. Logging Test Fails Due to Global Subscriber
 **File**: `src/logging/mod.rs:99-111`
 **Problem**: The test `init_creates_log_directory_and_file` calls `init()` which
 sets a global default tracing subscriber. When tests run in parallel, this
 conflicts with other tests that may also set a subscriber, causing the test to
 fail with "a global default trace dispatcher has already been set."
 **Solution**: Use `tracing_subscriber::fmt().with_test_writer()` and guard
 against double-initialization. Alternatively, use `std::sync::OnceLock` or
 `tracing_subscriber::util::SubscriberInitExt::try_init()` which returns an error
 if already set rather than panicking.
 ---
 ### W10. Body Limit Middleware Only Checks Content-Length Header
 **File**: `src/proxy/body_limit.rs:26-33`
 **Problem**: The middleware checks the `Content-Length` header but doesn't handle
 requests that lack `Content-Length` (e.g., chunked transfers, HTTP/2). A
 malicious client can send `Transfer-Encoding: chunked` without `Content-Length`
 and bypass the initial check, though the `Limited` body wrapper at line 37 will
 still enforce the limit during streaming.
 **Solution**: This is actually acceptable — the `Limited` body wrapper on line
 37 is the real enforcement mechanism. The `Content-Length` check on line 26 is
 an early-rejection optimization for clients that do include the header. Add a
 comment explaining this two-layer defense:
 ```rust
 // Early rejection: if Content-Length is present and exceeds the limit, reject
 // immediately without reading the body. For requests without Content-Length
 // (chunked, HTTP/2), the Limited body wrapper below enforces the limit during
 // streaming.
 ```
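 The two layers can be illustrated with a std-only analogue. This is a sketch
 under stated assumptions: `declared_length_too_large` and `LimitedReader` are
 hypothetical names, and `std::io::Read` stands in for the streaming HTTP body
 that the real `Limited` wrapper operates on:
 ```rust
 use std::io::{self, Read};

 /// Layer 1: reject early when a declared Content-Length exceeds the limit.
 /// `None` means the header was absent, so nothing can be decided yet.
 pub fn declared_length_too_large(content_length: Option<u64>, limit: u64) -> bool {
     matches!(content_length, Some(len) if len > limit)
 }

 /// Layer 2: fails the read as soon as more than `limit` bytes were produced.
 pub struct LimitedReader<R> {
     inner: R,
     remaining: u64,
 }

 impl<R: Read> LimitedReader<R> {
     pub fn new(inner: R, limit: u64) -> Self {
         Self { inner, remaining: limit }
     }
 }

 impl<R: Read> Read for LimitedReader<R> {
     fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
         let n = self.inner.read(buf)?;
         if n as u64 > self.remaining {
             return Err(io::Error::new(io::ErrorKind::InvalidData, "body exceeds limit"));
         }
         self.remaining -= n as u64;
         Ok(n)
     }
 }
 ```
 A request with no declared length passes layer 1 and is still cut off by
 layer 2 once the streamed bytes exceed the limit.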
 ---
 ### W11. Health Check Port Conflict Check Is Incomplete
 **File**: `src/config/validation.rs:169-186`
 **Problem**: The validation compares `health_check_port` against every
 listener's `https_port` and `http_port` by port number only and ignores bind
 addresses. It can therefore report a conflict between two sockets that bind to
 different addresses and could coexist. For example, `health_check_port = 80`
 bound to `127.0.0.1` does not conflict with a listener's `http_port = 80` bound
 to `203.0.113.10`.
 **Solution**: This is acceptable for Phase 1 since the health check always binds
 to `127.0.0.1` (hardcoded in `src/health.rs:20`). Document that health check
 always binds to localhost, so the conflict check is conservative (warns even
 when it might not actually conflict). Add a comment in the validation code
 explaining this.
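 If the check is ever made address-aware, one possible shape is shown below. It
 is a sketch only: `binds_conflict` is a hypothetical helper, and it deliberately
 treats a wildcard address as overlapping every address (including the other IP
 family), which errs on the side of reporting a conflict:
 ```rust
 use std::net::SocketAddr;

 /// Two binds collide when the ports match and the addresses overlap:
 /// either the IPs are equal, or one side is a wildcard (0.0.0.0 or ::).
 pub fn binds_conflict(a: SocketAddr, b: SocketAddr) -> bool {
     if a.port() != b.port() {
         return false;
     }
     a.ip() == b.ip() || a.ip().is_unspecified() || b.ip().is_unspecified()
 }
 ```
 Under this rule the `127.0.0.1:80` versus `203.0.113.10:80` case is accepted,
 while a listener on `0.0.0.0:80` is still flagged.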
 ---
 ### W12. `build_upstream_request` Copies All Headers Without Filtering
 **File**: `src/proxy/handler.rs:124-128`
 **Problem**: `build_upstream_request` copies all headers from the original request
 to the upstream request. However, `remove_hop_by_hop` was already called on the
 original request's headers at line 59 before `build_upstream_request` is called,
 so hop-by-hop headers have already been removed. The proxy headers
 (`X-Real-IP`, `X-Forwarded-For`, `X-Forwarded-Proto`) were also already injected
 at line 58. This means the function works correctly, but the code path is
 spread across multiple locations, making it harder to reason about header
 lifecycle.
 **Solution**: Consider consolidating header manipulation into a single function
 or adding inline comments that trace the header lifecycle:
 ```rust
 // Header lifecycle:
 // 1. inject_proxy_headers() — adds X-Real-IP, X-Forwarded-For, X-Forwarded-Proto
 // 2. remove_hop_by_hop() — removes Connection, Keep-Alive, etc.
 // 3. build_upstream_request() — copies remaining headers to upstream request
 // 4. Response: remove_hop_by_hop() on upstream response headers
 ```
 ---
 ## Suggestions
 ### S1. `http_port` Validation Allows 0 but Not Documented
 **File**: `src/config/validation.rs:106`
 **Problem**: `http_port = 0` is treated as "disabled", which is correct, and the
 error message for invalid ports says "must be 0 (disabled) or 1-65535". However,
 no code path actually validates the range for `http_port`: the only check is
 `http_port > 0`, which gates the conflict check, so `HttpPortInvalid` is never
 produced by a range test.
 **Solution**: Add an explicit range validation for `http_port` (0 or 1-65535),
 similar to `HttpsPortInvalid`. This is only meaningful if the field is a wider
 integer type such as `u32`; if `http_port` is a `u16`, the type already enforces
 the upper bound at deserialization and the comparison below can never be true:
 ```rust
 if listener.http_port > 65535 {
    errors.push(ValidationError::HttpPortInvalid { ... });
 }
 ```
 ---
 ### S2. Consider Using `#[non_exhaustive]` on Public Enums
 **Files**: `src/tls/acceptor.rs:49` (`TlsMode`), `src/proxy/error.rs:5` (`ProxyError`),
 `src/admin/socket.rs:15` (`AdminSocketError`), `src/config/validation.rs:10` (`ValidationError`)
 **Problem**: These public enums can be matched exhaustively by downstream consumers.
 Adding a new variant would be a breaking change.
 **Solution**: Add `#[non_exhaustive]` to these enums to allow future expansion
 without breaking changes. This is especially important for `TlsMode` (which may
 gain a `"letsencrypt"` or `"auto"` mode) and `ProxyError` (which may gain
 `UpstreamTls` error handling).
 ---
 ### S3. `normalize_host` in `dynamic_config.rs` Doesn't Handle Edge Cases
 **File**: `src/config/dynamic_config.rs:52-55`
 **Problem**: `normalize_host` uses `split(':').next()` to strip ports, but this
 fails for IPv6 addresses in brackets: `[::1]:443` is cut at the first colon and
 normalizes to `[` instead of `::1`. The `strip_port_from_host` function in
 `src/tls/redirect.rs:16-28` correctly handles this case.
 **Solution**: Either reuse the `strip_port_from_host` logic from `redirect.rs`
 or add a shared utility function. The IPv6 bracket handling is important for
 correctness if the proxy ever receives an IPv6 Host header.
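 A shared utility could look roughly like the sketch below. `host_without_port`
 is a hypothetical name, the function only strips the port and brackets (it does
 not lowercase or otherwise normalize), and it is not a copy of the existing
 `strip_port_from_host`:
 ```rust
 /// Returns the host without a trailing port or IPv6 brackets.
 pub fn host_without_port(host: &str) -> &str {
     if let Some(rest) = host.strip_prefix('[') {
         // Bracketed IPv6 literal: keep everything up to the closing bracket.
         return match rest.find(']') {
             Some(end) => &rest[..end],
             None => host, // malformed input, leave it untouched
         };
     }
     match host.rsplit_once(':') {
         // Treat the suffix as a port only if it is all digits and the
         // remainder has no colon (a bare IPv6 address has several).
         Some((name, port))
             if !name.contains(':')
                 && !port.is_empty()
                 && port.bytes().all(|b| b.is_ascii_digit()) =>
         {
             name
         }
         _ => host,
     }
 }
 ```
 Handling the bracketed form first avoids splitting on the colons inside the
 IPv6 literal, which is where the `split(':')` approach goes wrong.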
 ---
 ### S4. Multiple `#[allow(dead_code)]` Annotations on Public API
 **Files**: `src/tls/acceptor.rs:14,33,48,58`, `src/tls/acme.rs:9,11,15,23,55`,
 `src/config/static_config.rs:4,31,44,49,54,56,70,76,86,91`
 **Problem**: Many public items are annotated with `#[allow(dead_code)]`, which
 suggests they're defined but not yet used by the binary crate. This is fine
 during initial development but should be cleaned up before release.
 **Solution**: Remove `#[allow(dead_code)]` annotations once all features are
 wired up (especially after C1 is fixed, since `build_acme_challenge_config` and
 `TlsMode::Acme.challenge_config` will be needed). Run `cargo check` to identify
 which items are actually dead code vs. just not yet referenced.
 ---
 ### S5. `InFlightCounter` Could Use Atomic Usize Directly
 **File**: `src/server.rs:16-46`
 **Problem**: `InFlightCounter` wraps an `AtomicUsize` with `increment`/`decrement`/`is_zero`
 methods, plus a separate `InFlightGuard` RAII type. This is clean, but the
 `Arc<InFlightCounter>` pattern means every connection task clones the `Arc`,
 which costs one atomic reference-count increment per clone (it does not allocate).
 **Solution**: This is a fine pattern for correctness. No change needed — the
 RAII guard pattern correctly ensures `decrement` is always called even on panic.
 Mentioning as a suggestion for awareness, not action.
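 For reference, the guard pattern and its panic behavior can be sketched with
 std only. The type and method names mirror the finding, but the bodies are an
 assumed reconstruction, not the code in `src/server.rs`:
 ```rust
 use std::sync::atomic::{AtomicUsize, Ordering};
 use std::sync::Arc;

 #[derive(Default)]
 pub struct InFlightCounter {
     count: AtomicUsize,
 }

 impl InFlightCounter {
     pub fn is_zero(&self) -> bool {
         self.count.load(Ordering::SeqCst) == 0
     }
 }

 pub struct InFlightGuard {
     counter: Arc<InFlightCounter>,
 }

 impl InFlightGuard {
     pub fn new(counter: &Arc<InFlightCounter>) -> Self {
         counter.count.fetch_add(1, Ordering::SeqCst);
         // Arc::clone bumps a reference count; it does not allocate.
         Self { counter: Arc::clone(counter) }
     }
 }

 impl Drop for InFlightGuard {
     // Runs on normal scope exit and during panic unwinding alike.
     fn drop(&mut self) {
         self.counter.count.fetch_sub(1, Ordering::SeqCst);
     }
 }
 ```
 The decrement-on-panic guarantee holds when panics unwind; a build with
 `panic = "abort"` terminates the process instead of running destructors.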
 ---
 ### S6. `Clone` Derive on `SiteConfig` Is Correct but May Mask Cloning on Hot Paths
 **File**: `src/config/dynamic_config.rs:70`
 **Problem**: `SiteConfig` derives `Clone`, which is used in
 `src/proxy/handler.rs:53` where `site.clone()` clones the site config on every
 request. For hot paths, this allocates new `String`s for `host`, `upstream`, and
 `upstream_scheme`.
 **Solution**: Consider using `Arc<SiteConfig>` in the routing table so that
 lookups return an `Arc` clone (cheap atomic refcount) instead of a full `String`
 clone. This would change `routing_table` from `HashMap<String, SiteConfig>` to
 `HashMap<String, Arc<SiteConfig>>`. This is a performance optimization for
 later — not a correctness issue.
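 A minimal sketch of the `Arc`-based table follows. The three-field `SiteConfig`
 and the `RoutingTable` wrapper are hypothetical reductions used for
 illustration; the real struct has more fields and the real table may be a bare
 `HashMap`:
 ```rust
 use std::collections::HashMap;
 use std::sync::Arc;

 // Hypothetical, reduced SiteConfig with the String fields named in the finding.
 #[derive(Clone, Debug)]
 pub struct SiteConfig {
     pub host: String,
     pub upstream: String,
     pub upstream_scheme: String,
 }

 pub struct RoutingTable {
     sites: HashMap<String, Arc<SiteConfig>>,
 }

 impl RoutingTable {
     pub fn new(sites: Vec<SiteConfig>) -> Self {
         let sites = sites
             .into_iter()
             .map(|site| (site.host.clone(), Arc::new(site)))
             .collect();
         Self { sites }
     }

     /// Returns a shared handle; no String is copied per lookup.
     pub fn lookup(&self, host: &str) -> Option<Arc<SiteConfig>> {
         self.sites.get(host).cloned()
     }
 }
 ```
 Two lookups for the same host return handles to the same allocation, so the
 per-request cost drops to a hash lookup plus one reference-count increment.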
 ---
 ### S7. HTTPS Client Trusts System Root Certificates Unconditionally
 **File**: `src/proxy/handler.rs:153-165`
 **Problem**: The `root_certs()` function loads native certificates and silently
 skips any that fail to parse (`roots.add(cert).ok()`). While this matches the
 spec (review #001, W13), certificate validation failures for upstream TLS
 connections will produce opaque IO errors.
 **Solution**: This is acceptable for Phase 1 per the spec (ADR-017). Consider
 logging the number of root certificates loaded for operational visibility:
 ```rust
 let cert_count = result.certs.len();
 info!(certs_loaded = cert_count, "loaded system root certificates");
 ```
 ---
 ### S8. Request Timeout Applies to Entire HTTP Exchange, Not Just Response
 **File**: `src/proxy/handler.rs:75-79`
 **Problem**: `tokio::time::timeout(request_timeout, client.request(...))` is
 meant to bound the entire HTTP round-trip. If the response body is read inside
 the timed future, a 60-second timeout kills large file downloads or slow
 upstreams even while the upstream is actively sending data. Note that hyper's
 `client.request(...)` future resolves once the response headers arrive, so
 verify in the handler whether body streaming is actually covered. A more precise
 timeout would apply only to the connection + first-byte response, then stream
 the body without a timeout.
 **Solution**: For Phase 1, this is acceptable behavior — the timeout name
 (`upstream_request_timeout_secs`) is documented as applying to the full request.
 Consider splitting into connect-timeout and response-timeout in Phase 2. The
 `SiteConfig` already has separate `upstream_connect_timeout_secs` and
 `upstream_request_timeout_secs` fields, but `upstream_connect_timeout_secs` is
 unused (see W4).
 ---
 ## Summary Statistics
 | Severity | Count | Status |
 |----------|-------|--------|
 | Critical | 4 | Must fix before production |
 | Warning | 12 | Should fix — correctness and robustness |
 | Suggestion | 8 | Consider for code quality |
 ## Recommended Fix Priority
 1. **C1 (ACME challenge listener)** — Without this, ACME cert provisioning is
   completely broken. This is the highest priority fix.
 2. **C2 (ACME contact email)** — Without this, production Let's Encrypt
   registration fails.
 3. **C4 (ConfigReloadHandle static config drift)** — Every reload will produce
   stale warnings, confusing operators.
 4. **C3 (X-Forwarded-For comment)** — Correct behavior, just needs a clarifying
   comment to prevent future "fixes."
 5. **W1 (Shutdown drain)** — Can cause connection drops on graceful restart.
 6. **W4 (Connect timeout)** — Slow upstreams can exhaust the full request
   timeout budget.
 7. **W5 (Health path collision)** — Upstream `/health` endpoints are silently
   intercepted.
 8. **W7 (Admin socket shutdown)** — Cannot gracefully stop.
 9. **Remaining W and S findings** — Fix opportunistically.