--- status: draft last_updated: 2026-06-11 reviewed_code: - src/main.rs - src/server.rs - src/tls/acceptor.rs - src/tls/acme.rs - src/tls/config.rs - src/tls/redirect.rs - src/config/validation.rs - src/config/dynamic_config.rs - src/config/static_config.rs - src/config/mod.rs - src/proxy/handler.rs - src/proxy/headers.rs - src/proxy/error.rs - src/proxy/body_limit.rs - src/proxy/mod.rs - src/rate_limit/mod.rs - src/rate_limit/bucket.rs - src/admin/socket.rs - src/shutdown.rs - src/health.rs - src/logging/mod.rs - src/logging/format.rs - src/cli.rs reviewer: code-reviewer --- # Implementation Review #002 ## Purpose Post-implementation review of all modules. Each finding is structured as **Problem** → **Solution** (or **Open Question** where no solution is yet known). ## Severity Definitions | Severity | Meaning | |----------|---------| | **Critical** | Will cause incorrect behavior or security issues in production | | **Warning** | Could cause issues under specific conditions or represents a missed edge case | | **Suggestion** | Code quality, style, or minor improvement opportunity | --- ## Critical Findings ### C1. ACME Challenge Listener Not Started **File**: `src/main.rs:168-181` **Problem**: When `TlsMode::Acme` is matched, the `challenge_config` and `resolver` fields are destructured with `_` (discarded). The ACME TLS-ALPN-01 challenge listener — required for Let's Encrypt certificate provisioning — is never started. Without it, ACME certificate issuance will fail: Let's Encrypt cannot verify domain ownership, and no certificates will ever be obtained. **Solution**: Wire the challenge config into a separate TLS listener or configure the `rustls-acme` resolver to serve TLS-ALPN-01 challenges on the same port. The `rustls-acme` crate's `ResolvesServerCertAcme` handles ALPN protocol negotiation automatically — if the ACME challenge ALPN protocol (`acme-tls/1`) is included in the server's ALPN list, the resolver will serve challenge certificates on the main HTTPS port. Verify that the `build_acme_server_config` function includes `ACME_TLS_ALPN_01` in its ALPN list (it already does at line 28). This means the main listener should be sufficient for TLS-ALPN-01 challenges if the resolver is correctly installed. The separate `challenge_config` in `TlsMode::Acme` may be unnecessary — confirm whether `rustls-acme` requires a dedicated listener or if the resolver on the default config handles challenges automatically. --- ### C2. ACME Contact Email Always Empty **File**: `src/tls/acme.rs:86` **Problem**: `AcmeTlsConfig` always sets `contact: vec![]`. Let's Encrypt requires a contact email for production certificate requests. The `acme_directory` config field supports `"production"` but an empty contact list will cause ACME registration to fail with a 400-level error from the Let's Encrypt API. **Solution**: Add an `acme_contact` field (e.g., `acme_contact = "mailto:admin@example.com"`) to the `TlsConfig` struct and wire it through to `AcmeTlsConfig.contact`. This requires changes to: 1. `src/config/static_config.rs` — add `acme_contact: Vec` to `TlsConfig` 2. `src/tls/acme.rs` — use `self.contact` in `AcmeTlsConfig::setup` 3. `src/tls/acceptor.rs` — pass the contact list from `TlsConfig` --- ### C3. X-Forwarded-For Replaces Instead of Appending **File**: `src/proxy/headers.rs:28` **Problem**: `inject_proxy_headers` uses `headers.insert()` for `X-Forwarded-For`, which **replaces** any existing value. The spec (review #001, finding C2) decided that as an edge proxy the X-Forwarded-For should be **set** (not appended), because there are no trusted proxies in front. However, the implementation doesn't match either behavior — it silently discards any existing header rather than explicitly setting it to just the client IP, which is correct for an edge proxy but was implemented without a clear comment explaining the rationale. **Solution**: The current behavior is actually **correct** for an edge proxy (per review #001/C2). The fix is documentation, not code change. Add a comment to `inject_proxy_headers` explaining: ```rust // X-Forwarded-For is SET (not appended) because this proxy is the outermost // edge proxy. Any existing X-Forwarded-For from the client is untrusted and // must be replaced with the actual client IP from ConnectInfo. ``` This ensures future maintainers don't "fix" this by changing `insert` to `append`. --- ### C4. ConfigReloadHandle Never Updates Stored StaticConfig **File**: `src/config/dynamic_config.rs:124-148` **Problem**: The `reload()` method computes `diff_static_config(&self.static_config, &new_static)` and returns the changed fields, but **never updates `self.static_config`** with the new static config. This means: - The first reload correctly reports changed fields - The second reload compares against the **original** static config, not the last-reloaded one, and reports the same changes again - Operators get repeated "static config fields changed" warnings for the same fields on every reload **Solution**: After validation succeeds and the diff is computed, update `self.static_config` with the new value. Since `ConfigReloadHandle` is accessed concurrently, the static config field needs interior mutability. Use `ArcSwap` or `std::sync::RwLock`: ```rust pub struct ConfigReloadHandle { config: Arc>, static_config: ArcSwap, // Changed from StaticConfig reload_mutex: Mutex<()>, } ``` Then in `reload()`: ```rust let changed_fields = diff_static_config(&self.static_config.load(), &new_static); self.static_config.store(Arc::new(new_static)); self.config.store(Arc::new(new_dynamic)); ``` --- ## Warning Findings ### W1. Shutdown Aborts Listeners Without Draining **File**: `src/main.rs:238-240` **Problem**: On shutdown, `handle.abort()` is called on each HTTPS server task. This immediately kills the tokio task, which can interrupt in-flight request processing. The `drain_in_flight` counter is only decremented in `InFlightGuard::drop`, but `abort()` prevents the guard from being dropped normally for connections still being processed by the task. **Solution**: Instead of `abort()`, use `tokio::time::timeout` with the shutdown timeout to `join` each handle: ```rust let shutdown_timeout = shutdown.shutdown_timeout(); for handle in https_server_handles { match tokio::time::timeout(shutdown_timeout, handle).await { Ok(_) => {} Err(_) => { warn!("shutdown timeout expired, aborting listener task"); handle.abort(); } } } ``` Alternatively, keep the `shutdown_rx` pattern already in `serve_https_listener` (which breaks the accept loop on shutdown signal) and remove the `abort()` calls — the tasks will naturally exit when they stop accepting new connections and existing connections drain. --- ### W2. Shutdown Doesn't Stop Admin Socket or Background Tasks **File**: `src/main.rs:238-250` **Problem**: The shutdown sequence aborts HTTPS listeners but doesn't stop: - The admin socket listener (runs in infinite loop at `src/admin/socket.rs:99-111`) - The rate limiter eviction task (runs in infinite loop at `src/rate_limit/mod.rs:106-112`) - The ACME state machine task These tasks will continue running until the process exits, which is fine for process termination but means they can't be gracefully stopped in tests or during a clean shutdown. **Solution**: For Phase 1, this is acceptable — process termination will clean up these tasks. For future improvements: - Pass a `CancellationToken` or `watch::Receiver` to `start_admin_socket` and the eviction task - On shutdown, signal cancellation before waiting for drain - The ACME state machine task already exits when the stream ends (`None` branch at `src/tls/acme.rs:152-158`), but it should also be cancellable --- ### W3. Fragile Error Detection in Connection Error Handler **File**: `src/server.rs:95-97` **Problem**: The check `e.to_string().contains("incomplete message")` to silently suppress connection errors is fragile. String matching on error descriptions can break across hyper versions, locale changes, or error message reformatting. **Solution**: Match on the error type instead. Check if `hyper` exposes a client-disconnect error variant, or use `e.is_incomplete_message()` if available in the hyper error API. If no typed variant exists, add a comment explaining why string matching is used and which version(s) of hyper produce this message. --- ### W4. No Separate Connect Timeout for Upstream Requests **File**: `src/proxy/handler.rs:73-79` **Problem**: The proxy uses `tokio::time::timeout` with `upstream_request_timeout_secs` for the entire request, but there's no separate connect timeout. A slow DNS resolution or TCP handshake will consume the full request timeout budget, leaving no time for the actual request/response cycle. **Solution**: Add a `connect_timeout` (either a fixed default or from `upstream_connect_timeout_secs` in `SiteConfig`). Structure the proxy call as: ```rust let connect_timeout = Duration::from_secs(site.upstream_connect_timeout_secs); let result = tokio::time::timeout( request_timeout, async { let upstream_req = build_upstream_request(req, &upstream_uri)?; // The hyper client handles connect internally, but we can wrap // the request in a two-phase timeout if needed client.request(upstream_req).await } ).await; ``` The `SiteConfig` already has `upstream_connect_timeout_secs` but it's not used in `proxy_handler`. This should be wired up to either set the client's connect timeout or to implement a two-phase timeout. --- ### W5. Hardcoded `/health` Path Intercepted on All Hosts **File**: `src/proxy/handler.rs:37-39` **Problem**: The proxy handler returns 200 OK for `GET /health` regardless of the Host header. This means any site's `/health` path will be intercepted by the proxy and never reach the upstream. If an upstream application uses `/health` for its own health checks, those requests will never reach it. **Solution**: Either: 1. Use a less common path like `/__health` or `/healthz` that won't collide with upstream applications, OR 2. Only intercept `/health` when the Host header doesn't match any known site (fallthrough), OR 3. Make the health check path configurable via `StaticConfig` Option 1 is simplest for Phase 1. Option 3 is most flexible long-term. --- ### W6. Token Bucket Refill Uses Millisecond Precision **File**: `src/rate_limit/bucket.rs:37` **Problem**: `elapsed.as_millis()` truncates sub-millisecond time, which can lead to token refill inaccuracies under high-frequency request bursts. For example, two requests arriving 500µs apart both see `0ms` elapsed and don't refill tokens. **Solution**: Use `as_nanos()` for the refill calculation: ```rust let elapsed = now.duration_since(self.last_refill).as_nanos() as f64; let tokens_to_add = (elapsed / 1_000_000_000.0) * rate; ``` This provides nanosecond-precision refill while keeping the math in floating point. --- ### W7. Admin Socket Has No Shutdown Mechanism **File**: `src/admin/socket.rs:99-111` **Problem**: `start_admin_socket` runs an infinite `loop` accepting connections with no way to break out. It doesn't accept a shutdown signal, so it can't be gracefully stopped during process shutdown or in tests. **Solution**: Accept a `watch::Receiver` parameter and use `tokio::select!` to check for shutdown: ```rust tokio::select! { result = listener.accept() => { /* handle connection */ }, _ = shutdown_rx.changed() => { info!("admin socket shutting down"); break; } } ``` This also requires cleaning up the socket file on exit. --- ### W8. Server Header Unconditionally Stripped from Upstream Response **File**: `src/proxy/handler.rs:85` **Problem**: `parts.headers.remove("server")` unconditionally removes the upstream `Server` header. This is a design choice, not necessarily a bug, but it means downstream clients can't see what software the upstream is running, which may be undesirable for debugging. **Solution**: This is acceptable behavior for a security-focused reverse proxy (hiding upstream identity is a defense-in-depth measure). Document this decision with a comment in the code explaining the rationale. --- ### W9. Logging Test Fails Due to Global Subscriber **File**: `src/logging/mod.rs:99-111` **Problem**: The test `init_creates_log_directory_and_file` calls `init()` which sets a global default tracing subscriber. When tests run in parallel, this conflicts with other tests that may also set a subscriber, causing the test to fail with "a global default trace dispatcher has already been set." **Solution**: Use `tracing_subscriber::fmt().with_test_writer()` and guard against double-initialization. Alternatively, use `std::sync::OnceLock` or `tracing_subscriber::util::SubscriberInitExt::try_init()` which returns an error if already set rather than panicking. --- ### W10. Body Limit Middleware Only Checks Content-Length Header **File**: `src/proxy/body_limit.rs:26-33` **Problem**: The middleware checks the `Content-Length` header but doesn't handle requests that lack `Content-Length` (e.g., chunked transfers, HTTP/2). A malicious client can send `Transfer-Encoding: chunked` without `Content-Length` and bypass the initial check, though the `Limited` body wrapper at line 37 will still enforce the limit during streaming. **Solution**: This is actually acceptable — the `Limited` body wrapper on line 37 is the real enforcement mechanism. The `Content-Length` check on line 26 is an early-rejection optimization for clients that do include the header. Add a comment explaining this two-layer defense: ```rust // Early rejection: if Content-Length is present and exceeds the limit, reject // immediately without reading the body. For requests without Content-Length // (chunked, HTTP/2), the Limited body wrapper below enforces the limit during // streaming. ``` --- ### W11. Health Check Port Conflict Check Is Incomplete **File**: `src/config/validation.rs:169-186` **Problem**: The validation checks if `health_check_port` conflicts with any listener's `https_port` or `http_port`, but it doesn't check whether the health check port's bind address conflicts with a listener on a different bind address. For example, `health_check_port = 80` bound to `127.0.0.1` shouldn't conflict with a listener's `http_port = 80` bound to `203.0.113.10`. **Solution**: This is acceptable for Phase 1 since the health check always binds to `127.0.0.1` (hardcoded in `src/health.rs:20`). Document that health check always binds to localhost, so the conflict check is conservative (warns even when it might not actually conflict). Add a comment in the validation code explaining this. --- ### W12. `build_upstream_request` Copies All Headers Without Filtering **File**: `src/proxy/handler.rs:124-128` **Problem**: `build_upstream_request` copies all headers from the original request to the upstream request. However, `remove_hop_by_hop` was already called on the original request's headers at line 59 before `build_upstream_request` is called, so hop-by-hop headers have already been removed. The proxy headers (`X-Real-IP`, `X-Forwarded-For`, `X-Forwarded-Proto`) were also already injected at line 58. This means the function works correctly, but the code path is spread across multiple locations, making it harder to reason about header lifecycle. **Solution**: Consider consolidating header manipulation into a single function or adding inline comments that trace the header lifecycle: ```rust // Header lifecycle: // 1. inject_proxy_headers() — adds X-Real-IP, X-Forwarded-For, X-Forwarded-Proto // 2. remove_hop_by_hop() — removes Connection, Keep-Alive, etc. // 3. build_upstream_request() — copies remaining headers to upstream request // 4. Response: remove_hop_by_hop() on upstream response headers ``` --- ## Suggestions ### S1. `http_port` Validation Allows 0 but Not Documented **File**: `src/config/validation.rs:106` **Problem**: `http_port = 0` is treated as "disabled" (not validated as a port number), which is correct, but the validation error message for invalid ports says "must be 0 (disabled) or 1-65535" while `HttpPortInvalid` only checks `http_port > 0 && http_port < 1` implicitly. The actual check is `http_port > 0` to enter the conflict check, and port 0 is always allowed for HTTP. **Solution**: Add explicit validation for `http_port` being in the range 0 or 1-65535 (currently it only validates conflicts, not the port range itself for http). Add a `HttpPortInvalid` check similar to `HttpsPortInvalid`: ```rust if listener.http_port > 65535 { errors.push(ValidationError::HttpPortInvalid { ... }); } ``` --- ### S2. Consider Using `#[non_exhaustive]` on Public Enums **Files**: `src/tls/acceptor.rs:49` (`TlsMode`), `src/proxy/error.rs:5` (`ProxyError`), `src/admin/socket.rs:15` (`AdminSocketError`), `src/config/validation.rs:10` (`ValidationError`) **Problem**: These public enums can be matched exhaustively by downstream consumers. Adding a new variant would be a breaking change. **Solution**: Add `#[non_exhaustive]` to these enums to allow future expansion without breaking changes. This is especially important for `TlsMode` (which may gain a `"letsencrypt"` or `"auto"` mode) and `ProxyError` (which may gain `UpstreamTls` error handling). --- ### S3. `normalize_host` in `dynamic_config.rs` Doesn't Handle Edge Cases **File**: `src/config/dynamic_config.rs:52-55` **Problem**: `normalize_host` uses `split(':').next()` to strip ports, but this fails for IPv6 addresses in brackets (e.g., `[::1]:443` would normalize to `[::1]` instead of `::1`). The `strip_port_from_host` function in `src/tls/redirect.rs:16-28` correctly handles this case. **Solution**: Either reuse the `strip_port_from_host` logic from `redirect.rs` or add a shared utility function. The IPv6 bracket handling is important for correctness if the proxy ever receives an IPv6 Host header. --- ### S4. Multiple `#[allow(dead_code)]` Annotations on Public API **Files**: `src/tls/acceptor.rs:14,33,48,58`, `src/tls/acme.rs:9,11,15,23,55`, `src/config/static_config.rs:4,31,44,49,54,56,70,76,86,91` **Problem**: Many public items are annotated with `#[allow(dead_code)]`, which suggests they're defined but not yet used by the binary crate. This is fine during initial development but should be cleaned up before release. **Solution**: Remove `#[allow(dead_code)]` annotations once all features are wired up (especially after C1 is fixed, since `build_acme_challenge_config` and `TlsMode::Acme.challenge_config` will be needed). Run `cargo check` to identify which items are actually dead code vs. just not yet referenced. --- ### S5. `InFlightCounter` Could Use Atomic Usize Directly **File**: `src/server.rs:16-46` **Problem**: `InFlightCounter` wraps an `AtomicUsize` with `increment`/`decrement`/`is_zero` methods, plus a separate `InFlightGuard` RAII type. This is clean, but the `Arc` pattern means every connection task clones the `Arc`, which has a small allocation cost. **Solution**: This is a fine pattern for correctness. No change needed — the RAII guard pattern correctly ensures `decrement` is always called even on panic. Mentioning as a suggestion for awareness, not action. --- ### S6. Add `Clone` to `SiteConfig` Derive Is Correct But May Mask Cloning Hot Paths **File**: `src/config/dynamic_config.rs:70` **Problem**: `SiteConfig` derives `Clone`, which is used in `src/proxy/handler.rs:53` where `site.clone()` clones the site config on every request. For hot paths, this allocates new `String`s for `host`, `upstream`, and `upstream_scheme`. **Solution**: Consider using `Arc` in the routing table so that lookups return an `Arc` clone (cheap atomic refcount) instead of a full `String` clone. This would change `routing_table` from `HashMap` to `HashMap>`. This is a performance optimization for later — not a correctness issue. --- ### S7. HTTPS Client Trusts System Root Certificates Unconditionally **File**: `src/proxy/handler.rs:153-165` **Problem**: The `root_certs()` function loads native certificates and silently skips any that fail to parse (`roots.add(cert).ok()`). While this matches the spec (review #001, W13), certificate validation failures for upstream TLS connections will produce opaque IO errors. **Solution**: This is acceptable for Phase 1 per the spec (ADR-017). Consider logging the number of root certificates loaded for operational visibility: ```rust let cert_count = result.certs.len(); info!(certs_loaded = cert_count, "loaded system root certificates"); ``` --- ### S8. Request Timeout Applies to Entire HTTP Exchange, Not Just Response **File**: `src/proxy/handler.rs:75-79` **Problem**: `tokio::time::timeout(request_timeout, client.request(...))` applies the timeout to the entire HTTP round-trip including response body streaming. For large file downloads or slow upstreams, this means a 60-second timeout kills the response even if the upstream is actively sending data. A more precise timeout would apply only to the connection + first-byte response, then stream the body without a timeout. **Solution**: For Phase 1, this is acceptable behavior — the timeout name (`upstream_request_timeout_secs`) is documented as applying to the full request. Consider splitting into connect-timeout and response-timeout in Phase 2. The `SiteConfig` already has separate `upstream_connect_timeout_secs` and `upstream_request_timeout_secs` fields, but `upstream_connect_timeout_secs` is unused (see W4). --- ## Summary Statistics | Severity | Count | Status | |----------|-------|--------| | Critical | 4 | Must fix before production | | Warning | 12 | Should fix — correctness and robustness | | Suggestion | 8 | Consider for code quality | ## Recommended Fix Priority 1. **C1 (ACME challenge listener)** — Without this, ACME cert provisioning is completely broken. This is the highest priority fix. 2. **C2 (ACME contact email)** — Without this, production Let's Encrypt registration fails. 3. **C4 (ConfigReloadHandle static config drift)** — Every reload will produce stale warnings, confusing operators. 4. **C3 (X-Forwarded-For comment)** — Correct behavior, just needs a clarifying comment to prevent future "fixes." 5. **W1 (Shutdown drain)** — Can cause connection drops on graceful restart. 6. **W4 (Connect timeout)** — Slow upstreams can exhaust the full request timeout budget. 7. **W5 (Health path collision)** — Upstream `/health` endpoints are silently intercepted. 8. **W7 (Admin socket shutdown)** — Cannot gracefully stop. 9. **Remaining W and S findings** — Fix opportunistically.