diff --git a/docs/reviews/002-implementation-review.md b/docs/reviews/002-implementation-review.md new file mode 100644 index 0000000..627a29e --- /dev/null +++ b/docs/reviews/002-implementation-review.md @@ -0,0 +1,586 @@ +--- +status: draft +last_updated: 2026-06-11 +reviewed_code: + - src/main.rs + - src/server.rs + - src/tls/acceptor.rs + - src/tls/acme.rs + - src/tls/config.rs + - src/tls/redirect.rs + - src/config/validation.rs + - src/config/dynamic_config.rs + - src/config/static_config.rs + - src/config/mod.rs + - src/proxy/handler.rs + - src/proxy/headers.rs + - src/proxy/error.rs + - src/proxy/body_limit.rs + - src/proxy/mod.rs + - src/rate_limit/mod.rs + - src/rate_limit/bucket.rs + - src/admin/socket.rs + - src/shutdown.rs + - src/health.rs + - src/logging/mod.rs + - src/logging/format.rs + - src/cli.rs +reviewer: code-reviewer +--- + +# Implementation Review #002 + +## Purpose + +Post-implementation review of all modules. Each finding is structured as +**Problem** → **Solution** (or **Open Question** where no solution is yet known). + +## Severity Definitions + +| Severity | Meaning | +|----------|---------| +| **Critical** | Will cause incorrect behavior or security issues in production | +| **Warning** | Could cause issues under specific conditions or represents a missed edge case | +| **Suggestion** | Code quality, style, or minor improvement opportunity | + +--- + +## Critical Findings + +### C1. ACME Challenge Listener Not Started + +**File**: `src/main.rs:168-181` + +**Problem**: When `TlsMode::Acme` is matched, the `challenge_config` and `resolver` +fields are destructured with `_` (discarded). The ACME TLS-ALPN-01 challenge +listener — required for Let's Encrypt certificate provisioning — is never started. +Without it, ACME certificate issuance will fail: Let's Encrypt cannot verify domain +ownership, and no certificates will ever be obtained. + +**Solution**: Wire the challenge config into a separate TLS listener or configure +the `rustls-acme` resolver to serve TLS-ALPN-01 challenges on the same port. +The `rustls-acme` crate's `ResolvesServerCertAcme` handles ALPN protocol +negotiation automatically — if the ACME challenge ALPN protocol (`acme-tls/1`) +is included in the server's ALPN list, the resolver will serve challenge +certificates on the main HTTPS port. Verify that the `build_acme_server_config` +function includes `ACME_TLS_ALPN_01` in its ALPN list (it already does at line 28). +This means the main listener should be sufficient for TLS-ALPN-01 challenges +if the resolver is correctly installed. The separate `challenge_config` in +`TlsMode::Acme` may be unnecessary — confirm whether `rustls-acme` requires a +dedicated listener or if the resolver on the default config handles challenges +automatically. + +--- + +### C2. ACME Contact Email Always Empty + +**File**: `src/tls/acme.rs:86` + +**Problem**: `AcmeTlsConfig` always sets `contact: vec![]`. Let's Encrypt requires +a contact email for production certificate requests. The `acme_directory` config +field supports `"production"` but an empty contact list will cause ACME +registration to fail with a 400-level error from the Let's Encrypt API. + +**Solution**: Add an `acme_contact` field (e.g., `acme_contact = "mailto:admin@example.com"`) +to the `TlsConfig` struct and wire it through to `AcmeTlsConfig.contact`. This +requires changes to: +1. `src/config/static_config.rs` — add `acme_contact: Vec` to `TlsConfig` +2. `src/tls/acme.rs` — use `self.contact` in `AcmeTlsConfig::setup` +3. `src/tls/acceptor.rs` — pass the contact list from `TlsConfig` + +--- + +### C3. X-Forwarded-For Replaces Instead of Appending + +**File**: `src/proxy/headers.rs:28` + +**Problem**: `inject_proxy_headers` uses `headers.insert()` for `X-Forwarded-For`, +which **replaces** any existing value. The spec (review #001, finding C2) decided +that as an edge proxy the X-Forwarded-For should be **set** (not appended), +because there are no trusted proxies in front. However, the implementation +doesn't match either behavior — it silently discards any existing header rather +than explicitly setting it to just the client IP, which is correct for an edge +proxy but was implemented without a clear comment explaining the rationale. + +**Solution**: The current behavior is actually **correct** for an edge proxy +(per review #001/C2). The fix is documentation, not code change. Add a comment +to `inject_proxy_headers` explaining: + +```rust +// X-Forwarded-For is SET (not appended) because this proxy is the outermost +// edge proxy. Any existing X-Forwarded-For from the client is untrusted and +// must be replaced with the actual client IP from ConnectInfo. +``` + +This ensures future maintainers don't "fix" this by changing `insert` to `append`. + +--- + +### C4. ConfigReloadHandle Never Updates Stored StaticConfig + +**File**: `src/config/dynamic_config.rs:124-148` + +**Problem**: The `reload()` method computes `diff_static_config(&self.static_config, &new_static)` +and returns the changed fields, but **never updates `self.static_config`** with +the new static config. This means: +- The first reload correctly reports changed fields +- The second reload compares against the **original** static config, not the + last-reloaded one, and reports the same changes again +- Operators get repeated "static config fields changed" warnings for the same + fields on every reload + +**Solution**: After validation succeeds and the diff is computed, update +`self.static_config` with the new value. Since `ConfigReloadHandle` is accessed +concurrently, the static config field needs interior mutability. Use +`ArcSwap` or `std::sync::RwLock`: + +```rust +pub struct ConfigReloadHandle { + config: Arc>, + static_config: ArcSwap, // Changed from StaticConfig + reload_mutex: Mutex<()>, +} +``` + +Then in `reload()`: +```rust +let changed_fields = diff_static_config(&self.static_config.load(), &new_static); +self.static_config.store(Arc::new(new_static)); +self.config.store(Arc::new(new_dynamic)); +``` + +--- + +## Warning Findings + +### W1. Shutdown Aborts Listeners Without Draining + +**File**: `src/main.rs:238-240` + +**Problem**: On shutdown, `handle.abort()` is called on each HTTPS server task. +This immediately kills the tokio task, which can interrupt in-flight request +processing. The `drain_in_flight` counter is only decremented in +`InFlightGuard::drop`, but `abort()` prevents the guard from being dropped +normally for connections still being processed by the task. + +**Solution**: Instead of `abort()`, use `tokio::time::timeout` with the shutdown +timeout to `join` each handle: + +```rust +let shutdown_timeout = shutdown.shutdown_timeout(); +for handle in https_server_handles { + match tokio::time::timeout(shutdown_timeout, handle).await { + Ok(_) => {} + Err(_) => { + warn!("shutdown timeout expired, aborting listener task"); + handle.abort(); + } + } +} +``` + +Alternatively, keep the `shutdown_rx` pattern already in `serve_https_listener` +(which breaks the accept loop on shutdown signal) and remove the `abort()` calls +— the tasks will naturally exit when they stop accepting new connections and +existing connections drain. + +--- + +### W2. Shutdown Doesn't Stop Admin Socket or Background Tasks + +**File**: `src/main.rs:238-250` + +**Problem**: The shutdown sequence aborts HTTPS listeners but doesn't stop: +- The admin socket listener (runs in infinite loop at `src/admin/socket.rs:99-111`) +- The rate limiter eviction task (runs in infinite loop at `src/rate_limit/mod.rs:106-112`) +- The ACME state machine task + +These tasks will continue running until the process exits, which is fine for +process termination but means they can't be gracefully stopped in tests or +during a clean shutdown. + +**Solution**: For Phase 1, this is acceptable — process termination will clean +up these tasks. For future improvements: +- Pass a `CancellationToken` or `watch::Receiver` to `start_admin_socket` + and the eviction task +- On shutdown, signal cancellation before waiting for drain +- The ACME state machine task already exits when the stream ends (`None` branch + at `src/tls/acme.rs:152-158`), but it should also be cancellable + +--- + +### W3. Fragile Error Detection in Connection Error Handler + +**File**: `src/server.rs:95-97` + +**Problem**: The check `e.to_string().contains("incomplete message")` to silently +suppress connection errors is fragile. String matching on error descriptions can +break across hyper versions, locale changes, or error message reformatting. + +**Solution**: Match on the error type instead. Check if `hyper` exposes a +client-disconnect error variant, or use `e.is_incomplete_message()` if available +in the hyper error API. If no typed variant exists, add a comment explaining +why string matching is used and which version(s) of hyper produce this message. + +--- + +### W4. No Separate Connect Timeout for Upstream Requests + +**File**: `src/proxy/handler.rs:73-79` + +**Problem**: The proxy uses `tokio::time::timeout` with +`upstream_request_timeout_secs` for the entire request, but there's no separate +connect timeout. A slow DNS resolution or TCP handshake will consume the full +request timeout budget, leaving no time for the actual request/response cycle. + +**Solution**: Add a `connect_timeout` (either a fixed default or from +`upstream_connect_timeout_secs` in `SiteConfig`). Structure the proxy call as: + +```rust +let connect_timeout = Duration::from_secs(site.upstream_connect_timeout_secs); +let result = tokio::time::timeout( + request_timeout, + async { + let upstream_req = build_upstream_request(req, &upstream_uri)?; + // The hyper client handles connect internally, but we can wrap + // the request in a two-phase timeout if needed + client.request(upstream_req).await + } +).await; +``` + +The `SiteConfig` already has `upstream_connect_timeout_secs` but it's not used +in `proxy_handler`. This should be wired up to either set the client's connect +timeout or to implement a two-phase timeout. + +--- + +### W5. Hardcoded `/health` Path Intercepted on All Hosts + +**File**: `src/proxy/handler.rs:37-39` + +**Problem**: The proxy handler returns 200 OK for `GET /health` regardless of the +Host header. This means any site's `/health` path will be intercepted by the +proxy and never reach the upstream. If an upstream application uses `/health` +for its own health checks, those requests will never reach it. + +**Solution**: Either: +1. Use a less common path like `/__health` or `/healthz` that won't collide + with upstream applications, OR +2. Only intercept `/health` when the Host header doesn't match any known site + (fallthrough), OR +3. Make the health check path configurable via `StaticConfig` + +Option 1 is simplest for Phase 1. Option 3 is most flexible long-term. + +--- + +### W6. Token Bucket Refill Uses Millisecond Precision + +**File**: `src/rate_limit/bucket.rs:37` + +**Problem**: `elapsed.as_millis()` truncates sub-millisecond time, which can lead +to token refill inaccuracies under high-frequency request bursts. For example, +two requests arriving 500µs apart both see `0ms` elapsed and don't refill tokens. + +**Solution**: Use `as_nanos()` for the refill calculation: + +```rust +let elapsed = now.duration_since(self.last_refill).as_nanos() as f64; +let tokens_to_add = (elapsed / 1_000_000_000.0) * rate; +``` + +This provides nanosecond-precision refill while keeping the math in floating point. + +--- + +### W7. Admin Socket Has No Shutdown Mechanism + +**File**: `src/admin/socket.rs:99-111` + +**Problem**: `start_admin_socket` runs an infinite `loop` accepting connections +with no way to break out. It doesn't accept a shutdown signal, so it can't be +gracefully stopped during process shutdown or in tests. + +**Solution**: Accept a `watch::Receiver` parameter and use `tokio::select!` +to check for shutdown: + +```rust +tokio::select! { + result = listener.accept() => { /* handle connection */ }, + _ = shutdown_rx.changed() => { + info!("admin socket shutting down"); + break; + } +} +``` + +This also requires cleaning up the socket file on exit. + +--- + +### W8. Server Header Unconditionally Stripped from Upstream Response + +**File**: `src/proxy/handler.rs:85` + +**Problem**: `parts.headers.remove("server")` unconditionally removes the upstream +`Server` header. This is a design choice, not necessarily a bug, but it means +downstream clients can't see what software the upstream is running, which may be +undesirable for debugging. + +**Solution**: This is acceptable behavior for a security-focused reverse proxy +(hiding upstream identity is a defense-in-depth measure). Document this decision +with a comment in the code explaining the rationale. + +--- + +### W9. Logging Test Fails Due to Global Subscriber + +**File**: `src/logging/mod.rs:99-111` + +**Problem**: The test `init_creates_log_directory_and_file` calls `init()` which +sets a global default tracing subscriber. When tests run in parallel, this +conflicts with other tests that may also set a subscriber, causing the test to +fail with "a global default trace dispatcher has already been set." + +**Solution**: Use `tracing_subscriber::fmt().with_test_writer()` and guard +against double-initialization. Alternatively, use `std::sync::OnceLock` or +`tracing_subscriber::util::SubscriberInitExt::try_init()` which returns an error +if already set rather than panicking. + +--- + +### W10. Body Limit Middleware Only Checks Content-Length Header + +**File**: `src/proxy/body_limit.rs:26-33` + +**Problem**: The middleware checks the `Content-Length` header but doesn't handle +requests that lack `Content-Length` (e.g., chunked transfers, HTTP/2). A +malicious client can send `Transfer-Encoding: chunked` without `Content-Length` +and bypass the initial check, though the `Limited` body wrapper at line 37 will +still enforce the limit during streaming. + +**Solution**: This is actually acceptable — the `Limited` body wrapper on line +37 is the real enforcement mechanism. The `Content-Length` check on line 26 is +an early-rejection optimization for clients that do include the header. Add a +comment explaining this two-layer defense: + +```rust +// Early rejection: if Content-Length is present and exceeds the limit, reject +// immediately without reading the body. For requests without Content-Length +// (chunked, HTTP/2), the Limited body wrapper below enforces the limit during +// streaming. +``` + +--- + +### W11. Health Check Port Conflict Check Is Incomplete + +**File**: `src/config/validation.rs:169-186` + +**Problem**: The validation checks if `health_check_port` conflicts with any +listener's `https_port` or `http_port`, but it doesn't check whether the health +check port's bind address conflicts with a listener on a different bind address. +For example, `health_check_port = 80` bound to `127.0.0.1` shouldn't conflict +with a listener's `http_port = 80` bound to `203.0.113.10`. + +**Solution**: This is acceptable for Phase 1 since the health check always binds +to `127.0.0.1` (hardcoded in `src/health.rs:20`). Document that health check +always binds to localhost, so the conflict check is conservative (warns even +when it might not actually conflict). Add a comment in the validation code +explaining this. + +--- + +### W12. `build_upstream_request` Copies All Headers Without Filtering + +**File**: `src/proxy/handler.rs:124-128` + +**Problem**: `build_upstream_request` copies all headers from the original request +to the upstream request. However, `remove_hop_by_hop` was already called on the +original request's headers at line 59 before `build_upstream_request` is called, +so hop-by-hop headers have already been removed. The proxy headers +(`X-Real-IP`, `X-Forwarded-For`, `X-Forwarded-Proto`) were also already injected +at line 58. This means the function works correctly, but the code path is +spread across multiple locations, making it harder to reason about header +lifecycle. + +**Solution**: Consider consolidating header manipulation into a single function +or adding inline comments that trace the header lifecycle: + +```rust +// Header lifecycle: +// 1. inject_proxy_headers() — adds X-Real-IP, X-Forwarded-For, X-Forwarded-Proto +// 2. remove_hop_by_hop() — removes Connection, Keep-Alive, etc. +// 3. build_upstream_request() — copies remaining headers to upstream request +// 4. Response: remove_hop_by_hop() on upstream response headers +``` + +--- + +## Suggestions + +### S1. `http_port` Validation Allows 0 but Not Documented + +**File**: `src/config/validation.rs:106` + +**Problem**: `http_port = 0` is treated as "disabled" (not validated as a port +number), which is correct, but the validation error message for invalid ports +says "must be 0 (disabled) or 1-65535" while `HttpPortInvalid` only checks +`http_port > 0 && http_port < 1` implicitly. The actual check is `http_port > 0` +to enter the conflict check, and port 0 is always allowed for HTTP. + +**Solution**: Add explicit validation for `http_port` being in the range 0 or +1-65535 (currently it only validates conflicts, not the port range itself for +http). Add a `HttpPortInvalid` check similar to `HttpsPortInvalid`: + +```rust +if listener.http_port > 65535 { + errors.push(ValidationError::HttpPortInvalid { ... }); +} +``` + +--- + +### S2. Consider Using `#[non_exhaustive]` on Public Enums + +**Files**: `src/tls/acceptor.rs:49` (`TlsMode`), `src/proxy/error.rs:5` (`ProxyError`), +`src/admin/socket.rs:15` (`AdminSocketError`), `src/config/validation.rs:10` (`ValidationError`) + +**Problem**: These public enums can be matched exhaustively by downstream consumers. +Adding a new variant would be a breaking change. + +**Solution**: Add `#[non_exhaustive]` to these enums to allow future expansion +without breaking changes. This is especially important for `TlsMode` (which may +gain a `"letsencrypt"` or `"auto"` mode) and `ProxyError` (which may gain +`UpstreamTls` error handling). + +--- + +### S3. `normalize_host` in `dynamic_config.rs` Doesn't Handle Edge Cases + +**File**: `src/config/dynamic_config.rs:52-55` + +**Problem**: `normalize_host` uses `split(':').next()` to strip ports, but this +fails for IPv6 addresses in brackets (e.g., `[::1]:443` would normalize to +`[::1]` instead of `::1`). The `strip_port_from_host` function in +`src/tls/redirect.rs:16-28` correctly handles this case. + +**Solution**: Either reuse the `strip_port_from_host` logic from `redirect.rs` +or add a shared utility function. The IPv6 bracket handling is important for +correctness if the proxy ever receives an IPv6 Host header. + +--- + +### S4. Multiple `#[allow(dead_code)]` Annotations on Public API + +**Files**: `src/tls/acceptor.rs:14,33,48,58`, `src/tls/acme.rs:9,11,15,23,55`, +`src/config/static_config.rs:4,31,44,49,54,56,70,76,86,91` + +**Problem**: Many public items are annotated with `#[allow(dead_code)]`, which +suggests they're defined but not yet used by the binary crate. This is fine +during initial development but should be cleaned up before release. + +**Solution**: Remove `#[allow(dead_code)]` annotations once all features are +wired up (especially after C1 is fixed, since `build_acme_challenge_config` and +`TlsMode::Acme.challenge_config` will be needed). Run `cargo check` to identify +which items are actually dead code vs. just not yet referenced. + +--- + +### S5. `InFlightCounter` Could Use Atomic Usize Directly + +**File**: `src/server.rs:16-46` + +**Problem**: `InFlightCounter` wraps an `AtomicUsize` with `increment`/`decrement`/`is_zero` +methods, plus a separate `InFlightGuard` RAII type. This is clean, but the +`Arc` pattern means every connection task clones the `Arc`, +which has a small allocation cost. + +**Solution**: This is a fine pattern for correctness. No change needed — the +RAII guard pattern correctly ensures `decrement` is always called even on panic. +Mentioning as a suggestion for awareness, not action. + +--- + +### S6. Add `Clone` to `SiteConfig` Derive Is Correct But May Mask Cloning Hot Paths + +**File**: `src/config/dynamic_config.rs:70` + +**Problem**: `SiteConfig` derives `Clone`, which is used in +`src/proxy/handler.rs:53` where `site.clone()` clones the site config on every +request. For hot paths, this allocates new `String`s for `host`, `upstream`, and +`upstream_scheme`. + +**Solution**: Consider using `Arc` in the routing table so that +lookups return an `Arc` clone (cheap atomic refcount) instead of a full `String` +clone. This would change `routing_table` from `HashMap` to +`HashMap>`. This is a performance optimization for +later — not a correctness issue. + +--- + +### S7. HTTPS Client Trusts System Root Certificates Unconditionally + +**File**: `src/proxy/handler.rs:153-165` + +**Problem**: The `root_certs()` function loads native certificates and silently +skips any that fail to parse (`roots.add(cert).ok()`). While this matches the +spec (review #001, W13), certificate validation failures for upstream TLS +connections will produce opaque IO errors. + +**Solution**: This is acceptable for Phase 1 per the spec (ADR-017). Consider +logging the number of root certificates loaded for operational visibility: + +```rust +let cert_count = result.certs.len(); +info!(certs_loaded = cert_count, "loaded system root certificates"); +``` + +--- + +### S8. Request Timeout Applies to Entire HTTP Exchange, Not Just Response + +**File**: `src/proxy/handler.rs:75-79` + +**Problem**: `tokio::time::timeout(request_timeout, client.request(...))` applies +the timeout to the entire HTTP round-trip including response body streaming. +For large file downloads or slow upstreams, this means a 60-second timeout kills +the response even if the upstream is actively sending data. A more precise +timeout would apply only to the connection + first-byte response, then stream +the body without a timeout. + +**Solution**: For Phase 1, this is acceptable behavior — the timeout name +(`upstream_request_timeout_secs`) is documented as applying to the full request. +Consider splitting into connect-timeout and response-timeout in Phase 2. The +`SiteConfig` already has separate `upstream_connect_timeout_secs` and +`upstream_request_timeout_secs` fields, but `upstream_connect_timeout_secs` is +unused (see W4). + +--- + +## Summary Statistics + +| Severity | Count | Status | +|----------|-------|--------| +| Critical | 4 | Must fix before production | +| Warning | 12 | Should fix — correctness and robustness | +| Suggestion | 8 | Consider for code quality | + +## Recommended Fix Priority + +1. **C1 (ACME challenge listener)** — Without this, ACME cert provisioning is + completely broken. This is the highest priority fix. +2. **C2 (ACME contact email)** — Without this, production Let's Encrypt + registration fails. +3. **C4 (ConfigReloadHandle static config drift)** — Every reload will produce + stale warnings, confusing operators. +4. **C3 (X-Forwarded-For comment)** — Correct behavior, just needs a clarifying + comment to prevent future "fixes." +5. **W1 (Shutdown drain)** — Can cause connection drops on graceful restart. +6. **W4 (Connect timeout)** — Slow upstreams can exhaust the full request + timeout budget. +7. **W5 (Health path collision)** — Upstream `/health` endpoints are silently + intercepted. +8. **W7 (Admin socket shutdown)** — Cannot gracefully stop. +9. **Remaining W and S findings** — Fix opportunistically. \ No newline at end of file