Files
reverse-proxy/docs/reviews/002-implementation-review.md

22 KiB

status, last_updated, reviewed_code, reviewer
status last_updated reviewed_code reviewer
draft 2026-06-11
src/main.rs
src/server.rs
src/tls/acceptor.rs
src/tls/acme.rs
src/tls/config.rs
src/tls/redirect.rs
src/config/validation.rs
src/config/dynamic_config.rs
src/config/static_config.rs
src/config/mod.rs
src/proxy/handler.rs
src/proxy/headers.rs
src/proxy/error.rs
src/proxy/body_limit.rs
src/proxy/mod.rs
src/rate_limit/mod.rs
src/rate_limit/bucket.rs
src/admin/socket.rs
src/shutdown.rs
src/health.rs
src/logging/mod.rs
src/logging/format.rs
src/cli.rs
code-reviewer

Implementation Review #002

Purpose

Post-implementation review of all modules. Each finding is structured as ProblemSolution (or Open Question where no solution is yet known).

Severity Definitions

Severity Meaning
Critical Will cause incorrect behavior or security issues in production
Warning Could cause issues under specific conditions or represents a missed edge case
Suggestion Code quality, style, or minor improvement opportunity

Critical Findings

C1. ACME Challenge Listener Not Started

File: src/main.rs:168-181

Problem: When TlsMode::Acme is matched, the challenge_config and resolver fields are destructured with _ (discarded). The ACME TLS-ALPN-01 challenge listener — required for Let's Encrypt certificate provisioning — is never started. Without it, ACME certificate issuance will fail: Let's Encrypt cannot verify domain ownership, and no certificates will ever be obtained.

Solution: Wire the challenge config into a separate TLS listener or configure the rustls-acme resolver to serve TLS-ALPN-01 challenges on the same port. The rustls-acme crate's ResolvesServerCertAcme handles ALPN protocol negotiation automatically — if the ACME challenge ALPN protocol (acme-tls/1) is included in the server's ALPN list, the resolver will serve challenge certificates on the main HTTPS port. Verify that the build_acme_server_config function includes ACME_TLS_ALPN_01 in its ALPN list (it already does at line 28). This means the main listener should be sufficient for TLS-ALPN-01 challenges if the resolver is correctly installed. The separate challenge_config in TlsMode::Acme may be unnecessary — confirm whether rustls-acme requires a dedicated listener or if the resolver on the default config handles challenges automatically.


C2. ACME Contact Email Always Empty

File: src/tls/acme.rs:86

Problem: AcmeTlsConfig always sets contact: vec![]. Let's Encrypt requires a contact email for production certificate requests. The acme_directory config field supports "production" but an empty contact list will cause ACME registration to fail with a 400-level error from the Let's Encrypt API.

Solution: Add an acme_contact field (e.g., acme_contact = "mailto:admin@example.com") to the TlsConfig struct and wire it through to AcmeTlsConfig.contact. This requires changes to:

  1. src/config/static_config.rs — add acme_contact: Vec<String> to TlsConfig
  2. src/tls/acme.rs — use self.contact in AcmeTlsConfig::setup
  3. src/tls/acceptor.rs — pass the contact list from TlsConfig

C3. X-Forwarded-For Replaces Instead of Appending

File: src/proxy/headers.rs:28

Problem: inject_proxy_headers uses headers.insert() for X-Forwarded-For, which replaces any existing value. The spec (review #001, finding C2) decided that as an edge proxy the X-Forwarded-For should be set (not appended), because there are no trusted proxies in front. However, the implementation doesn't match either behavior — it silently discards any existing header rather than explicitly setting it to just the client IP, which is correct for an edge proxy but was implemented without a clear comment explaining the rationale.

Solution: The current behavior is actually correct for an edge proxy (per review #001/C2). The fix is documentation, not code change. Add a comment to inject_proxy_headers explaining:

// X-Forwarded-For is SET (not appended) because this proxy is the outermost
// edge proxy. Any existing X-Forwarded-For from the client is untrusted and
// must be replaced with the actual client IP from ConnectInfo.

This ensures future maintainers don't "fix" this by changing insert to append.


C4. ConfigReloadHandle Never Updates Stored StaticConfig

File: src/config/dynamic_config.rs:124-148

Problem: The reload() method computes diff_static_config(&self.static_config, &new_static) and returns the changed fields, but never updates self.static_config with the new static config. This means:

  • The first reload correctly reports changed fields
  • The second reload compares against the original static config, not the last-reloaded one, and reports the same changes again
  • Operators get repeated "static config fields changed" warnings for the same fields on every reload

Solution: After validation succeeds and the diff is computed, update self.static_config with the new value. Since ConfigReloadHandle is accessed concurrently, the static config field needs interior mutability. Use ArcSwap<StaticConfig> or std::sync::RwLock<StaticConfig>:

pub struct ConfigReloadHandle {
    config: Arc<ArcSwap<DynamicConfig>>,
    static_config: ArcSwap<StaticConfig>,  // Changed from StaticConfig
    reload_mutex: Mutex<()>,
}

Then in reload():

let changed_fields = diff_static_config(&self.static_config.load(), &new_static);
self.static_config.store(Arc::new(new_static));
self.config.store(Arc::new(new_dynamic));

Warning Findings

W1. Shutdown Aborts Listeners Without Draining

File: src/main.rs:238-240

Problem: On shutdown, handle.abort() is called on each HTTPS server task. This immediately kills the tokio task, which can interrupt in-flight request processing. The drain_in_flight counter is only decremented in InFlightGuard::drop, but abort() prevents the guard from being dropped normally for connections still being processed by the task.

Solution: Instead of abort(), use tokio::time::timeout with the shutdown timeout to join each handle:

let shutdown_timeout = shutdown.shutdown_timeout();
for handle in https_server_handles {
    match tokio::time::timeout(shutdown_timeout, handle).await {
        Ok(_) => {}
        Err(_) => {
            warn!("shutdown timeout expired, aborting listener task");
            handle.abort();
        }
    }
}

Alternatively, keep the shutdown_rx pattern already in serve_https_listener (which breaks the accept loop on shutdown signal) and remove the abort() calls — the tasks will naturally exit when they stop accepting new connections and existing connections drain.


W2. Shutdown Doesn't Stop Admin Socket or Background Tasks

File: src/main.rs:238-250

Problem: The shutdown sequence aborts HTTPS listeners but doesn't stop:

  • The admin socket listener (runs in infinite loop at src/admin/socket.rs:99-111)
  • The rate limiter eviction task (runs in infinite loop at src/rate_limit/mod.rs:106-112)
  • The ACME state machine task

These tasks will continue running until the process exits, which is fine for process termination but means they can't be gracefully stopped in tests or during a clean shutdown.

Solution: For Phase 1, this is acceptable — process termination will clean up these tasks. For future improvements:

  • Pass a CancellationToken or watch::Receiver<bool> to start_admin_socket and the eviction task
  • On shutdown, signal cancellation before waiting for drain
  • The ACME state machine task already exits when the stream ends (None branch at src/tls/acme.rs:152-158), but it should also be cancellable

W3. Fragile Error Detection in Connection Error Handler

File: src/server.rs:95-97

Problem: The check e.to_string().contains("incomplete message") to silently suppress connection errors is fragile. String matching on error descriptions can break across hyper versions, locale changes, or error message reformatting.

Solution: Match on the error type instead. Check if hyper exposes a client-disconnect error variant, or use e.is_incomplete_message() if available in the hyper error API. If no typed variant exists, add a comment explaining why string matching is used and which version(s) of hyper produce this message.


W4. No Separate Connect Timeout for Upstream Requests

File: src/proxy/handler.rs:73-79

Problem: The proxy uses tokio::time::timeout with upstream_request_timeout_secs for the entire request, but there's no separate connect timeout. A slow DNS resolution or TCP handshake will consume the full request timeout budget, leaving no time for the actual request/response cycle.

Solution: Add a connect_timeout (either a fixed default or from upstream_connect_timeout_secs in SiteConfig). Structure the proxy call as:

let connect_timeout = Duration::from_secs(site.upstream_connect_timeout_secs);
let result = tokio::time::timeout(
    request_timeout,
    async {
        let upstream_req = build_upstream_request(req, &upstream_uri)?;
        // The hyper client handles connect internally, but we can wrap
        // the request in a two-phase timeout if needed
        client.request(upstream_req).await
    }
).await;

The SiteConfig already has upstream_connect_timeout_secs but it's not used in proxy_handler. This should be wired up to either set the client's connect timeout or to implement a two-phase timeout.


W5. Hardcoded /health Path Intercepted on All Hosts

File: src/proxy/handler.rs:37-39

Problem: The proxy handler returns 200 OK for GET /health regardless of the Host header. This means any site's /health path will be intercepted by the proxy and never reach the upstream. If an upstream application uses /health for its own health checks, those requests will never reach it.

Solution: Either:

  1. Use a less common path like /__health or /healthz that won't collide with upstream applications, OR
  2. Only intercept /health when the Host header doesn't match any known site (fallthrough), OR
  3. Make the health check path configurable via StaticConfig

Option 1 is simplest for Phase 1. Option 3 is most flexible long-term.


W6. Token Bucket Refill Uses Millisecond Precision

File: src/rate_limit/bucket.rs:37

Problem: elapsed.as_millis() truncates sub-millisecond time, which can lead to token refill inaccuracies under high-frequency request bursts. For example, two requests arriving 500µs apart both see 0ms elapsed and don't refill tokens.

Solution: Use as_nanos() for the refill calculation:

let elapsed = now.duration_since(self.last_refill).as_nanos() as f64;
let tokens_to_add = (elapsed / 1_000_000_000.0) * rate;

This provides nanosecond-precision refill while keeping the math in floating point.


W7. Admin Socket Has No Shutdown Mechanism

File: src/admin/socket.rs:99-111

Problem: start_admin_socket runs an infinite loop accepting connections with no way to break out. It doesn't accept a shutdown signal, so it can't be gracefully stopped during process shutdown or in tests.

Solution: Accept a watch::Receiver<bool> parameter and use tokio::select! to check for shutdown:

tokio::select! {
    result = listener.accept() => { /* handle connection */ },
    _ = shutdown_rx.changed() => {
        info!("admin socket shutting down");
        break;
    }
}

This also requires cleaning up the socket file on exit.


W8. Server Header Unconditionally Stripped from Upstream Response

File: src/proxy/handler.rs:85

Problem: parts.headers.remove("server") unconditionally removes the upstream Server header. This is a design choice, not necessarily a bug, but it means downstream clients can't see what software the upstream is running, which may be undesirable for debugging.

Solution: This is acceptable behavior for a security-focused reverse proxy (hiding upstream identity is a defense-in-depth measure). Document this decision with a comment in the code explaining the rationale.


W9. Logging Test Fails Due to Global Subscriber

File: src/logging/mod.rs:99-111

Problem: The test init_creates_log_directory_and_file calls init() which sets a global default tracing subscriber. When tests run in parallel, this conflicts with other tests that may also set a subscriber, causing the test to fail with "a global default trace dispatcher has already been set."

Solution: Use tracing_subscriber::fmt().with_test_writer() and guard against double-initialization. Alternatively, use std::sync::OnceLock or tracing_subscriber::util::SubscriberInitExt::try_init() which returns an error if already set rather than panicking.


W10. Body Limit Middleware Only Checks Content-Length Header

File: src/proxy/body_limit.rs:26-33

Problem: The middleware checks the Content-Length header but doesn't handle requests that lack Content-Length (e.g., chunked transfers, HTTP/2). A malicious client can send Transfer-Encoding: chunked without Content-Length and bypass the initial check, though the Limited body wrapper at line 37 will still enforce the limit during streaming.

Solution: This is actually acceptable — the Limited body wrapper on line 37 is the real enforcement mechanism. The Content-Length check on line 26 is an early-rejection optimization for clients that do include the header. Add a comment explaining this two-layer defense:

// Early rejection: if Content-Length is present and exceeds the limit, reject
// immediately without reading the body. For requests without Content-Length
// (chunked, HTTP/2), the Limited body wrapper below enforces the limit during
// streaming.

W11. Health Check Port Conflict Check Is Incomplete

File: src/config/validation.rs:169-186

Problem: The validation checks if health_check_port conflicts with any listener's https_port or http_port, but it doesn't check whether the health check port's bind address conflicts with a listener on a different bind address. For example, health_check_port = 80 bound to 127.0.0.1 shouldn't conflict with a listener's http_port = 80 bound to 203.0.113.10.

Solution: This is acceptable for Phase 1 since the health check always binds to 127.0.0.1 (hardcoded in src/health.rs:20). Document that health check always binds to localhost, so the conflict check is conservative (warns even when it might not actually conflict). Add a comment in the validation code explaining this.


W12. build_upstream_request Copies All Headers Without Filtering

File: src/proxy/handler.rs:124-128

Problem: build_upstream_request copies all headers from the original request to the upstream request. However, remove_hop_by_hop was already called on the original request's headers at line 59 before build_upstream_request is called, so hop-by-hop headers have already been removed. The proxy headers (X-Real-IP, X-Forwarded-For, X-Forwarded-Proto) were also already injected at line 58. This means the function works correctly, but the code path is spread across multiple locations, making it harder to reason about header lifecycle.

Solution: Consider consolidating header manipulation into a single function or adding inline comments that trace the header lifecycle:

// Header lifecycle:
// 1. inject_proxy_headers() — adds X-Real-IP, X-Forwarded-For, X-Forwarded-Proto
// 2. remove_hop_by_hop() — removes Connection, Keep-Alive, etc.
// 3. build_upstream_request() — copies remaining headers to upstream request
// 4. Response: remove_hop_by_hop() on upstream response headers

Suggestions

S1. http_port Validation Allows 0 but Not Documented

File: src/config/validation.rs:106

Problem: http_port = 0 is treated as "disabled" (not validated as a port number), which is correct, but the validation error message for invalid ports says "must be 0 (disabled) or 1-65535" while HttpPortInvalid only checks http_port > 0 && http_port < 1 implicitly. The actual check is http_port > 0 to enter the conflict check, and port 0 is always allowed for HTTP.

Solution: Add explicit validation for http_port being in the range 0 or 1-65535 (currently it only validates conflicts, not the port range itself for http). Add a HttpPortInvalid check similar to HttpsPortInvalid:

if listener.http_port > 65535 {
    errors.push(ValidationError::HttpPortInvalid { ... });
}

S2. Consider Using #[non_exhaustive] on Public Enums

Files: src/tls/acceptor.rs:49 (TlsMode), src/proxy/error.rs:5 (ProxyError), src/admin/socket.rs:15 (AdminSocketError), src/config/validation.rs:10 (ValidationError)

Problem: These public enums can be matched exhaustively by downstream consumers. Adding a new variant would be a breaking change.

Solution: Add #[non_exhaustive] to these enums to allow future expansion without breaking changes. This is especially important for TlsMode (which may gain a "letsencrypt" or "auto" mode) and ProxyError (which may gain UpstreamTls error handling).


S3. normalize_host in dynamic_config.rs Doesn't Handle Edge Cases

File: src/config/dynamic_config.rs:52-55

Problem: normalize_host uses split(':').next() to strip ports, but this fails for IPv6 addresses in brackets (e.g., [::1]:443 would normalize to [::1] instead of ::1). The strip_port_from_host function in src/tls/redirect.rs:16-28 correctly handles this case.

Solution: Either reuse the strip_port_from_host logic from redirect.rs or add a shared utility function. The IPv6 bracket handling is important for correctness if the proxy ever receives an IPv6 Host header.


S4. Multiple #[allow(dead_code)] Annotations on Public API

Files: src/tls/acceptor.rs:14,33,48,58, src/tls/acme.rs:9,11,15,23,55, src/config/static_config.rs:4,31,44,49,54,56,70,76,86,91

Problem: Many public items are annotated with #[allow(dead_code)], which suggests they're defined but not yet used by the binary crate. This is fine during initial development but should be cleaned up before release.

Solution: Remove #[allow(dead_code)] annotations once all features are wired up (especially after C1 is fixed, since build_acme_challenge_config and TlsMode::Acme.challenge_config will be needed). Run cargo check to identify which items are actually dead code vs. just not yet referenced.


S5. InFlightCounter Could Use Atomic Usize Directly

File: src/server.rs:16-46

Problem: InFlightCounter wraps an AtomicUsize with increment/decrement/is_zero methods, plus a separate InFlightGuard RAII type. This is clean, but the Arc<InFlightCounter> pattern means every connection task clones the Arc, which has a small allocation cost.

Solution: This is a fine pattern for correctness. No change needed — the RAII guard pattern correctly ensures decrement is always called even on panic. Mentioning as a suggestion for awareness, not action.


S6. Add Clone to SiteConfig Derive Is Correct But May Mask Cloning Hot Paths

File: src/config/dynamic_config.rs:70

Problem: SiteConfig derives Clone, which is used in src/proxy/handler.rs:53 where site.clone() clones the site config on every request. For hot paths, this allocates new Strings for host, upstream, and upstream_scheme.

Solution: Consider using Arc<SiteConfig> in the routing table so that lookups return an Arc clone (cheap atomic refcount) instead of a full String clone. This would change routing_table from HashMap<String, SiteConfig> to HashMap<String, Arc<SiteConfig>>. This is a performance optimization for later — not a correctness issue.


S7. HTTPS Client Trusts System Root Certificates Unconditionally

File: src/proxy/handler.rs:153-165

Problem: The root_certs() function loads native certificates and silently skips any that fail to parse (roots.add(cert).ok()). While this matches the spec (review #001, W13), certificate validation failures for upstream TLS connections will produce opaque IO errors.

Solution: This is acceptable for Phase 1 per the spec (ADR-017). Consider logging the number of root certificates loaded for operational visibility:

let cert_count = result.certs.len();
info!(certs_loaded = cert_count, "loaded system root certificates");

S8. Request Timeout Applies to Entire HTTP Exchange, Not Just Response

File: src/proxy/handler.rs:75-79

Problem: tokio::time::timeout(request_timeout, client.request(...)) applies the timeout to the entire HTTP round-trip including response body streaming. For large file downloads or slow upstreams, this means a 60-second timeout kills the response even if the upstream is actively sending data. A more precise timeout would apply only to the connection + first-byte response, then stream the body without a timeout.

Solution: For Phase 1, this is acceptable behavior — the timeout name (upstream_request_timeout_secs) is documented as applying to the full request. Consider splitting into connect-timeout and response-timeout in Phase 2. The SiteConfig already has separate upstream_connect_timeout_secs and upstream_request_timeout_secs fields, but upstream_connect_timeout_secs is unused (see W4).


Summary Statistics

Severity Count Status
Critical 4 Must fix before production
Warning 12 Should fix — correctness and robustness
Suggestion 8 Consider for code quality
  1. C1 (ACME challenge listener) — Without this, ACME cert provisioning is completely broken. This is the highest priority fix.
  2. C2 (ACME contact email) — Without this, production Let's Encrypt registration fails.
  3. C4 (ConfigReloadHandle static config drift) — Every reload will produce stale warnings, confusing operators.
  4. C3 (X-Forwarded-For comment) — Correct behavior, just needs a clarifying comment to prevent future "fixes."
  5. W1 (Shutdown drain) — Can cause connection drops on graceful restart.
  6. W4 (Connect timeout) — Slow upstreams can exhaust the full request timeout budget.
  7. W5 (Health path collision) — Upstream /health endpoints are silently intercepted.
  8. W7 (Admin socket shutdown) — Cannot gracefully stop.
  9. Remaining W and S findings — Fix opportunistically.