Files
reverse-proxy/docs/reviews/003-security-and-bug-review.md

24 KiB
Raw Permalink Blame History

status, last_updated, reviewed_code, reviewer
status last_updated reviewed_code reviewer
draft 2026-06-12
src/main.rs
src/server.rs
src/cli.rs
src/lib.rs
src/shutdown.rs
src/health.rs
src/utils.rs
src/config/static_config.rs
src/config/dynamic_config.rs
src/config/validation.rs
src/config/test_fixtures.rs
src/config/mod.rs
src/proxy/handler.rs
src/proxy/headers.rs
src/proxy/body_limit.rs
src/proxy/error.rs
src/proxy/mod.rs
src/rate_limit/mod.rs
src/rate_limit/bucket.rs
src/admin/socket.rs
src/admin/mod.rs
src/logging/mod.rs
src/logging/format.rs
src/tls/acceptor.rs
src/tls/acme.rs
src/tls/config.rs
src/tls/redirect.rs
src/tls/mod.rs
tests/integration_test.rs
tests/helpers/http_test_helper.rs
tests/helpers/tls_test_helper.rs
tests/helpers/mod.rs
code-reviewer

Security & Bug Review #003

Purpose

Third review of the codebase, focused on security vulnerabilities, logic bugs, and correctness issues that survived the first two reviews. Also flags dead code, code smells, and test gaps.

Severity Definitions

Severity Meaning
Critical Will cause incorrect behavior or security issues in production
Warning Could cause issues under specific conditions or represents a missed edge case
Suggestion Code quality, style, or minor improvement opportunity

Critical Findings

C1. Rate Limiter Uses Client-Supplied X-Forwarded-For for IP Identification

File: src/rate_limit/mod.rs:66-76

Problem: The rate_limit_middleware extracts the client IP by checking the X-Forwarded-For header first, then falling back to ConnectInfo:

let client_ip = req
    .headers()
    .get("x-forwarded-for")   // <-- checked FIRST
    .and_then(|v| v.to_str().ok())
    .and_then(|v| v.split(',').next())
    .and_then(|v| v.trim().parse::<IpAddr>().ok())
    .or_else(|| {
        req.extensions()
            .get::<axum::extract::ConnectInfo<std::net::SocketAddr>>()
            .map(|ci| ci.ip())
    });

The middleware runs before the proxy handler. At that point, inject_proxy_headers has not yet run, so X-Forwarded-For is whatever the client sent — completely untrusted. This creates two attack vectors:

  1. Rate limit bypass: Attacker sends each request with X-Forwarded-For: <random IP>. Every request appears to come from a different IP, evading the per-IP token bucket entirely.
  2. Denial-of-service via IP spoofing: Attacker sends requests with X-Forwarded-For: <victim IP>. The victim's IP gets rate-limited and their legitimate requests receive 429s.

The proxy is the edge — it terminates TLS directly from the internet. The only trustworthy source of client IP is ConnectInfo<SocketAddr>, which ConnectInfoService sets from the TCP peer address before TLS handshake (src/server.rs:171-173).

Solution: Swap the priority: check ConnectInfo first, and only fall back to X-Forwarded-For if ConnectInfo is absent (which shouldn't happen in the current deployment). The X-Forwarded-For header from the client is untrusted at the edge.

let client_ip = req
    .extensions()
    .get::<axum::extract::ConnectInfo<std::net::SocketAddr>>()
    .map(|ci| ci.ip())
    .or_else(|| {
        req.headers()
            .get("x-forwarded-for")
            .and_then(|v| v.to_str().ok())
            .and_then(|v| v.split(',').next())
            .and_then(|v| v.trim().parse::<IpAddr>().ok())
    });

C2. InFlightCounter Never Increments — Drain Logic Is Completely Broken

File: src/server.rs:17-46,73-74

Problem: InFlightCounter has an increment() method that is never called anywhere in the codebase. The InFlightGuard only calls decrement() on drop:

struct InFlightGuard(Arc<InFlightCounter>);

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        self.0.decrement();  // only decrements!
    }
}

The guard is created in serve_https_listener:

tokio::spawn(async move {
    let _guard = InFlightGuard(in_flight.clone());
    // ...
});

Since increment() is never called, count stays at 0. When the first guard drops, fetch_sub(1) on an AtomicUsize with value 0 wraps to usize::MAX. is_zero() checks count == 0, which will never be true again. The drain loop in main.rs:250 will always time out and report usize::MAX remaining connections.

Solution: Call in_flight.increment() before spawning the connection task:

in_flight.increment();
tokio::spawn(async move {
    let _guard = InFlightGuard(in_flight.clone());
    // ...
});

Or fold the increment into InFlightGuard::new():

impl InFlightGuard {
    fn new(counter: Arc<InFlightCounter>) -> Self {
        counter.increment();
        Self(counter)
    }
}

C3. Per-Site upstream_connect_timeout_secs Silently Capped at 5 Seconds

File: src/proxy/handler.rs:86-98,218-224,226-229

Problem: The proxy handler applies a per-site connect timeout via tokio::time::timeout:

let connect_timeout = Duration::from_secs(site.upstream_connect_timeout_secs);
// ...
tokio::time::timeout(connect_timeout, state.http_client.request(upstream_req)).await

But the HTTP connector inside the client has its own hardcoded connect timeout:

pub fn create_http_client() -> Client<HttpConnector, Body> {
    let mut connector = HttpConnector::new();
    connector.set_connect_timeout(Some(Duration::from_secs(5)));  // hardcoded
    // ...
}

The connector's internal timeout takes precedence — if upstream_connect_timeout_secs is set to 10s, the connector still times out at 5s. The tokio::time::timeout wrapper can only make the effective timeout shorter, not longer. Per-site connect timeout values > 5s are silently ignored.

Solution: Set the connector's connect timeout to match (or exceed) the per-site value, or remove the hardcoded connector timeout and rely solely on the tokio::time::timeout wrapper. Since the client is shared across all sites, the simplest fix is to set the connector timeout to a high value (e.g., 30s) and let the per-site tokio::time::timeout enforce the actual limit:

connector.set_connect_timeout(Some(Duration::from_secs(30)));

Or make the connector timeout configurable per-site by creating clients lazily.


C4. init_json Without Log File Doesn't Enable JSON Format

File: src/logging/mod.rs:54-59

Problem: When format = "json" is configured but no log_file_path is set, the None branch of init_json creates a layer without calling .json():

None => {
    let layer = tracing_subscriber::fmt::layer()
        .with_ansi(false)
        .with_filter(env_filter);
    tracing_subscriber::registry().with(layer).try_init()?;
}

The output is plain text, not JSON. The format = "json" config value is silently ignored. The Some(path) branch correctly calls .json() on both layers.

Solution: Add .json() to the None branch:

None => {
    let layer = tracing_subscriber::fmt::layer()
        .json()
        .with_ansi(false)
        .with_filter(env_filter);
    tracing_subscriber::registry().with(layer).try_init()?;
}

Warning Findings

W1. is_valid_upstream Doesn't Validate the Host Part

File: src/config/validation.rs:309-327

Problem: is_valid_upstream checks that the upstream has a host:port format with a valid port number, but performs no validation on the host part beyond checking it's non-empty and doesn't start with http:// or https://. Values like !!!bad!!!:3000 or @#$%:8080 pass validation.

Solution: Validate the host part is a valid DNS name or IP address. Reuse is_valid_hostname for DNS names and add an IP address parse check for the host portion (handling IPv6 bracket notation):

fn is_valid_upstream(upstream: &str) -> bool {
    if let Some(idx) = upstream.rfind(':') {
        let host_part = &upstream[..idx];
        let port_str = &upstream[idx + 1..];
        if host_part.is_empty() { return false; }
        if upstream.starts_with("http://") || upstream.starts_with("https://") { return false; }
        let port: u16 = match port_str.parse() { Ok(p) => p, Err(_) => return false };
        if port == 0 { return false; }
        // Validate host part
        if host_part.starts_with('[') && host_part.ends_with(']') {
            let inner = &host_part[1..host_part.len()-1];
            inner.parse::<std::net::Ipv6Addr>().is_ok()
        } else {
            host_part.parse::<std::net::IpAddr>().is_ok() || is_valid_hostname(host_part)
        }
    } else {
        false
    }
}

W2. ACME Contact Validation Only Checks mailto: Prefix

File: src/config/validation.rs:149

Problem: The validation checks contact.starts_with("mailto:") but doesn't verify there's an actual email address after the prefix. acme_contact = "mailto:" (empty email) passes validation but will fail at the Let's Encrypt API with a 400-level error at certificate provisioning time — after the proxy has already started.

Solution: Validate that the string after mailto: is non-empty and contains an @ sign:

if contact.is_empty() || !contact.starts_with("mailto:") {
    errors.push(ValidationError::AcmeContactInvalid { ... });
} else {
    let email = &contact[7..]; // after "mailto:"
    if email.is_empty() || !email.contains('@') {
        errors.push(ValidationError::AcmeContactInvalid { ... });
    }
}

W3. build_upstream_uri Silently Drops Query String on Parse Failure

File: src/proxy/handler.rs:190-202

Problem: If the full upstream URI (with query string) fails to parse, the fallback drops the query string entirely and .unwrap()s the result:

uri_string.parse::<Uri>().unwrap_or_else(|_| {
    format!("{}://{}{}", scheme, upstream, path)  // query dropped!
        .parse::<Uri>()
        .unwrap()  // panics if this also fails
})

This silently corrupts requests — the upstream receives the wrong URL with no query parameters, and neither the client nor operator is notified. The .unwrap() on the fallback parse could also panic (though unlikely since scheme://host:port/path should always parse).

Solution: Log a warning and return a 502 Bad Gateway instead of silently dropping data:

fn build_upstream_uri(scheme: &str, upstream: &str, original_uri: &Uri) -> Result<Uri, ()> {
    let path = original_uri.path();
    let query = original_uri.query().map(|q| format!("?{}", q)).unwrap_or_default();
    let uri_string = format!("{}://{}{}{}", scheme, upstream, path, query);
    uri_string.parse::<Uri>().map_err(|e| {
        warn!(error = %e, uri = %uri_string, "failed to parse upstream URI");
    })
}

W4. Admin Socket Has No Read Timeout or Line Length Limit

File: src/admin/socket.rs:166-210

Problem: handle_connection reads one newline-terminated line with reader.read_line(&mut line) but sets no timeout and no length limit. A malicious client with access to the admin socket (Unix permissions permitting) can:

  • Connect and send no data, holding a connection and task open indefinitely
  • Send gigabytes of data without a newline, causing unbounded memory allocation

Solution: Wrap the read in tokio::time::timeout and use read_until with a reasonable limit, or use take to cap the stream:

let mut reader = BufReader::new(tokio::io::take(reader, 4096)); // 4KB limit
let read_result = tokio::time::timeout(
    Duration::from_secs(5),
    reader.read_line(&mut line),
).await;

W5. TlsMode Wildcard Match Could Cause Listener/Acceptor Mismatch

File: src/main.rs:170-194,209-210

Problem: The match tls_mode has a wildcard _ arm that logs a warning and pushes no acceptor. Then bound_listeners.into_iter().zip(tls_acceptors.into_iter()) uses zip, which silently stops at the shorter iterator. If the wildcard arm were ever reached, some listeners would have no TLS acceptor and would be silently dropped with no error.

setup_tls already rejects unknown modes with bail!, so the wildcard is unreachable in practice. But it's a latent bug waiting for a future refactor.

Solution: Either remove the wildcard arm (since TlsMode only has two variants and setup_tls already validates), or make the mismatch explicit:

if bound_listeners.len() != tls_acceptors.len() {
    anyhow::bail!("listener/acceptor count mismatch");
}

W6. RawConfig and FullConfig Are Duplicated

Files: src/cli.rs:49-65, src/config/mod.rs:15-31

Problem: RawConfig (used at startup in cli.rs) and FullConfig (used at reload in config/mod.rs) have identical fields and identical serde attributes. They exist because the initial load path manually constructs StaticConfig + SerializableDynamicConfig, while the reload path uses FullConfig::into_static_and_dynamic(). This duplication means any new config field must be added in two places.

Solution: Delete RawConfig and use FullConfig in load_config:

let full: FullConfig = toml::from_str(&config_content)?;
let (static_config, dynamic_config) = full.into_static_and_dynamic();

Then collect_sites can also be removed since into_static_and_dynamic already collects sites from all listeners.


W7. cleanup_stale_socket Connects to Verify Liveness — Side Effect on Other Process

File: src/admin/socket.rs:142-160,162-164

Problem: is_socket_active checks whether another process owns the socket by connecting to it:

async fn is_socket_active(path: &str) -> bool {
    tokio::net::UnixStream::connect(path).await.is_ok()
}

If another process is listening, this creates a real connection that the other process will accept(), read nothing, and close. This is a harmless but unnecessary side effect on the other process.

Solution: Check for liveness by inspecting /proc/net/unix or by attempting a zero-byte sendmsg with MSG_PEEK. For Phase 1, the current approach is acceptable since the admin socket is a local diagnostic interface.


W8. test_health_check_disabled_when_port_zero Test Name Is Misleading

File: tests/integration_test.rs:82-88

Problem: The test is named test_health_check_disabled_when_port_zero but port 0 means "OS picks a random port" — the health check listener still starts and binds successfully. The actual "disabled" logic is in main.rs:94 where health_check_port > 0 gates the call to start_health_check_listener. The test verifies that start_health_check_listener(0) works (binds to a random port), which is correct but the name implies it tests the disable path.

Solution: Rename to test_health_check_binds_random_port_when_zero or add a separate test that verifies the main.rs gating logic.


W9. test_dynamic_config_with_limit Has Empty routing_table

File: tests/integration_test.rs:618-635

Problem: The helper constructs DynamicConfig directly with routing_table: Default::default() (empty HashMap), bypassing DynamicConfig::from_sites() which would normally build the routing table. The body limit tests only read body.limit_bytes so the empty table doesn't matter, but it's a latent trap for anyone who copies this pattern for tests that need routing.

Solution: Use DynamicConfig::from_sites() to construct the config, or at minimum add a comment warning that the routing table is intentionally empty.


W10. TokenBucket Fields Are pub Despite Internal-Only Use

File: src/rate_limit/bucket.rs:4-9

Problem: All TokenBucket fields are pub:

pub struct TokenBucket {
    pub tokens: f64,
    pub last_refill: Instant,
    pub rate: f64,
    pub max: u32,
    pub last_access: Instant,
}

Only last_access is read externally (by evict_stale). The other fields should be private to prevent accidental direct mutation that bypasses try_consume/refill logic.

Solution: Make tokens, last_refill, rate, and max private. Keep last_access pub(crate) for evict_stale.


W11. Admin Socket Exposes reload_mutex for Testing

File: src/admin/socket.rs:71-73

Problem: AdminSocket::reload_mutex() is a public method that exists solely for the test_reload_serialized_with_mutex test. It exposes an internal synchronization primitive, and the test acquires the mutex before sending a reload command to verify serialization — coupling the test to implementation details.

Solution: Remove the method and test serialization through observable behavior (e.g., send two rapid reload requests and verify the final config state reflects the last one), or make the method #[cfg(test)]-gated.


W12. http_port Type Is u32 While https_port Is u16

File: src/config/static_config.rs:34,36

Problem: http_port is declared as u32 but https_port is u16. Both represent TCP port numbers (valid range 165535). The type inconsistency means comparisons require casting (listener.http_port == listener.https_port as u32 at validation.rs:133) and http_port could theoretically hold values > 65535 that are caught by validation rather than the type system.

Solution: Change http_port to u16 for consistency. Port 0 (disabled) fits in u16. Update all as u32 casts in validation and main.rs.


Suggestions

S1. Remove Dead Code Before Release

Files: Multiple

The following items are defined but never called in production code:

Item File Note
log_rate_limit! macro src/logging/format.rs:72-83 Rate limiter uses warn! directly
log_config_reload! macro src/logging/format.rs:97-106 Reload uses info!/warn! directly
format_event_fields() src/logging/format.rs:50-54 Never called
ProxyError::BadGateway (struct variant) src/proxy/error.rs:8 Code uses UpstreamConnection instead
ProxyError::GatewayTimeout (struct variant) src/proxy/error.rs:10 Code uses UpstreamTimeout instead
ProxyError::PayloadTooLarge src/proxy/error.rs:12 Body limit returns tuple directly
ProxyError::NotFound src/proxy/error.rs:20 Code uses UnknownHost instead
ProxyError::BadRequest src/proxy/error.rs:22 Code uses MissingHost instead
ProxyError::UpstreamTls src/proxy/error.rs:28 Never constructed
build_multi_domain_server_config() src/tls/config.rs:78-102 Never called
SniCertResolver src/tls/config.rs:104-126 Only used by above
AcmeTlsConfig::directory_url() src/tls/acme.rs:53-59 Only used in tests
TestUpstream::url() tests/helpers/http_test_helper.rs:39-41 #[allow(dead_code)]
TestUpstream::upstream_addr() tests/helpers/http_test_helper.rs:43-46 #[allow(dead_code)]

Either wire these up or remove them. Dead code increases audit surface and creates confusion about what's actually active.


S2. Make Eviction Interval and Max Age Configurable

File: src/main.rs:196-201

Problem: The eviction task interval (60s) and max age (300s) are hardcoded. In high-traffic deployments, a shorter interval or longer max age might be desirable.

Suggestion: Add rate_limit.eviction_interval_secs and rate_limit.bucket_max_age_secs to RateLimitConfig with defaults of 60 and 300. These are dynamic config fields so they can be hot-reloaded.


S3. Log Root Certificate Count at Startup

File: src/proxy/handler.rs:246-258

Problem: root_certs() loads native certificates silently (only logs errors). If the system has zero root certificates (misconfigured CA bundle), all HTTPS upstream connections will fail with opaque TLS errors.

Suggestion: Log the number of loaded certificates at info level:

let cert_count = result.certs.len();
info!(certs_loaded = cert_count, errors = result.errors.len(), "loaded system root certificates");
if cert_count == 0 {
    warn!("no system root certificates loaded — HTTPS upstream connections will fail");
}

S4. Add Read Timeout and Line Length Limit to Admin Socket

File: src/admin/socket.rs:166-210

See W4 for the problem. Even if the admin socket is only accessible via Unix permissions, defense-in-depth warrants basic resource limits.


S5. Consolidate RawConfig and FullConfig

Files: src/cli.rs:49-65, src/config/mod.rs:15-31

See W6. Using a single FullConfig type for both startup and reload eliminates duplication and ensures both paths stay in sync.


S6. Add #[non_exhaustive] to TokenBucket (or Make Fields Private)

File: src/rate_limit/bucket.rs:4-9

See W10. Making fields private prevents accidental direct mutation. If external crates need to construct TokenBucket, add a new() constructor (which already exists).


S7. Add More Status Info to Admin Socket status Command

File: src/admin/socket.rs:265-275

Suggestion: The status response currently returns uptime_secs and sites. Consider adding:

  • rate_limit (requests_per_second, burst)
  • body_limit_bytes
  • listeners count
  • in_flight_requests (if InFlightCounter is fixed per C2)

This gives operators a quick health snapshot without reading logs.


S8. Consider Making upstream_connect_timeout_secs Actually Work

File: src/proxy/handler.rs:216-229

See C3. The hardcoded 5s connector timeout caps the per-site connect timeout. Either remove the connector timeout or set it to a high ceiling value.


S9. acme_contact Config Field Should Be Vec<String> for Multiple Contacts

File: src/config/static_config.rs:60

Problem: acme_contact is a single String but ACME supports multiple contact emails. The AcmeTlsConfig.contact field is already Vec<String> and the config value is wrapped in vec![...] at src/tls/acceptor.rs:70.

Suggestion: Change acme_contact to Vec<String> in TlsConfig for consistency with the ACME protocol. This is a config format change — document the migration.


S10. Add Integration Test for Rate Limiter Using ConnectInfo

File: tests/integration_test.rs:90-163

Problem: All rate limit integration tests pass the client IP via X-Forwarded-For header. No test verifies the ConnectInfo extraction path, which is the primary path after C1 is fixed.

Suggestion: Add a test that sets ConnectInfo on the request extensions (like make_request_with_connect_info in src/proxy/handler.rs tests) and verifies rate limiting works without the X-Forwarded-For header.


Summary Statistics

Severity Count Status
Critical 4 Must fix before production
Warning 12 Should fix — correctness and robustness
Suggestion 10 Consider for code quality
  1. C1 (rate limiter X-Forwarded-For spoofing) — Active security vulnerability. Any external attacker can bypass rate limiting entirely.
  2. C2 (InFlightCounter never increments) — Graceful shutdown drain logic is completely non-functional. The drain always times out.
  3. C3 (connect timeout capped at 5s) — Per-site timeout configuration is silently ignored for values > 5s.
  4. C4 (JSON format not applied without log file) — Format config is silently ignored in a common deployment mode (stdout-only JSON logging).
  5. W1 (upstream host validation gap) — Invalid upstream addresses pass config validation.
  6. W2 (ACME contact validation gap) — Empty email after mailto: passes validation, fails at runtime.
  7. W3 (query string silently dropped) — Data corruption on URI parse failure.
  8. W4 (admin socket resource limits) — DoS vector for local socket access.
  9. W5 (TlsMode wildcard mismatch) — Latent bug for future refactors.
  10. W6 (RawConfig/FullConfig duplication) — Maintenance burden.
  11. Remaining W and S findings — Fix opportunistically.