Files
reverse-proxy/docs/reviews/003-security-and-bug-review.md

706 lines
24 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
---
status: draft
last_updated: 2026-06-12
reviewed_code:
- src/main.rs
- src/server.rs
- src/cli.rs
- src/lib.rs
- src/shutdown.rs
- src/health.rs
- src/utils.rs
- src/config/static_config.rs
- src/config/dynamic_config.rs
- src/config/validation.rs
- src/config/test_fixtures.rs
- src/config/mod.rs
- src/proxy/handler.rs
- src/proxy/headers.rs
- src/proxy/body_limit.rs
- src/proxy/error.rs
- src/proxy/mod.rs
- src/rate_limit/mod.rs
- src/rate_limit/bucket.rs
- src/admin/socket.rs
- src/admin/mod.rs
- src/logging/mod.rs
- src/logging/format.rs
- src/tls/acceptor.rs
- src/tls/acme.rs
- src/tls/config.rs
- src/tls/redirect.rs
- src/tls/mod.rs
- tests/integration_test.rs
- tests/helpers/http_test_helper.rs
- tests/helpers/tls_test_helper.rs
- tests/helpers/mod.rs
reviewer: code-reviewer
---
# Security & Bug Review #003
## Purpose
Third review of the codebase, focused on security vulnerabilities, logic bugs,
and correctness issues that survived the first two reviews. Also flags dead code,
code smells, and test gaps.
## Severity Definitions
| Severity | Meaning |
|----------|---------|
| **Critical** | Will cause incorrect behavior or security issues in production |
| **Warning** | Could cause issues under specific conditions or represents a missed edge case |
| **Suggestion** | Code quality, style, or minor improvement opportunity |
---
## Critical Findings
### C1. Rate Limiter Uses Client-Supplied X-Forwarded-For for IP Identification
**File**: `src/rate_limit/mod.rs:66-76`
**Problem**: The `rate_limit_middleware` extracts the client IP by checking the
`X-Forwarded-For` header **first**, then falling back to `ConnectInfo`:
```rust
let client_ip = req
.headers()
.get("x-forwarded-for") // <-- checked FIRST
.and_then(|v| v.to_str().ok())
.and_then(|v| v.split(',').next())
.and_then(|v| v.trim().parse::<IpAddr>().ok())
.or_else(|| {
req.extensions()
.get::<axum::extract::ConnectInfo<std::net::SocketAddr>>()
.map(|ci| ci.ip())
});
```
The middleware runs **before** the proxy handler. At that point,
`inject_proxy_headers` has not yet run, so `X-Forwarded-For` is whatever the
**client** sent — completely untrusted. This creates two attack vectors:
1. **Rate limit bypass**: Attacker sends each request with
`X-Forwarded-For: <random IP>`. Every request appears to come from a
different IP, evading the per-IP token bucket entirely.
2. **Denial-of-service via IP spoofing**: Attacker sends requests with
`X-Forwarded-For: <victim IP>`. The victim's IP gets rate-limited and their
legitimate requests receive 429s.
The proxy is the **edge** — it terminates TLS directly from the internet. The
only trustworthy source of client IP is `ConnectInfo<SocketAddr>`, which
`ConnectInfoService` sets from the TCP peer address before TLS handshake
(`src/server.rs:171-173`).
**Solution**: Swap the priority: check `ConnectInfo` first, and only fall back
to `X-Forwarded-For` if `ConnectInfo` is absent (which shouldn't happen in the
current deployment). The `X-Forwarded-For` header from the client is untrusted
at the edge.
```rust
let client_ip = req
.extensions()
.get::<axum::extract::ConnectInfo<std::net::SocketAddr>>()
.map(|ci| ci.ip())
.or_else(|| {
req.headers()
.get("x-forwarded-for")
.and_then(|v| v.to_str().ok())
.and_then(|v| v.split(',').next())
.and_then(|v| v.trim().parse::<IpAddr>().ok())
});
```
---
### C2. InFlightCounter Never Increments — Drain Logic Is Completely Broken
**File**: `src/server.rs:17-46,73-74`
**Problem**: `InFlightCounter` has an `increment()` method that is **never
called** anywhere in the codebase. The `InFlightGuard` only calls `decrement()`
on drop:
```rust
struct InFlightGuard(Arc<InFlightCounter>);
impl Drop for InFlightGuard {
fn drop(&mut self) {
self.0.decrement(); // only decrements!
}
}
```
The guard is created in `serve_https_listener`:
```rust
tokio::spawn(async move {
let _guard = InFlightGuard(in_flight.clone());
// ...
});
```
Since `increment()` is never called, `count` stays at 0. When the first guard
drops, `fetch_sub(1)` on an `AtomicUsize` with value 0 wraps to `usize::MAX`.
`is_zero()` checks `count == 0`, which will never be true again. The drain loop
in `main.rs:250` will always time out and report `usize::MAX` remaining
connections.
**Solution**: Call `in_flight.increment()` before spawning the connection task:
```rust
in_flight.increment();
tokio::spawn(async move {
let _guard = InFlightGuard(in_flight.clone());
// ...
});
```
Or fold the increment into `InFlightGuard::new()`:
```rust
impl InFlightGuard {
fn new(counter: Arc<InFlightCounter>) -> Self {
counter.increment();
Self(counter)
}
}
```
---
### C3. Per-Site `upstream_connect_timeout_secs` Silently Capped at 5 Seconds
**File**: `src/proxy/handler.rs:86-98,218-224,226-229`
**Problem**: The proxy handler applies a per-site connect timeout via
`tokio::time::timeout`:
```rust
let connect_timeout = Duration::from_secs(site.upstream_connect_timeout_secs);
// ...
tokio::time::timeout(connect_timeout, state.http_client.request(upstream_req)).await
```
But the HTTP connector inside the client has its own hardcoded connect timeout:
```rust
pub fn create_http_client() -> Client<HttpConnector, Body> {
let mut connector = HttpConnector::new();
connector.set_connect_timeout(Some(Duration::from_secs(5))); // hardcoded
// ...
}
```
The connector's internal timeout takes precedence — if
`upstream_connect_timeout_secs` is set to 10s, the connector still times out at
5s. The `tokio::time::timeout` wrapper can only make the effective timeout
**shorter**, not longer. Per-site connect timeout values > 5s are silently
ignored.
**Solution**: Set the connector's connect timeout to match (or exceed) the
per-site value, or remove the hardcoded connector timeout and rely solely on
the `tokio::time::timeout` wrapper. Since the client is shared across all sites,
the simplest fix is to set the connector timeout to a high value (e.g., 30s)
and let the per-site `tokio::time::timeout` enforce the actual limit:
```rust
connector.set_connect_timeout(Some(Duration::from_secs(30)));
```
Or make the connector timeout configurable per-site by creating clients lazily.
---
### C4. `init_json` Without Log File Doesn't Enable JSON Format
**File**: `src/logging/mod.rs:54-59`
**Problem**: When `format = "json"` is configured but no `log_file_path` is set,
the `None` branch of `init_json` creates a layer **without** calling `.json()`:
```rust
None => {
let layer = tracing_subscriber::fmt::layer()
.with_ansi(false)
.with_filter(env_filter);
tracing_subscriber::registry().with(layer).try_init()?;
}
```
The output is plain text, not JSON. The `format = "json"` config value is
silently ignored. The `Some(path)` branch correctly calls `.json()` on both
layers.
**Solution**: Add `.json()` to the `None` branch:
```rust
None => {
let layer = tracing_subscriber::fmt::layer()
.json()
.with_ansi(false)
.with_filter(env_filter);
tracing_subscriber::registry().with(layer).try_init()?;
}
```
---
## Warning Findings
### W1. `is_valid_upstream` Doesn't Validate the Host Part
**File**: `src/config/validation.rs:309-327`
**Problem**: `is_valid_upstream` checks that the upstream has a `host:port`
format with a valid port number, but performs **no validation on the host
part** beyond checking it's non-empty and doesn't start with `http://` or
`https://`. Values like `!!!bad!!!:3000` or `@#$%:8080` pass validation.
**Solution**: Validate the host part is a valid DNS name or IP address. Reuse
`is_valid_hostname` for DNS names and add an IP address parse check for the
host portion (handling IPv6 bracket notation):
```rust
fn is_valid_upstream(upstream: &str) -> bool {
if let Some(idx) = upstream.rfind(':') {
let host_part = &upstream[..idx];
let port_str = &upstream[idx + 1..];
if host_part.is_empty() { return false; }
if upstream.starts_with("http://") || upstream.starts_with("https://") { return false; }
let port: u16 = match port_str.parse() { Ok(p) => p, Err(_) => return false };
if port == 0 { return false; }
// Validate host part
if host_part.starts_with('[') && host_part.ends_with(']') {
let inner = &host_part[1..host_part.len()-1];
inner.parse::<std::net::Ipv6Addr>().is_ok()
} else {
host_part.parse::<std::net::IpAddr>().is_ok() || is_valid_hostname(host_part)
}
} else {
false
}
}
```
---
### W2. ACME Contact Validation Only Checks `mailto:` Prefix
**File**: `src/config/validation.rs:149`
**Problem**: The validation checks `contact.starts_with("mailto:")` but doesn't
verify there's an actual email address after the prefix. `acme_contact =
"mailto:"` (empty email) passes validation but will fail at the Let's Encrypt
API with a 400-level error at certificate provisioning time — after the proxy
has already started.
**Solution**: Validate that the string after `mailto:` is non-empty and contains
an `@` sign:
```rust
if contact.is_empty() || !contact.starts_with("mailto:") {
errors.push(ValidationError::AcmeContactInvalid { ... });
} else {
let email = &contact[7..]; // after "mailto:"
if email.is_empty() || !email.contains('@') {
errors.push(ValidationError::AcmeContactInvalid { ... });
}
}
```
---
### W3. `build_upstream_uri` Silently Drops Query String on Parse Failure
**File**: `src/proxy/handler.rs:190-202`
**Problem**: If the full upstream URI (with query string) fails to parse, the
fallback drops the query string entirely and `.unwrap()`s the result:
```rust
uri_string.parse::<Uri>().unwrap_or_else(|_| {
format!("{}://{}{}", scheme, upstream, path) // query dropped!
.parse::<Uri>()
.unwrap() // panics if this also fails
})
```
This silently corrupts requests — the upstream receives the wrong URL with no
query parameters, and neither the client nor operator is notified. The
`.unwrap()` on the fallback parse could also panic (though unlikely since
`scheme://host:port/path` should always parse).
**Solution**: Log a warning and return a 502 Bad Gateway instead of silently
dropping data:
```rust
fn build_upstream_uri(scheme: &str, upstream: &str, original_uri: &Uri) -> Result<Uri, ()> {
let path = original_uri.path();
let query = original_uri.query().map(|q| format!("?{}", q)).unwrap_or_default();
let uri_string = format!("{}://{}{}{}", scheme, upstream, path, query);
uri_string.parse::<Uri>().map_err(|e| {
warn!(error = %e, uri = %uri_string, "failed to parse upstream URI");
})
}
```
---
### W4. Admin Socket Has No Read Timeout or Line Length Limit
**File**: `src/admin/socket.rs:166-210`
**Problem**: `handle_connection` reads one newline-terminated line with
`reader.read_line(&mut line)` but sets no timeout and no length limit. A
malicious client with access to the admin socket (Unix permissions permitting)
can:
- Connect and send no data, holding a connection and task open indefinitely
- Send gigabytes of data without a newline, causing unbounded memory allocation
**Solution**: Wrap the read in `tokio::time::timeout` and use
`read_until` with a reasonable limit, or use `take` to cap the stream:
```rust
let mut reader = BufReader::new(tokio::io::take(reader, 4096)); // 4KB limit
let read_result = tokio::time::timeout(
Duration::from_secs(5),
reader.read_line(&mut line),
).await;
```
---
### W5. `TlsMode` Wildcard Match Could Cause Listener/Acceptor Mismatch
**File**: `src/main.rs:170-194,209-210`
**Problem**: The `match tls_mode` has a wildcard `_` arm that logs a warning
and pushes **no** acceptor. Then `bound_listeners.into_iter().zip(tls_acceptors.into_iter())`
uses `zip`, which silently stops at the shorter iterator. If the wildcard arm
were ever reached, some listeners would have no TLS acceptor and would be
silently dropped with no error.
`setup_tls` already rejects unknown modes with `bail!`, so the wildcard is
unreachable in practice. But it's a latent bug waiting for a future refactor.
**Solution**: Either remove the wildcard arm (since `TlsMode` only has two
variants and `setup_tls` already validates), or make the mismatch explicit:
```rust
if bound_listeners.len() != tls_acceptors.len() {
anyhow::bail!("listener/acceptor count mismatch");
}
```
---
### W6. `RawConfig` and `FullConfig` Are Duplicated
**Files**: `src/cli.rs:49-65`, `src/config/mod.rs:15-31`
**Problem**: `RawConfig` (used at startup in `cli.rs`) and `FullConfig` (used
at reload in `config/mod.rs`) have identical fields and identical serde
attributes. They exist because the initial load path manually constructs
`StaticConfig` + `SerializableDynamicConfig`, while the reload path uses
`FullConfig::into_static_and_dynamic()`. This duplication means any new config
field must be added in two places.
**Solution**: Delete `RawConfig` and use `FullConfig` in `load_config`:
```rust
let full: FullConfig = toml::from_str(&config_content)?;
let (static_config, dynamic_config) = full.into_static_and_dynamic();
```
Then `collect_sites` can also be removed since `into_static_and_dynamic`
already collects sites from all listeners.
---
### W7. `cleanup_stale_socket` Connects to Verify Liveness — Side Effect on Other Process
**File**: `src/admin/socket.rs:142-160,162-164`
**Problem**: `is_socket_active` checks whether another process owns the socket
by **connecting to it**:
```rust
async fn is_socket_active(path: &str) -> bool {
tokio::net::UnixStream::connect(path).await.is_ok()
}
```
If another process is listening, this creates a real connection that the other
process will `accept()`, read nothing, and close. This is a harmless but
unnecessary side effect on the other process.
**Solution**: Check for liveness by inspecting `/proc/net/unix` or by attempting
a zero-byte `sendmsg` with `MSG_PEEK`. For Phase 1, the current approach is
acceptable since the admin socket is a local diagnostic interface.
---
### W8. `test_health_check_disabled_when_port_zero` Test Name Is Misleading
**File**: `tests/integration_test.rs:82-88`
**Problem**: The test is named `test_health_check_disabled_when_port_zero` but
port 0 means "OS picks a random port" — the health check listener still starts
and binds successfully. The actual "disabled" logic is in `main.rs:94` where
`health_check_port > 0` gates the call to `start_health_check_listener`. The
test verifies that `start_health_check_listener(0)` works (binds to a random
port), which is correct but the name implies it tests the disable path.
**Solution**: Rename to `test_health_check_binds_random_port_when_zero` or add
a separate test that verifies the `main.rs` gating logic.
---
### W9. `test_dynamic_config_with_limit` Has Empty `routing_table`
**File**: `tests/integration_test.rs:618-635`
**Problem**: The helper constructs `DynamicConfig` directly with
`routing_table: Default::default()` (empty HashMap), bypassing
`DynamicConfig::from_sites()` which would normally build the routing table.
The body limit tests only read `body.limit_bytes` so the empty table doesn't
matter, but it's a latent trap for anyone who copies this pattern for tests
that need routing.
**Solution**: Use `DynamicConfig::from_sites()` to construct the config, or at
minimum add a comment warning that the routing table is intentionally empty.
---
### W10. `TokenBucket` Fields Are `pub` Despite Internal-Only Use
**File**: `src/rate_limit/bucket.rs:4-9`
**Problem**: All `TokenBucket` fields are `pub`:
```rust
pub struct TokenBucket {
pub tokens: f64,
pub last_refill: Instant,
pub rate: f64,
pub max: u32,
pub last_access: Instant,
}
```
Only `last_access` is read externally (by `evict_stale`). The other fields
should be private to prevent accidental direct mutation that bypasses
`try_consume`/`refill` logic.
**Solution**: Make `tokens`, `last_refill`, `rate`, and `max` private. Keep
`last_access` pub(crate) for `evict_stale`.
---
### W11. Admin Socket Exposes `reload_mutex` for Testing
**File**: `src/admin/socket.rs:71-73`
**Problem**: `AdminSocket::reload_mutex()` is a public method that exists solely
for the `test_reload_serialized_with_mutex` test. It exposes an internal
synchronization primitive, and the test acquires the mutex before sending a
reload command to verify serialization — coupling the test to implementation
details.
**Solution**: Remove the method and test serialization through observable
behavior (e.g., send two rapid reload requests and verify the final config
state reflects the last one), or make the method `#[cfg(test)]`-gated.
---
### W12. `http_port` Type Is `u32` While `https_port` Is `u16`
**File**: `src/config/static_config.rs:34,36`
**Problem**: `http_port` is declared as `u32` but `https_port` is `u16`. Both
represent TCP port numbers (valid range 165535). The type inconsistency means
comparisons require casting (`listener.http_port == listener.https_port as u32`
at `validation.rs:133`) and `http_port` could theoretically hold values >
65535 that are caught by validation rather than the type system.
**Solution**: Change `http_port` to `u16` for consistency. Port 0 (disabled)
fits in `u16`. Update all `as u32` casts in validation and `main.rs`.
---
## Suggestions
### S1. Remove Dead Code Before Release
**Files**: Multiple
The following items are defined but never called in production code:
| Item | File | Note |
|------|------|------|
| `log_rate_limit!` macro | `src/logging/format.rs:72-83` | Rate limiter uses `warn!` directly |
| `log_config_reload!` macro | `src/logging/format.rs:97-106` | Reload uses `info!`/`warn!` directly |
| `format_event_fields()` | `src/logging/format.rs:50-54` | Never called |
| `ProxyError::BadGateway` (struct variant) | `src/proxy/error.rs:8` | Code uses `UpstreamConnection` instead |
| `ProxyError::GatewayTimeout` (struct variant) | `src/proxy/error.rs:10` | Code uses `UpstreamTimeout` instead |
| `ProxyError::PayloadTooLarge` | `src/proxy/error.rs:12` | Body limit returns tuple directly |
| `ProxyError::NotFound` | `src/proxy/error.rs:20` | Code uses `UnknownHost` instead |
| `ProxyError::BadRequest` | `src/proxy/error.rs:22` | Code uses `MissingHost` instead |
| `ProxyError::UpstreamTls` | `src/proxy/error.rs:28` | Never constructed |
| `build_multi_domain_server_config()` | `src/tls/config.rs:78-102` | Never called |
| `SniCertResolver` | `src/tls/config.rs:104-126` | Only used by above |
| `AcmeTlsConfig::directory_url()` | `src/tls/acme.rs:53-59` | Only used in tests |
| `TestUpstream::url()` | `tests/helpers/http_test_helper.rs:39-41` | `#[allow(dead_code)]` |
| `TestUpstream::upstream_addr()` | `tests/helpers/http_test_helper.rs:43-46` | `#[allow(dead_code)]` |
Either wire these up or remove them. Dead code increases audit surface and
creates confusion about what's actually active.
---
### S2. Make Eviction Interval and Max Age Configurable
**File**: `src/main.rs:196-201`
**Problem**: The eviction task interval (60s) and max age (300s) are hardcoded.
In high-traffic deployments, a shorter interval or longer max age might be
desirable.
**Suggestion**: Add `rate_limit.eviction_interval_secs` and
`rate_limit.bucket_max_age_secs` to `RateLimitConfig` with defaults of 60 and
300. These are dynamic config fields so they can be hot-reloaded.
---
### S3. Log Root Certificate Count at Startup
**File**: `src/proxy/handler.rs:246-258`
**Problem**: `root_certs()` loads native certificates silently (only logs
errors). If the system has zero root certificates (misconfigured CA bundle),
all HTTPS upstream connections will fail with opaque TLS errors.
**Suggestion**: Log the number of loaded certificates at info level:
```rust
let cert_count = result.certs.len();
info!(certs_loaded = cert_count, errors = result.errors.len(), "loaded system root certificates");
if cert_count == 0 {
warn!("no system root certificates loaded — HTTPS upstream connections will fail");
}
```
---
### S4. Add Read Timeout and Line Length Limit to Admin Socket
**File**: `src/admin/socket.rs:166-210`
See W4 for the problem. Even if the admin socket is only accessible via Unix
permissions, defense-in-depth warrants basic resource limits.
---
### S5. Consolidate `RawConfig` and `FullConfig`
**Files**: `src/cli.rs:49-65`, `src/config/mod.rs:15-31`
See W6. Using a single `FullConfig` type for both startup and reload eliminates
duplication and ensures both paths stay in sync.
---
### S6. Add `#[non_exhaustive]` to `TokenBucket` (or Make Fields Private)
**File**: `src/rate_limit/bucket.rs:4-9`
See W10. Making fields private prevents accidental direct mutation. If external
crates need to construct `TokenBucket`, add a `new()` constructor (which
already exists).
---
### S7. Add More Status Info to Admin Socket `status` Command
**File**: `src/admin/socket.rs:265-275`
**Suggestion**: The `status` response currently returns `uptime_secs` and
`sites`. Consider adding:
- `rate_limit` (requests_per_second, burst)
- `body_limit_bytes`
- `listeners` count
- `in_flight_requests` (if InFlightCounter is fixed per C2)
This gives operators a quick health snapshot without reading logs.
---
### S8. Consider Making `upstream_connect_timeout_secs` Actually Work
**File**: `src/proxy/handler.rs:216-229`
See C3. The hardcoded 5s connector timeout caps the per-site connect timeout.
Either remove the connector timeout or set it to a high ceiling value.
---
### S9. `acme_contact` Config Field Should Be `Vec<String>` for Multiple Contacts
**File**: `src/config/static_config.rs:60`
**Problem**: `acme_contact` is a single `String` but ACME supports multiple
contact emails. The `AcmeTlsConfig.contact` field is already `Vec<String>` and
the config value is wrapped in `vec![...]` at `src/tls/acceptor.rs:70`.
**Suggestion**: Change `acme_contact` to `Vec<String>` in `TlsConfig` for
consistency with the ACME protocol. This is a config format change — document
the migration.
---
### S10. Add Integration Test for Rate Limiter Using `ConnectInfo`
**File**: `tests/integration_test.rs:90-163`
**Problem**: All rate limit integration tests pass the client IP via
`X-Forwarded-For` header. No test verifies the `ConnectInfo` extraction path,
which is the primary path after C1 is fixed.
**Suggestion**: Add a test that sets `ConnectInfo` on the request extensions
(like `make_request_with_connect_info` in `src/proxy/handler.rs` tests) and
verifies rate limiting works without the `X-Forwarded-For` header.
---
## Summary Statistics
| Severity | Count | Status |
|----------|-------|--------|
| Critical | 4 | Must fix before production |
| Warning | 12 | Should fix — correctness and robustness |
| Suggestion | 10 | Consider for code quality |
## Recommended Fix Priority
1. **C1 (rate limiter X-Forwarded-For spoofing)** — Active security vulnerability.
Any external attacker can bypass rate limiting entirely.
2. **C2 (InFlightCounter never increments)** — Graceful shutdown drain logic is
completely non-functional. The drain always times out.
3. **C3 (connect timeout capped at 5s)** — Per-site timeout configuration is
silently ignored for values > 5s.
4. **C4 (JSON format not applied without log file)** — Format config is silently
ignored in a common deployment mode (stdout-only JSON logging).
5. **W1 (upstream host validation gap)** — Invalid upstream addresses pass config
validation.
6. **W2 (ACME contact validation gap)** — Empty email after `mailto:` passes
validation, fails at runtime.
7. **W3 (query string silently dropped)** — Data corruption on URI parse failure.
8. **W4 (admin socket resource limits)** — DoS vector for local socket access.
9. **W5 (TlsMode wildcard mismatch)** — Latent bug for future refactors.
10. **W6 (RawConfig/FullConfig duplication)** — Maintenance burden.
11. **Remaining W and S findings** — Fix opportunistically.