Update architecture docs to address security review #003 findings
Add three ADRs (025-027) and update five spec documents to close gaps identified in the security and bug review: - ADR-025: Rate limiter IP source must be ConnectInfo only (C1 fix) - ADR-026: Connector timeout ceiling of 30s for per-site timeouts (C3 fix) - ADR-027: Admin socket resource limits — 5s timeout, 4096 byte line limit (W4 fix) Spec changes: - proxy.md: add rate limiter IP source section, URI error handling constraint, connector ceiling description, renumber sections - operations.md: add ConnectInfo-only IP source, in-flight counter architectural requirement (C2), JSON format guarantee (C4), admin socket resource limits, 100ms drain polling interval - config.md: fix http_port type u32→u16 (W12), tighten upstream host validation (W1), tighten ACME contact validation (W2), add X-Forwarded-Proto cross-reference, clarify alknet ADR-030 reference - overview.md: fix ambiguous C1 reference, add ADR/OQ cross-references - open-questions.md: update OQ-09 resolution, add OQ-13 (acme_contact Vec) and OQ-14 (eviction configurability) - README.md: add ADR-025/026/027 and OQ-13/14, update doc statuses to draft Also fix reviewer findings: alknet ADR-030 scope clarification, RFC 2616 reference updated to RFC 7230.
This commit is contained in:
90
docs/architecture/decisions/026-connector-timeout-ceiling.md
Normal file
90
docs/architecture/decisions/026-connector-timeout-ceiling.md
Normal file
@@ -0,0 +1,90 @@
|
||||
# ADR-026: Connector Timeout Ceiling for Per-Site Timeouts
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
ADR-015 specifies per-site upstream connect timeout configuration with a default
|
||||
of 5 seconds. The proxy enforces connect timeouts using two mechanisms:
|
||||
|
||||
1. **`tokio::time::timeout`**: Wraps the entire `client.request()` call with the
|
||||
per-site `upstream_connect_timeout_secs` value.
|
||||
2. **`HttpConnector::set_connect_timeout()`**: Sets the TCP-level connect timeout
|
||||
on the hyper `HttpConnector` inside the shared client.
|
||||
|
||||
The problem: the HTTP connector's `set_connect_timeout()` is set once at client
|
||||
creation time and applies to all requests through that client. The current
|
||||
implementation hardcodes this to 5 seconds. Since the connector's internal
|
||||
timeout fires before `tokio::time::timeout`, any per-site connect timeout
|
||||
greater than 5 seconds is silently capped — the connector times out at 5s
|
||||
regardless of the configured value.
|
||||
|
||||
Three approaches exist:
|
||||
|
||||
1. **Raise the connector timeout to a high ceiling**: Set the connector's
|
||||
`set_connect_timeout` to a value higher than any reasonable per-site timeout
|
||||
(e.g., 30s). Let `tokio::time::timeout` enforce the actual per-site limit.
|
||||
The connector timeout becomes a safety ceiling, not the primary enforcement
|
||||
mechanism.
|
||||
|
||||
2. **Remove the connector timeout entirely**: Set `set_connect_timeout` to
|
||||
`None` and rely solely on `tokio::time::timeout`. This removes one layer
|
||||
of timeout enforcement but simplifies the model.
|
||||
|
||||
3. **Create per-site client instances**: Each site gets its own hyper Client
|
||||
with its own connector configured with the site's connect timeout. This is
|
||||
the most precise approach but creates many client instances and connection
|
||||
pools, increasing resource usage.
|
||||
|
||||
## Decision
|
||||
|
||||
Use approach 1: set the connector timeout to a high ceiling value (30 seconds)
|
||||
and let `tokio::time::timeout` enforce the actual per-site connect timeout.
|
||||
|
||||
The connector timeout serves as a safety ceiling — it ensures that even if the
|
||||
`tokio::time::timeout` wrapper fails or is misconfigured, TCP connections
|
||||
cannot hang indefinitely. The ceiling of 30s is well above the default 5s and
|
||||
any reasonable per-site override.
|
||||
|
||||
The `tokio::time::timeout` wrapper with the per-site value is the primary
|
||||
enforcement mechanism. It fires at the correct per-site threshold.
|
||||
|
||||
## Rationale
|
||||
|
||||
- The shared client architecture (ADR-017) means one connector timeout for all
|
||||
sites. Creating per-site clients would undermine connection pooling and
|
||||
increase resource usage.
|
||||
- A ceiling approach preserves the defense-in-depth benefit of two timeout
|
||||
layers while allowing per-site values to actually work.
|
||||
- 30s is a reasonable ceiling — no legitimate upstream connect should take
|
||||
longer than 30s. Sites that need a higher connect timeout can set the ceiling
|
||||
even higher if needed, but 30s covers all practical cases.
|
||||
- Removing the connector timeout (approach 2) removes the safety ceiling
|
||||
entirely. If `tokio::time::timeout` has a bug or is misapplied, TCP connects
|
||||
could hang indefinitely. The ceiling provides a backstop.
|
||||
- The HTTPS client uses `HttpsConnector<HttpConnector>`, which wraps the
|
||||
`HttpConnector`. The same ceiling applies to both HTTP and HTTPS clients.
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- Per-site connect timeouts work as documented (ADR-015)
|
||||
- Maintains defense-in-depth with two timeout layers
|
||||
- No change to the shared client / connection pooling architecture
|
||||
- Simple implementation: change one constant
|
||||
|
||||
**Negative:**
|
||||
- The connector timeout no longer matches the default connect timeout (5s
|
||||
default vs. 30s ceiling). Operators who read the connector timeout might
|
||||
be confused — documentation must make the ceiling role clear.
|
||||
- If a site needs a connect timeout > 30s, the ceiling must be raised. This
|
||||
is unlikely in practice but creates a hidden upper bound.
|
||||
|
||||
## References
|
||||
|
||||
- [proxy.md](../proxy.md) — Upstream connection, per-site timeouts
|
||||
- ADR-015 — Per-site upstream timeouts with defaults
|
||||
- ADR-017 — Upstream connection defaults
|
||||
- Security Review C3 — Connect timeout silently capped at 5s
|
||||
Reference in New Issue
Block a user