Update architecture docs to address security review #003 findings

Add three ADRs (025-027) and update five spec documents to close gaps
identified in the security and bug review:

- ADR-025: Rate limiter IP source must be ConnectInfo only (C1 fix)
- ADR-026: Connector timeout ceiling of 30s for per-site timeouts (C3 fix)
- ADR-027: Admin socket resource limits — 5s timeout, 4096 byte line limit (W4 fix)

Spec changes:
- proxy.md: add rate limiter IP source section, URI error handling
  constraint, connector ceiling description, renumber sections
- operations.md: add ConnectInfo-only IP source, in-flight counter
  architectural requirement (C2), JSON format guarantee (C4), admin
  socket resource limits, 100ms drain polling interval
- config.md: fix http_port type u32→u16 (W12), tighten upstream host
  validation (W1), tighten ACME contact validation (W2), add
  X-Forwarded-Proto cross-reference, clarify alknet ADR-030 reference
- overview.md: fix ambiguous C1 reference, add ADR/OQ cross-references
- open-questions.md: update OQ-09 resolution, add OQ-13 (acme_contact
  Vec) and OQ-14 (eviction configurability)
- README.md: add ADR-025/026/027 and OQ-13/14, update doc statuses to draft

Also fix reviewer findings: alknet ADR-030 scope clarification, RFC 2616
reference updated to RFC 7230.
2026-06-12 13:17:39 +00:00
parent 4f537c80d2
commit 80d1fd0fb3
9 changed files with 432 additions and 53 deletions


@@ -1,5 +1,5 @@
---
status: reviewed
status: draft
last_updated: 2026-06-12
---
@@ -32,6 +32,12 @@ The rate limiter runs as axum middleware before the proxy handler. It uses a
token bucket algorithm per client IP, matching nginx's `limit_req burst`
semantics.
The client IP for rate limiting is determined **exclusively** from
`ConnectInfo<SocketAddr>` — the TCP peer address set before TLS handshake.
Client-supplied `X-Forwarded-For` headers must not be consulted because the
rate limiter runs before the proxy handler injects trusted headers. See
ADR-025.
Rate limits are global per-IP in Phase 1 (not per-site). A request from IP
address X counts against the same bucket regardless of which site it targets.
Per-site rate limits may be added in Phase 2.
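The rule added in this hunk can be illustrated with a minimal, std-only sketch. It assumes the middleware has already extracted the TCP peer address (in axum, via `ConnectInfo<SocketAddr>`) and passes it in. The `RateLimiter` and `Bucket` types, their fields, and the `allow` method are hypothetical names for illustration, not the project's actual implementation.

```rust
use std::collections::HashMap;
use std::net::{IpAddr, SocketAddr};
use std::time::Instant;

/// Hypothetical per-IP bucket state (illustrative only).
struct Bucket {
    tokens: f64,
    last_refill: Instant,
}

/// Hypothetical global limiter: one bucket per client IP, shared by all sites.
pub struct RateLimiter {
    rate_per_sec: f64,
    burst: f64,
    buckets: HashMap<IpAddr, Bucket>,
}

impl RateLimiter {
    pub fn new(rate_per_sec: f64, burst: u32) -> Self {
        Self {
            rate_per_sec,
            burst: f64::from(burst),
            buckets: HashMap::new(),
        }
    }

    /// `peer` is the TCP peer address only. No request headers are passed in,
    /// so a client-supplied X-Forwarded-For cannot influence the bucket key.
    pub fn allow(&mut self, peer: SocketAddr, now: Instant) -> bool {
        let burst = self.burst;
        // Key on the IP alone: the source port and the target site are ignored.
        let bucket = self.buckets.entry(peer.ip()).or_insert(Bucket {
            tokens: burst,
            last_refill: now,
        });
        // Refill in proportion to elapsed time, capped at the burst size.
        let elapsed = now
            .saturating_duration_since(bucket.last_refill)
            .as_secs_f64();
        bucket.tokens = (bucket.tokens + elapsed * self.rate_per_sec).min(burst);
        bucket.last_refill = now;
        if bucket.tokens >= 1.0 {
            bucket.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}
```

Because the key is `peer.ip()`, two connections from the same address on different ports drain the same bucket, which matches the global per-IP behaviour described for Phase 1. Eviction of stale buckets is omitted here.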
@@ -142,6 +148,11 @@ ADR-024.
The `tracing-subscriber` layer configuration supports both simultaneously via
`Layer` composition.
Both output destinations must respect the `format` config value: when
`format = "json"`, both file and stdout output must use JSON formatting.
When `format = "text"`, both use text formatting. The format must not be
silently ignored in any output path (see Security Review C4).
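One way to make the format guarantee hard to violate is to render each line once and write the same bytes to every destination. The sketch below shows that idea using only the standard library. It does not use `tracing-subscriber`, and `LogFormat`, `render`, and `emit` are hypothetical names chosen for illustration.

```rust
use std::io::{self, Write};

/// Hypothetical mirror of the `format` config value.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum LogFormat {
    Text,
    Json,
}

impl LogFormat {
    /// Unknown values are rejected rather than silently falling back.
    pub fn parse(value: &str) -> Option<Self> {
        match value {
            "text" => Some(Self::Text),
            "json" => Some(Self::Json),
            _ => None,
        }
    }
}

/// Minimal JSON string escaping, sufficient for this sketch.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Render one log line in the configured format.
pub fn render(format: LogFormat, level: &str, message: &str) -> String {
    match format {
        LogFormat::Text => format!("{} {}\n", level, message),
        LogFormat::Json => format!(
            "{{\"level\":\"{}\",\"message\":\"{}\"}}\n",
            escape_json(level),
            escape_json(message)
        ),
    }
}

/// Write the same rendered line to every sink (for example file and stdout),
/// so no output path can end up with a different format.
pub fn emit(
    format: LogFormat,
    sinks: &mut [&mut dyn Write],
    level: &str,
    message: &str,
) -> io::Result<()> {
    let line = render(format, level, message);
    for sink in sinks.iter_mut() {
        sink.write_all(line.as_bytes())?;
    }
    Ok(())
}
```

With `tracing-subscriber`, the equivalent would presumably be to derive the formatter for both layers from the same config value. That wiring is an assumption and is not shown in this commit.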
### File Logging and fail2ban
File logging is the primary integration point for fail2ban. A log file on a
@@ -286,6 +297,11 @@ rationale.
one newline-terminated command, receives one newline-terminated JSON
response, then the server closes the connection.
- **Message framing**: Newline-delimited (`\n`). Responses end with `\n`.
- **Resource limits** (see ADR-027):
- Read timeout: 5 seconds. Connections that send no complete command within
5 seconds are closed. The timeout is logged at `debug` level.
- Line length limit: 4096 bytes. Connections that send more than 4096 bytes
without a newline are closed. The event is logged at `warn` level.
- **Commands**:
- `reload` — Re-read config file, validate, and swap DynamicConfig. Returns
`{"status": "ok"}` or `{"status": "error", "message": "..."}`.
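The resource limits listed above can be sketched with the standard library alone. This is an illustrative, blocking version. The function and type names (`read_command`, `CommandError`, `apply_read_timeout`) are assumptions, and the real admin socket is presumably async.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Line length limit from ADR-027.
pub const MAX_LINE_BYTES: usize = 4096;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum CommandError {
    /// More than MAX_LINE_BYTES arrived without a newline.
    LineTooLong,
    /// The peer closed the connection before sending a newline.
    Incomplete,
    /// Any I/O failure, including a read timeout.
    Io(io::ErrorKind),
}

/// Read one newline-terminated command without buffering more than
/// MAX_LINE_BYTES + 1 bytes, however much the peer sends.
pub fn read_command<R: Read>(stream: R) -> Result<String, CommandError> {
    // `take` caps the bytes read; the extra byte leaves room for the newline
    // that follows a command of exactly MAX_LINE_BYTES.
    let mut limited = BufReader::new(stream.take(MAX_LINE_BYTES as u64 + 1));
    let mut line = Vec::new();
    limited
        .read_until(b'\n', &mut line)
        .map_err(|e| CommandError::Io(e.kind()))?;
    if line.last() != Some(&b'\n') {
        return Err(if line.len() > MAX_LINE_BYTES {
            CommandError::LineTooLong
        } else {
            CommandError::Incomplete
        });
    }
    line.pop(); // strip the trailing newline
    String::from_utf8(line).map_err(|_| CommandError::Io(io::ErrorKind::InvalidData))
}

/// Apply the 5 second read timeout to a blocking Unix socket.
/// Note: this bounds each individual read call, not the whole command, so a
/// full implementation would also need an overall deadline.
#[cfg(unix)]
pub fn apply_read_timeout(stream: &std::os::unix::net::UnixStream) -> io::Result<()> {
    stream.set_read_timeout(Some(std::time::Duration::from_secs(5)))
}
```

A caller would log `LineTooLong` at `warn` and a timeout at `debug` before closing the connection, as the spec text requires.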
@@ -309,9 +325,17 @@ On SIGTERM or SIGINT, the proxy performs a graceful shutdown:
2. **Close idle keep-alive connections** — Send `Connection: close` on any idle
connections in the keep-alive pool.
3. **Wait for in-flight requests** — Up to `shutdown_timeout_secs` (default: 30)
for active requests to complete. Server tasks are joined (not aborted) so
that in-flight requests can drain normally. Only after the timeout expires
are remaining tasks aborted.
for active requests to complete. The proxy tracks in-flight requests using
an atomic counter: each request **must** increment the counter when it
begins and decrement when it completes (via guard drop). The increment
must happen before the request task is spawned — if the counter is not
incremented, the drain logic is broken (see Security Review C2). During
drain, the proxy polls the counter every 100ms and exits early
when it reaches zero. If the timeout expires before all requests complete,
the proxy logs how many in-flight requests remain and proceeds to
force-close. Server tasks are joined (not aborted) so that in-flight
requests can drain normally. Only after the timeout expires are remaining
tasks aborted.
4. **Force-close remaining connections** — After the timeout, any remaining
connections are forcefully closed via TCP RST.
5. **Cancel background tasks** — ACME renewal tasks, rate limiter eviction task,
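The in-flight counter described in step 3 can be sketched as follows. This is a thread-based illustration using only the standard library. The proxy itself is presumably async, and `InFlight`, `InFlightGuard`, `begin`, and `drain` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Shared counter of requests currently being served.
#[derive(Clone, Default)]
pub struct InFlight(Arc<AtomicUsize>);

/// Decrements the counter when dropped, even if the request handler panics.
pub struct InFlightGuard(Arc<AtomicUsize>);

impl InFlight {
    /// Call this BEFORE spawning the request task and move the guard into it,
    /// so the drain loop can never observe zero while a request is pending.
    pub fn begin(&self) -> InFlightGuard {
        self.0.fetch_add(1, Ordering::SeqCst);
        InFlightGuard(Arc::clone(&self.0))
    }

    pub fn count(&self) -> usize {
        self.0.load(Ordering::SeqCst)
    }

    /// Poll until the counter reaches zero or `timeout` expires.
    /// Returns the number of requests still in flight (0 means fully drained).
    pub fn drain(&self, timeout: Duration, poll: Duration) -> usize {
        let deadline = Instant::now() + timeout;
        loop {
            let remaining = self.count();
            let now = Instant::now();
            if remaining == 0 || now >= deadline {
                return remaining;
            }
            // Never sleep past the deadline.
            thread::sleep(poll.min(deadline.saturating_duration_since(now)));
        }
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        self.0.fetch_sub(1, Ordering::SeqCst);
    }
}
```

With the values from the spec, shutdown would call something like `drain(Duration::from_secs(30), Duration::from_millis(100))` and log the returned count if it is non-zero before force-closing.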
@@ -592,11 +616,13 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
| [014](decisions/014-unix-socket-reload.md) | Unix domain socket config reload API | Programmatic reload with success/failure feedback |
| [020](decisions/020-container-deployment.md) | Container deployment model | Defense-in-depth via container isolation; file-primary logging |
| [024](decisions/024-ansi-disabled-logging.md) | ANSI-disabled logging | All log output uses `with_ansi(false)` for fail2ban and Docker compatibility |
| [025](decisions/025-rate-limiter-ip-source.md) | Rate limiter IP source | ConnectInfo only, never client-supplied X-Forwarded-For |
| [027](decisions/027-admin-socket-resource-limits.md) | Admin socket resource limits | 5s read timeout, 4096 byte line length limit |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). All
questions affecting this document have been resolved:
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- ~~**OQ-03**: Should the health check endpoint be on a separate port?~~ (resolved
— ADR-013: separate local port, default 9900, localhost only)
@@ -605,4 +631,6 @@ questions affecting this document have been resolved:
9900 and admin socket only)
- ~~**OQ-12**: Should request access logging be mandatory or optional?~~ (resolved
— access logging is mandatory and always-on at `info` level; no configuration
option to disable it)
option to disable it)
- **OQ-14**: Should rate limiter eviction interval and max age be configurable?
(see [open-questions.md](open-questions.md))