diff --git a/docs/reviews/001-architecture-gap-review.md b/docs/reviews/001-architecture-gap-review.md new file mode 100644 index 0000000..e087daa --- /dev/null +++ b/docs/reviews/001-architecture-gap-review.md @@ -0,0 +1,926 @@ +--- +status: draft +last_updated: 2026-06-11 +reviewed_documents: + - overview.md + - proxy.md + - tls.md + - config.md + - operations.md + - decisions/001 through 020 + - open-questions.md +reviewer: architect +--- + +# Architecture Gap Review #001 + +## Purpose + +This document captures gaps, ambiguities, and inconsistencies found in the +architecture specifications before decomposition into implementation tasks. Each +finding is categorized by severity and structured as **Problem** → +**Suggestion** (or **Open Question** where no solution is yet known). + +## Severity Definitions + +| Severity | Meaning | +|----------|---------| +| **Critical** | Must resolve before implementation — would cause divergent implementations or security issues | +| **Warning** | Should resolve — missing specifications that could cause confusion or rework | +| **Suggestion** | Consider — improvements for clarity or completeness | + +--- + +## Critical Findings + +### C1. Per-Listener vs. Global Site Routing Contradiction + +**Documents**: config.md (lines 144–148), proxy.md (lines 76–83), config.md validation rule 7 (line 307) + +**Problem**: The spec contains contradictory statements about site routing: + +- config.md says "Each listener routes its own sites independently" (per-listener + routing) +- config.md also says "The `DynamicConfig` collects all sites across all + listeners for hot-reload via `ArcSwap`" (global collection) +- Validation rule 7 says "Site `host` values must be unique across all listeners + (no duplicate hostnames, even across different listeners). Duplicate hostnames + would create ambiguous routing — the proxy would not know which listener's + upstream to route a request to when the `Host` header matches multiple sites." + +The uniqueness constraint only makes sense if site lookup is global. If routing +is per-listener, the same hostname on different listeners is not ambiguous — +each listener only routes to its own sites. These two statements cannot both be +true simultaneously. + +**Suggestion**: Adopt **global routing** as the model. Rationale: + +- ArcSwap atomically swaps all routing at once — per-listener routers that each + read their own slice would need coordination +- Rate limiting and body limiting are global, so a global site lookup is already + happening +- Per-listener uniqueness constraint simplifies matching: one Host → one + upstream, always +- This matches how nginx `server_name` works — Host header always resolves to + one backend + +Relevant spec changes: +- config.md: rewrite the sites description to say "Sites are defined per-listener + in config but collected into a single global routing table in `DynamicConfig`. + Hostnames must be unique across all listeners." +- proxy.md: clarify that host-based routing uses a global lookup across all sites +- overview.md architecture diagram: show that all sites are in one `DynamicConfig` + accessible to all listeners +- The per-listener `[[listeners.sites]]` in TOML is purely a config organization + concern — the runtime behavior is global routing + +--- + +### C2. X-Forwarded-For Security Model Is Underspecified + +**Documents**: proxy.md (lines 87–98) + +**Problem**: The spec says X-Forwarded-For should "append the client IP to any +existing value" to "support chained proxies correctly." But this creates a +security vulnerability: any client can inject a fake `X-Forwarded-For` header, +and the proxy will append to it, making the header unreliable for upstream +services. + +The spec doesn't define: +- Whether the proxy should **trust** existing X-Forwarded-For headers +- Whether there's a concept of "trusted proxies" whose X-Forwarded-For values + should be preserved +- What IP is used as the "client IP" — is it always `ConnectInfo` + or does it consider X-Forwarded-For from trusted proxies? + +For a security-focused reverse proxy (the entire motivation is replacing nginx +due to CVEs), this is a critical gap. + +**Suggestion**: For Phase 1, since the proxy is the **outermost proxy** directly +facing the internet, there are no trusted proxies in front of it. Therefore: + +- **X-Forwarded-For**: Set to the client's IP from `ConnectInfo`. + Replace any existing value rather than appending. A client connecting directly + to the proxy has no legitimate X-Forwarded-For chain. +- **X-Real-IP**: Set to the client's IP from `ConnectInfo`. +- Document this as a Phase 1 decision: the proxy is an edge proxy, not a + middle proxy in a chain. If chained proxy support is needed in the future, + a "trusted proxies" configuration can be added. + +This should be documented as a new ADR or added to the proxy.md spec. + +--- + +### C3. Request Header Forwarding Rules Are Incomplete + +**Documents**: proxy.md (lines 100–109) + +**Problem**: The spec says "Copy the request method, headers, and body from the +original" but doesn't specify which headers should be **removed** before +forwarding. HTTP reverse proxies must handle hop-by-hop headers (per RFC 2616 +§13.5.1): `Connection`, `Keep-Alive`, `Transfer-Encoding`, `TE`, `Trailers`, +`Upgrade`, and `Proxy-*` headers. Blindly copying all headers can cause: + +- Connection header conflicts between client and upstream +- Transfer-Encoding mismatches (especially with chunked encoding) +- Security issues from Proxy-Authorization leakage + +The spec also doesn't specify whether the proxy adds a `Via` header (RFC 7230 +§5.7.1) or a `Server` header on responses. + +**Suggestion**: Add a "Header Handling" section to proxy.md that explicitly +lists: + +**Headers to remove before forwarding (hop-by-hop):** +- `Connection` +- `Keep-Alive` +- `Proxy-Authorization` +- `Proxy-Authenticate` +- `TE` +- `Trailers` +- `Transfer-Encoding` +- `Upgrade` + +**Headers to add/modify:** +- `X-Real-IP`: Set to client IP from `ConnectInfo` +- `X-Forwarded-For`: Set to client IP (replacing existing — see C2) +- `X-Forwarded-Proto`: Set to `https` or `http` based on listener port +- `Host`: Preserved as-is from original request + +**Headers NOT added in Phase 1:** +- `Via`: Not added. The proxy is an edge proxy and `Via` is primarily for + tracking proxy chains. Can be added in Phase 2 if needed. + +**Response headers:** +- Upstream response headers are forwarded as-is with the exception of + hop-by-hop headers listed above. +- The proxy does not add a `Server` header to responses. + +--- + +### C4. ACME Startup Failure Behavior Is Undefined + +**Documents**: tls.md (lines 66–109) + +**Problem**: The spec describes ACME certificate provisioning in detail but never +specifies what happens when ACME provisioning fails. Key undefined scenarios: + +- **First start, no cached cert, ACME unreachable**: The proxy cannot serve TLS + at all. Does it fail to start? Start without TLS? Start and serve TLS errors? +- **First start, no cached cert, ACME succeeds**: Normal case, already described. +- **Start with cached cert, ACME unreachable for renewal**: The proxy has a + valid cert but can't renew. Does it start? Log errors? Retry? +- **Renewal failure after startup**: The cert is near expiry and renewal fails. + What happens? + +Let's Encrypt unavailability is a real operational scenario (network outages, +rate limiting, DNS issues). + +**Suggestion**: Define ACME failure behavior explicitly: + +| Scenario | Behavior | +|----------|----------| +| First start, no cached cert, ACME fails | **Fail to start** with clear error message. Proxy cannot serve TLS without a certificate. | +| Start with cached cert, ACME renewal fails | **Start normally** with cached cert. Log error at `warn` level. `rustls-acme` retries per its built-in schedule. | +| Renewal failure after startup | **Continue serving existing cert**. Log error at `warn` level. `rustls-acme` retries per its built-in schedule. | +| Cached cert expired, renewal fails | **Fail to start** if cert is expired at startup. If cert expires during runtime, continue serving it (clients will receive certificate errors — this is the correct behavior, not silently dropping TLS). | + +Add a "Certificate Failure Behavior" section to tls.md documenting these +scenarios. The key principle: **never start without a valid TLS certificate**, +but **always continue serving if a valid cert exists**, even if renewal fails. + +--- + +### C5. Startup Sequence and Partial Failure Not Specified + +**Documents**: overview.md (architecture diagram), operations.md (systemd integration), config.md (validation) + +**Problem**: The startup sequence is not documented anywhere. With multiple +listeners, each with their own TLS configuration, there's a complex startup +dependency chain. Undefined scenarios: + +- If listener 1 binds successfully but listener 2 fails (port in use), does the + proxy partially start or fail entirely? +- If manual TLS cert files are missing, does startup fail before or after + binding ports? +- What order do components start in? (health check port, admin socket, HTTP + listeners, HTTPS listeners, ACME background tasks, rate limiter eviction task) +- The systemd unit uses `Type=notify` with `sd_notify`, but when exactly is + "ready" signaled? + +**Suggestion**: Add a "Startup Sequence" section to operations.md: + +1. **Parse and validate config** — If validation fails, exit with non-zero code + and log errors. No ports are bound. +2. **Initialize DynamicConfig** — Load sites, rate limits, body limits into + ArcSwap. +3. **Initialize shared state** — Rate limiter HashMap, hyper Client, tracing + subscriber. +4. **Bind health check port** (if enabled) — Fail-fast if bind fails. +5. **Bind admin socket** (if enabled) — Remove stale socket file first. Fail-fast + if bind fails. +6. **Bind all listener ports** — For each listener: bind HTTP port, bind HTTPS + port. If any bind fails, fail-fast and exit. All ports are bound before + proceeding. +7. **Load TLS configuration** — For each listener: load manual certs or + initialize ACME state machine. If manual cert loading fails, fail-fast and + exit. For ACME: if no cached cert exists and ACME provisioning fails, fail-fast + and exit (see C4). +8. **Start TCP listeners** — Begin accepting connections on all bound ports. +9. **Start background tasks** — ACME renewal tasks, rate limiter eviction task, + signal handler task. +10. **Signal readiness** — `sd_notify("READY=1")` to systemd. + +**Failure semantics**: **Fail-fast**. If any step fails, the process exits with +a non-zero code. The proxy does not partially start. All ports are bound before +any connections are accepted. + +--- + +### C6. Shared vs. Per-Listener Router Architecture Is Ambiguous + +**Documents**: tls.md (lines 38–50), proxy.md (lines 76–83), overview.md (lines 83–122) + +**Problem**: The architecture diagram shows each listener having its own "axum +router (per-listener sites)." However, the `ArcSwap` pattern +suggests a single global config that all routers read from. This raises an +unanswered design question: does each listener get its own `axum::Router` +instance, or do all listeners share a single `Router`? + +This affects: +- Rate limiter state sharing +- Site routing lookup logic +- Middleware ordering +- How `ConnectInfo` is available for X-Real-IP injection + +**Suggestion**: Each listener gets its own `axum::Router` instance. Rationale: + +- The architecture diagram already shows this model +- Each listener has its own TCP stream context, making `ConnectInfo` per-listener +- Each listener can have its own middleware stack (e.g., different body limits + in Phase 2) +- Shared state is achieved via `Arc>` and + `Arc>>` passed as axum State + +However, despite having per-listener Routers, **site routing is global** (see C1). +All routers read from the same `DynamicConfig` site table. Host uniqueness across +listeners ensures unambiguous routing. + +Update the architecture diagram and proxy.md to clarify: +- Each listener has its own `axum::Router` instance +- All routers share `Arc>` and + `Arc>>` via axum State +- Host-based routing uses a global lookup across all sites + +--- + +### C7. Rate Limiter State Behavior on Config Reload Is Undefined + +**Documents**: config.md (lines 155–168), operations.md (lines 30–48), ADR-008 (lines 38–39) + +**Problem**: `DynamicConfig` includes `rate_limit.requests_per_second` and +`rate_limit.burst`, which are hot-reloadable via ArcSwap. The rate limiter uses +an in-memory `HashMap`. When the rate limit parameters +change (e.g., from 10 req/s burst 20 to 20 req/s burst 40): + +- Do existing token buckets get their parameters updated immediately? +- Are existing buckets drained/reset to match the new burst capacity? +- Are they left unchanged with old parameters while only new IPs get new + parameters? +- Is the entire HashMap cleared on reload (losing all state)? + +This is not specified anywhere and directly affects production behavior. A +reload that clears all buckets creates a temporary rate-limiting gap; a reload +that doesn't update existing buckets creates inconsistent rate limiting. + +**Suggestion**: On config reload, existing token buckets **adopt the new +rate/burst parameters on their next request**. The bucket's current token count +is capped at the new burst maximum. The HashMap is NOT cleared — this avoids +creating a rate-limiting gap. Specifically: + +1. New `DynamicConfig` is swapped in via ArcSwap +2. On the next request from an existing IP, the rate limiter reads the current + `DynamicConfig` for rate/burst parameters +3. The token bucket refills using the new rate, and its capacity is set to the + new burst maximum +4. If the current token count exceeds the new burst maximum, it's capped to + the new burst maximum + +This means: no gap in rate limiting, no burst of uncontrolled traffic, and +parameters converge for all IPs within one request cycle. + +Document this explicitly in operations.md under the "Rate Limiting" section. + +--- + +## Warning Findings + +### W1. Admin Socket Wire Protocol Is Underspecified + +**Documents**: ADR-014 (lines 33–37), config.md (line 91), operations.md (lines 224–228) + +**Problem**: ADR-014 specifies that the admin socket accepts `reload` and +`status` commands returning JSON, but doesn't define: +- Connection lifecycle: Is each command a new connection, or can multiple + commands be sent on one connection? +- Message framing: How does the client know when the response is complete? +- Error cases: What happens with invalid commands, binary data, extremely long + lines? +- Concurrency: Can multiple clients connect simultaneously? +- Socket cleanup: What happens if the proxy starts and a stale socket file + exists from a previous crash? + +**Suggestion**: Specify the protocol explicitly: + +- **Connection lifecycle**: One command per connection. Client connects, sends + one newline-terminated command, receives one newline-terminated JSON response, + connection closes. +- **Message framing**: Newline-delimited (`\n`). The response ends with `\n`. + Client reads until `\n` and then the connection is closed by the server. +- **Error responses**: Unrecognized commands return + `{"status": "error", "message": "unknown command: "}`. Invalid/empty + input returns `{"status": "error", "message": "invalid input"}`. +- **Concurrency**: Multiple clients can connect simultaneously. Commands are + serialized (see W6 — reload serialization). +- **Socket cleanup**: The proxy removes any existing socket file at startup + before binding. If the file exists and another process is listening, the + proxy logs a warning and fails to start the admin socket (but continues + running — admin socket is non-critical). + +Add this to operations.md in the "Admin Socket" section. + +--- + +### W2. Host Header Port Disambiguation + +**Documents**: proxy.md (lines 76–83), config.md (line 138) + +**Problem**: The spec says routing uses axum's `Host` extractor and site `host` +values are hostnames like `"git.alk.dev"`. But the HTTP `Host` header can include +a port (e.g., `git.alk.dev:443`). The spec doesn't specify: + +- Whether the Host header port is stripped before matching +- Whether site `host` values should include ports +- What happens when a request arrives with `Host: 203.0.113.10` (IP address) + +**Suggestion**: +- Host header matching **strips the port** before comparison. The `Host` + extractor in axum already provides the hostname without port. +- Site `host` values must **not** include ports. Add a validation rule that + `host` values are valid hostnames (not IP addresses, not including ports, + lowercase only — matching the case-insensitive rule in S3). +- Requests with an IP address as the Host header (no matching site) receive a + 404 response, same as any unknown host. +- Add this to config.md validation rules and proxy.md request flow. + +--- + +### W3. HTTP Redirect Host Header Handling + +**Documents**: proxy.md (lines 134–138) + +**Problem**: The spec says the HTTP→HTTPS redirect "reads the `Host` header from +the incoming request and returns a 301 Permanent Redirect to the HTTPS +equivalent URL (preserving the path and query string)." But: + +- What if the Host header includes a non-standard port (e.g., + `Host: example.com:8080`)? +- Should the redirect use the listener's `https_port` if it's not 443? +- What if the incoming request has no Host header (HTTP/1.0)? + +**Suggestion**: +- The redirect URL is constructed as: + `https://{host}:{https_port}/{path}?{query}` + where the port is omitted if it's 443 (the default HTTPS port). +- If the incoming Host header includes a port, only the hostname portion is + used in the redirect URL. The `https_port` from the listener config is used. +- If the incoming request has no Host header, return `400 Bad Request`. +- Add this to proxy.md under "HTTP → HTTPS Redirect." + +--- + +### W4. Health Check HTTPS Endpoint Routing + +**Documents**: operations.md (lines 146–161), proxy.md (lines 28–29) + +**Problem**: Operations.md says `/health` is "also available on the main HTTPS +listener" but doesn't specify how this is implemented in the routing. The +proxy.md architecture diagram shows `/health → 200 OK` on the axum Router, but +this router does host-based routing. If `/health` is on the same router, it +needs to be matched before host-based routing, or it needs to be available on +any Host header. + +**Suggestion**: Specify that `/health` on the HTTPS listener is a **top-level +route** that matches regardless of Host header. It is evaluated before host-based +routing. A request to `GET /health` on any hostname returns `200 OK`. + +This means the axum Router has two layers: +1. Path-based routes: `/health` → 200 OK (regardless of Host) +2. Host-based routes: everything else → site lookup → proxy or 404 + +Add this to proxy.md under "Host-Based Routing." + +--- + +### W5. Config Reload Doesn't Specify Static Config Change Handling + +**Documents**: config.md (lines 155–168), ADR-008 + +**Problem**: The config file contains both StaticConfig and DynamicConfig. If +someone changes a static config field (like `bind_addr` or TLS mode) and sends +SIGHUP: +- Is the static config portion silently ignored? +- Is a warning logged? +- Is the entire reload rejected? + +The spec says "Changes [to StaticConfig] require a process restart" but +doesn't specify what happens when the config file contains both static and +dynamic changes during a reload. + +**Suggestion**: On reload: + +1. Read the entire config file from disk +2. Validate the **full** config (static + dynamic) for syntax and structural + correctness +3. If the full config fails validation, reject the reload and log the error. + The old DynamicConfig remains active. +4. If the full config passes validation, compare the static portion against the + currently running StaticConfig. If any static fields have changed, log a + warning listing the changed fields and noting that a restart is required for + those changes to take effect. +5. Swap the DynamicConfig via ArcSwap. + +This gives operators early feedback about config drift (warning on static changes) +while still applying dynamic changes immediately. Add this to config.md under +"Config Reload." + +--- + +### W6. Config Reload Race Condition With Simultaneous SIGHUP and Admin Socket + +**Documents**: config.md (lines 173–184), operations.md (lines 209–227) + +**Problem**: Both SIGHUP and the admin socket `reload` command converge on the +same code path. If a SIGHUP and an admin socket `reload` arrive nearly +simultaneously, two reloads could be in progress at the same time. Without +serialization, concurrent file reads and ArcSwap operations could race, +potentially: +- Applying an older config over a newer one +- Causing confusing log output (two CONFIG_RELOAD events for the same config) +- Admin socket returning stale status + +**Suggestion**: Serialize reload operations using a `tokio::sync::Mutex` on the +reload path. This ensures only one reload is in progress at a time. If a +second reload is requested while one is in progress, it waits for the first to +complete, then re-reads the file (getting the latest version) and proceeds. + +Document this in config.md under "Config Reload" and note it in ADR-008 or +ADR-014. + +--- + +### W7. Missing Validation: http_port Conflicts and Port Rules + +**Documents**: config.md (lines 296–314) + +**Problem**: Validation rule 3 states "Each listener's `bind_addr` and +`https_port` combination must be unique." But: + +- `http_port` conflicts are not checked (two listeners with the same + `bind_addr` and same `http_port` would fail at bind time) +- `http_port` and `https_port` could conflict within the same listener + (e.g., both set to 443) +- `https_port` must be required (cannot be 0) since TLS requires a port +- `http_port` can be 0 (disabled) but must be a valid port otherwise + +**Suggestion**: Add these validation rules: + +| Rule | Description | +|------|-------------| +| Each `bind_addr` + `http_port` combination must be unique | Prevents bind-time errors | +| Within a listener, `http_port` ≠ `https_port` | Prevents same-port conflict | +| `https_port` must be 1–65535 | TLS requires a port | +| `http_port` must be 0 (disabled) or 1–65535 | Valid port range or explicitly disabled | +| `health_check_port` must not conflict with any listener port | Prevents bind conflict | + +Update config.md validation section with these rules. + +--- + +### W8. Missing Validation: `upstream` Field Format + +**Documents**: config.md (lines 139–143) + +**Problem**: The `upstream` field is described as supporting Docker DNS, +loopback, LAN IPs, and tunnel endpoints, but there's no formal validation rule +for what format it accepts. Ambiguities: + +- A hostname without a port? (`gitea` — what port?) +- A URL with a scheme? (`http://gitea:3000` — conflicting with `upstream_scheme`) +- Just an IP address? (`10.0.0.5` — what port?) +- An IPv6 address? (`[::1]:3000`) + +Since `upstream_scheme` is a separate field, `upstream` should be `host:port` +without scheme, but this isn't explicitly validated. + +**Suggestion**: Add validation rules: + +- `upstream` must be in `host:port` format where `host` is a valid DNS name or + IP address (IPv4 or IPv6 in bracket notation), and `port` is a required + integer 1–65535 +- The `upstream_scheme` field handles the protocol — `upstream` must NOT include + a scheme prefix (no `http://` or `https://`) +- Examples of valid values: `gitea:3000`, `127.0.0.1:3000`, `[::1]:3000` +- Examples of invalid values: `gitea` (missing port), `http://gitea:3000` + (includes scheme), `10.0.0.5` (missing port) + +Update config.md validation section with these rules. + +--- + +### W9. TLS Handshake Failure Behavior Is Underspecified + +**Documents**: tls.md + +**Problem**: The spec details certificate provisioning and SNI selection but +never describes what happens when a TLS handshake fails. Possible failure +scenarios: + +- Client sends an SNI hostname that doesn't match any certificate (manual mode) +- Client doesn't send SNI at all +- TLS protocol version is too old (TLS 1.0/1.1) +- Cipher suite negotiation fails +- Certificate has expired (manual mode) + +**Suggestion**: Add a "TLS Error Handling" section to tls.md: + +| Scenario | Behavior | +|----------|----------| +| SNI hostname doesn't match any certificate (manual mode) | TLS handshake fails with `unrecognized_name` alert. Log at `warn` level with client IP and SNI hostname. | +| No SNI extension sent by client | TLS handshake fails with `handshake_failure` alert. Log at `warn` level with client IP. | +| Unsupported TLS version (1.0/1.1) | TLS handshake fails with `protocol_version` alert. Log at `info` level. | +| Cipher suite negotiation fails | TLS handshake fails with `handshake_failure` alert. Log at `info` level with client IP. | +| Certificate expired (manual mode) | Connection fails. Log at `error` level. Proxy continues serving other connections. | +| ACME renewal failure | Continue serving existing certificate. Log at `warn` level. Retry per rustls-acme schedule. | + +--- + +### W10. IPv6 Rate Limiting Not Addressed + +**Documents**: operations.md (lines 30–48), ADR-006 + +**Problem**: Rate limiting is described as "per IP" but doesn't address IPv6. +An attacker with a /64 IPv6 prefix has 2^64 unique addresses, completely +bypassing per-IP rate limiting. + +**Suggestion**: Specify IPv6 rate limiting behavior: + +- IPv4 addresses are rate-limited per individual address (`/32`) +- IPv6 addresses are rate-limited per `/64` prefix (per RFC 4941 and common + practice). All addresses in the same `/64` share the same token bucket. +- The rate limiter stores `IpAddr` internally but normalizes IPv6 addresses to + their `/64` prefix before lookup. + +Add this to operations.md under "Rate Limiting" and note it in ADR-006. + +--- + +### W11. Graceful Shutdown Doesn't Specify Connection Draining Behavior + +**Documents**: operations.md (lines 205–234), ADR-009 + +**Problem**: The graceful shutdown description says "stop accepting new +connections, wait up to 30 seconds for in-flight requests to complete" but +doesn't specify: + +- Does "stop accepting" mean closing the listening socket, or rejecting new + connections? +- What happens to idle keep-alive connections? +- After the timeout, how are remaining connections terminated? +- What happens to ACME background tasks? + +**Suggestion**: Specify the shutdown sequence explicitly: + +1. **Signal received** (SIGTERM/SIGINT) +2. **Stop accepting new connections** — Close all listening sockets. No new + connections are accepted. +3. **Close idle keep-alive connections** — Send `Connection: close` header on + any idle connections in the keep-alive pool. +4. **Wait for in-flight requests** — Up to `shutdown_timeout_secs` (default 30) + for active requests to complete. +5. **Force-close remaining connections** — After timeout, TCP RST on any + remaining connections. +6. **Cancel background tasks** — ACME renewal tasks, rate limiter eviction + task, admin socket listener are all cancelled. +7. **Exit with code 0** + +Add this to operations.md under "Graceful Shutdown." + +--- + +### W12. Response Error Body Content Is Not Specified + +**Documents**: proxy.md (lines 122–131) + +**Problem**: The error handling table specifies HTTP status codes (502, 504, 413, +429, 404) but doesn't specify the response body content. For a security-focused +reverse proxy, the error response body matters: + +- Should error responses include HTML, plain text, or JSON? +- Should they include the proxy's version/identity? +- Should they include upstream error details? + +**Suggestion**: Specify that error responses are minimal plain text, following +this format: + +- Response body: Plain text with a brief message (e.g., `Bad Gateway`, `Not + Found`) +- No proxy version or identity information +- No upstream error details +- Content-Type: `text/plain; charset=utf-8` + +Example error responses: +- 404: `Not Found` +- 413: `Payload Too Large` +- 429: `Too Many Requests` +- 502: `Bad Gateway` +- 504: `Gateway Timeout` + +Add a "Error Responses" section to proxy.md. + +--- + +### W13. `upstream_scheme` HTTPS Upstream TLS Behavior + +**Documents**: config.md (line 140), ADR-017 + +**Problem**: `upstream_scheme` accepts `"http"` or `"https"`, but when +`upstream_scheme = "https"`, the spec doesn't specify how the proxy validates +the upstream's TLS certificate. Does it trust all certs? Use the system CA +bundle? Pin specific CAs? + +If the proxy trusts all upstream certs, it negates the security benefits of TLS +termination. + +**Suggestion**: When `upstream_scheme = "https"`: + +- The proxy uses the system's native TLS root certificates (via `rustls` + root cert store) to validate the upstream's certificate. +- No certificate pinning or custom CA support in Phase 1. +- Certificate validation failures result in a 502 Bad Gateway response. +- `upstream_scheme` values are case-sensitive (lowercase only). + +Add this to proxy.md under "Upstream Connection" and to config.md validation +rules. + +--- + +### W14. `allow_wildcard_bind` CLI Flag vs Config File Precedence + +**Documents**: config.md (line 299), ADR-016, ADR-020 + +**Problem**: The `allow_wildcard_bind` setting can be enabled either in the +config file or via CLI flag, but the merge semantics aren't specified. What +happens when: + +- Config file has `allow_wildcard_bind = false` but CLI flag is + `--allow-wildcard-bind`? +- Config file has `allow_wildcard_bind = true` but no CLI flag? +- Both are set? + +**Suggestion**: CLI flags take precedence over config file settings. Specifically: +`allow_wildcard_bind` is `true` if EITHER the config file sets it to `true` OR +the `--allow-wildcard-bind` CLI flag is present. This is an OR relationship — +either source enables it. + +Document this explicitly in config.md and ADR-016. + +--- + +### W15. ADR-010 Phase 2 List Is Stale + +**Documents**: ADR-010 (lines 51–53) + +**Problem**: ADR-010 lists "Per-site upstream timeouts" as a Phase 2 feature, +but ADR-015 already moved per-site upstream timeouts into Phase 1. The Phase 2 +list in ADR-010 is misleading for implementers. + +**Suggestion**: Update ADR-010's Phase 2 list to remove "Per-site upstream +timeouts" and add a note that it was moved to Phase 1 via ADR-015. + +--- + +### W16. Docker Compose Example Doesn't Show Multi-Listener Setup + +**Documents**: operations.md (lines 326–379) + +**Problem**: The Docker Compose example only publishes ports for a single +listener. The architecture supports multiple listeners on different IPs, but the +example doesn't demonstrate how to publish multiple listeners' ports. + +**Suggestion**: Add a multi-listener Docker Compose example showing how to +publish ports for multiple listeners. Document that each listener's +`https_port` and `http_port` must be published separately. + +--- + +### W17. LoggingConfig Is Static But `log_file_path` Might Need Runtime Changes + +**Documents**: config.md (lines 95–101), ADR-008 + +**Problem**: The entire `LoggingConfig` including `log_file_path` is under +StaticConfig, meaning the log file path cannot be changed without a restart. +This isn't explicitly called out as a trade-off. + +**Suggestion**: Add an explicit note in config.md that `LoggingConfig` (including +`log_file_path`) is static and requires a restart to change. Document the +rationale: log file path changes require reopening file handles, which is +complex and low-value for Phase 1. Log rotation (Phase 2) will be handled +differently (signal-based or built-in rotation). + +--- + +## Suggestions + +### S1. Add a Data Flow Diagram Showing Request Path With Data Structures + +**Documents**: proxy.md + +**Suggestion**: Add a more detailed data flow diagram showing: +- `DynamicConfig` → site lookup (how Host header maps to SiteConfig) +- SiteConfig → timeout selection (upstream_connect_timeout_secs, + upstream_request_timeout_secs) +- How `hyper::Request` is constructed from the original request + proxy headers +- How the `hyper::Client` dispatches and streams the response back + +This would help implementers understand state flow without guessing. + +--- + +### S2. Document ConnectInfo Propagation + +**Documents**: proxy.md (line 93) + +**Suggestion**: Add a note explaining how `ConnectInfo` is populated. +With `tokio-rustls`, the proxy needs to extract the peer address from the +`TlsStream` and inject it as `ConnectInfo` via axum's server +configuration. This is an implementation detail that should be noted since it +affects both the TLS layer and the proxy handler. + +Specifically: when accepting a TLS connection, the proxy extracts +`TcpStream::peer_addr()` before wrapping in `TlsStream`, and provides it to +axum via `axum::ServiceExt::into_make_service_with_connect_info::()`. + +--- + +### S3. Specify That Host Matching Is Case-Insensitive + +**Documents**: proxy.md (lines 76–83), config.md (line 138) + +**Suggestion**: RFC 7230 §2.7.3 specifies that host comparison should be +case-insensitive. The spec should explicitly state: + +- Host matching is case-insensitive (`Git.Alk.Dev` matches `git.alk.dev`) +- `host` values in config are normalized to lowercase during validation +- The Host header is normalized to lowercase before matching + +--- + +### S4. Consider Adding Request ID / Correlation ID + +**Documents**: proxy.md, operations.md + +**Suggestion**: The logging format includes `duration_ms` but no request ID. +Adding a `X-Request-Id` header (UUID) generated per request and included in both +the forwarded request and the access log would significantly improve operational +debugging. This is a common reverse proxy practice. + +If this is deferred to Phase 2, note it explicitly as a Phase 2 candidate. + +--- + +### S5. Clarify Response Streaming Behavior + +**Documents**: proxy.md (line 69) + +**Suggestion**: The spec says "Stream response back to the client" but doesn't +clarify: +- Are upstream response bodies streamed chunk-by-chunk, or buffered entirely? +- What happens if the client disconnects while the upstream is still sending? +- What happens if the upstream disconnects while streaming a response? + +Recommend explicitly stating: +- Responses are streamed chunk-by-chunk (not buffered) +- If the client disconnects, the upstream connection is closed and the event + is logged at `debug` level +- If the upstream disconnects mid-stream, the client receives whatever data + was already sent and the connection is closed + +--- + +### S6. Specify Rate Limit Bucket Semantics More Precisely + +**Documents**: operations.md (lines 25–27), config.md (lines 130–132) + +**Suggestion**: The spec says "10 requests/second with burst of 20" but doesn't +clarify exact token bucket semantics: +- When the bucket is empty and a request arrives: immediately rejected (429) +- The burst parameter: maximum bucket capacity (20 tokens). A burst of 20 + means 20 requests can be accepted in a rapid succession before rate limiting + kicks in. +- `requests_per_second = 10` means 1 token is added every 100ms (not 10 tokens + every 1 second) + +These details affect how nginx `limit_req burst=20 nodelay` maps to the +implementation. The `nodelay` equivalent means requests are immediately rejected +when the bucket is empty — they are not queued. + +--- + +### S7. Explicitly Note That Config File Watching Is Out of Scope + +**Documents**: config.md (lines 173–184) + +**Suggestion**: The spec uses SIGHUP and admin socket for reload but doesn't +mention file system watching (inotify/FSEvents). This is intentional — SIGHUP +and admin socket are sufficient. Add an explicit note in config.md that +automatic file watching (inotify, fsnotify, etc.) is out of scope for Phase 1. +Config reload is triggered explicitly by SIGHUP or admin socket command. + +--- + +### S8. Specify That All Paths on Known Hosts Are Forwarded + +**Documents**: proxy.md (lines 76–83) + +**Suggestion**: The spec says the proxy "either proxies to the upstream or +returns 404" based on Host matching, but doesn't explicitly state that ALL paths +and query strings on a known host are forwarded to the upstream without +filtering. Make this explicit: "The proxy does not filter or restrict paths. All +paths and query strings on a known host are forwarded to the upstream." + +--- + +### S9. Reference `shutdown_timeout_secs` in Shutdown Description + +**Documents**: config.md (line 92), operations.md (lines 230–234) + +**Suggestion**: operations.md says "wait up to 30 seconds" but the actual value +comes from `shutdown_timeout_secs` in StaticConfig. Make the connection +explicit: "wait up to `shutdown_timeout_secs` (default: 30s) for in-flight +requests to complete." + +--- + +### S10. Add "Not in Phase 1" Sections + +**Documents**: All documents + +**Suggestion**: Several features are mentioned as out of scope (HTTP/2 proxying, +WebSocket, load balancing, per-site rate limits, per-site body limits), but +these are scattered. Add a consolidated "Not in Phase 1" section to each +document listing what is explicitly excluded, to prevent feature creep and +clarify scope for implementers. + +--- + +### S11. Add a Consolidated Defaults Table for All Config Fields + +**Documents**: config.md + +**Suggestion**: While defaults are mentioned inline, a consolidated table of all +config fields with their types, defaults, and whether they're required would +reduce implementation ambiguity. Some fields have implied defaults that aren't +explicitly stated (e.g., what's the default for `acme_directory`? Is it +`"production"`? What happens if it's omitted?). + +Create a table like: + +| Field | Type | Default | Required | +|-------|------|---------|----------| +| `allow_wildcard_bind` | `bool` | `false` | No | +| `health_check_port` | `u16` | `9900` | No | +| `admin_socket_path` | `String` | `/run/reverse-proxy/admin.sock` | No | +| `listeners[].http_port` | `u16` | `80` | No | +| `listeners[].https_port` | `u16` | `443` | No | +| `listeners[].tls.acme_directory` | `"production"` or `"staging"` | `"production"` | No | +| ... | ... | ... | ... | + +--- + +## Summary Statistics + +| Severity | Count | Status | +|----------|-------|--------| +| Critical | 7 | Needs resolution before implementation | +| Warning | 17 | Should resolve — mostly quick spec clarifications | +| Suggestion | 11 | Consider for clarity | + +## Recommended Resolution Order + +1. **Critical findings C1–C7** — These involve design decisions that would + cause divergent implementations. Resolve first. +2. **Warning findings W1–W17** — These are mostly spec clarifications that can + be resolved quickly by adding text to existing documents. +3. **Suggestions S1–S11** — These improve clarity but won't block + implementation. \ No newline at end of file