Triage implementation review findings and update architecture specs

Analyzed 29 findings from the implementation review (002-implementation-review.md)
and identified 8 architecture-level concerns requiring spec changes:

Architecture gaps addressed:
- C2: Added acme_contact field to config.md, tls.md, and operations.md.
  Let's Encrypt requires a contact email for production; the spec was missing
  this required field.
- C4: Added StaticConfig drift tracking requirement to config.md reload
  section. ConfigReloadHandle must update its stored StaticConfig after each
  successful reload to prevent stale warnings.
- W1: Updated shutdown sequence in operations.md to specify that server tasks
  should be joined (not aborted) during the drain window.
- W5: Added health check path collision note to proxy.md.
- W13: Clarified that access logging is always-on in operations.md.
- W14: Updated X-Forwarded-Proto description in proxy.md to clarify that it
  is always 'https' since the HTTP listener redirects rather than proxies.

New open questions added:
- OQ-08: Should /health use a less common path to avoid upstream collision?
- OQ-09: How should upstream_connect_timeout_secs be enforced?
- OQ-10: Should ACME contact email be a required config field?
- OQ-11: How should X-Forwarded-Proto be derived per-listener?
- OQ-12: Should request access logging be mandatory or optional?

The remaining 21 findings are implementation-level bugs, code quality issues,
or Phase 2 improvements that don't require architecture spec changes.
This commit is contained in:
2026-06-11 15:04:09 +00:00
parent 5478df7ab7
commit 68d27c4789
6 changed files with 135 additions and 10 deletions

View File

@@ -67,6 +67,11 @@ See [open-questions.md](open-questions.md) for the full tracker.
| ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) | | ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) |
| ~~OQ-06~~ | ~~Should upstream timeouts be configurable per-site?~~ | ~~low~~ | **resolved** (ADR-015) | | ~~OQ-06~~ | ~~Should upstream timeouts be configurable per-site?~~ | ~~low~~ | **resolved** (ADR-015) |
| ~~OQ-07~~ | ~~Should per-site TLS overrides be supported for mixed ACME/manual domains?~~ | ~~low~~ | **resolved** (ADR-019) | | ~~OQ-07~~ | ~~Should per-site TLS overrides be supported for mixed ACME/manual domains?~~ | ~~low~~ | **resolved** (ADR-019) |
| OQ-08 | Should the `/health` path use a less common endpoint to avoid upstream collision? | medium | open |
| OQ-09 | How should `upstream_connect_timeout_secs` be enforced? | medium | open |
| OQ-10 | Should ACME contact email be a required config field? | high | open |
| OQ-11 | How should `X-Forwarded-Proto` be derived per-listener? | medium | open |
| OQ-12 | Should request access logging be mandatory or optional? | high | open |
## Document Lifecycle ## Document Lifecycle

View File

@@ -116,6 +116,7 @@ will be handled via signal-based or built-in rotation.
| `tls.acme_domains` | `Vec<String>` | Domains for ACME SAN certificate (ACME mode only) | | `tls.acme_domains` | `Vec<String>` | Domains for ACME SAN certificate (ACME mode only) |
| `tls.acme_cache_dir` | `String` | ACME state cache directory | | `tls.acme_cache_dir` | `String` | ACME state cache directory |
| `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory | | `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory |
| `tls.acme_contact` | `String` | Contact email for ACME registration (e.g., `"mailto:admin@example.com"`). Required for production; Let's Encrypt rejects registrations without a contact email. See OQ-10. |
| `tls.cert_path` | `String` | Certificate file path (manual mode only) | | `tls.cert_path` | `String` | Certificate file path (manual mode only) |
| `tls.key_path` | `String` | Private key file path (manual mode only) | | `tls.key_path` | `String` | Private key file path (manual mode only) |
@@ -174,6 +175,7 @@ Phase 2.
| `listeners[].http_port` | `u16` | `80` | No | | `listeners[].http_port` | `u16` | `80` | No |
| `listeners[].https_port` | `u16` | `443` | No | | `listeners[].https_port` | `u16` | `443` | No |
| `listeners[].tls.acme_directory` | `String` | `"production"` | No | | `listeners[].tls.acme_directory` | `String` | `"production"` | No |
| `listeners[].tls.acme_contact` | `String` | — | Yes (ACME mode only) |
| `sites[].upstream_scheme` | `String` | `"http"` | No | | `sites[].upstream_scheme` | `String` | `"http"` | No |
| `sites[].upstream_connect_timeout_secs` | `u64` | `5` | No | | `sites[].upstream_connect_timeout_secs` | `u64` | `5` | No |
| `sites[].upstream_request_timeout_secs` | `u64` | `60` | No | | `sites[].upstream_request_timeout_secs` | `u64` | `60` | No |
@@ -236,6 +238,13 @@ effect. This gives operators early feedback about config drift.
Only the DynamicConfig portion is swapped via ArcSwap. StaticConfig changes Only the DynamicConfig portion is swapped via ArcSwap. StaticConfig changes
require a process restart to take effect. require a process restart to take effect.
**Important**: The `ConfigReloadHandle` must track the last-known StaticConfig
so that it can correctly detect changes on subsequent reloads. After each
successful reload, the stored StaticConfig is updated with the new value (via
`ArcSwap<StaticConfig>` or similar interior mutability). This prevents stale
warnings: if the same static config change is present on two consecutive
reloads, the operator should see the warning only once, not on every reload.
### Reload Serialization ### Reload Serialization
Reload operations are serialized using a `tokio::sync::Mutex` on the reload Reload operations are serialized using a `tokio::sync::Mutex` on the reload
@@ -290,6 +299,7 @@ mode = "acme"
acme_domains = ["git.alk.dev"] acme_domains = ["git.alk.dev"]
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache-git" acme_cache_dir = "/var/lib/reverse-proxy/acme-cache-git"
acme_directory = "production" acme_directory = "production"
acme_contact = "mailto:admin@alk.dev"
[[listeners.sites]] [[listeners.sites]]
host = "git.alk.dev" host = "git.alk.dev"
@@ -347,6 +357,7 @@ mode = "acme"
acme_domains = ["git.alk.dev", "alk.dev"] acme_domains = ["git.alk.dev", "alk.dev"]
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache" acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
acme_directory = "production" acme_directory = "production"
acme_contact = "mailto:admin@alk.dev"
[[listeners.sites]] [[listeners.sites]]
host = "git.alk.dev" host = "git.alk.dev"
@@ -391,6 +402,9 @@ On startup, the config is validated:
`10.0.0.5` (missing port). The `upstream_scheme` field handles the protocol. `10.0.0.5` (missing port). The `upstream_scheme` field handles the protocol.
18. `upstream_scheme` values are case-sensitive: only `"http"` or `"https"` 18. `upstream_scheme` values are case-sensitive: only `"http"` or `"https"`
(lowercase). Default is `"http"`. (lowercase). Default is `"http"`.
19. In ACME mode, `tls.acme_contact` must be a valid `mailto:` URI
(e.g., `"mailto:admin@example.com"`). Let's Encrypt requires a contact
email for production certificate requests.
On SIGHUP reload, the same validation applies. If the new config fails On SIGHUP reload, the same validation applies. If the new config fails
validation, the reload is rejected and the old config remains active. An error validation, the reload is rejected and the old config remains active. An error

View File

@@ -87,7 +87,97 @@ last_updated: 2026-06-11
- **Origin**: [proxy.md](proxy.md) - **Origin**: [proxy.md](proxy.md)
- **Status**: resolved - **Status**: resolved
- **Priority**: low - **Priority**: low
- **Resolution**: Yes. Per-site upstream timeouts with sensible defaults (5s - **Resolution**: Resolved by ADR-015. Per-site upstream timeout overrides with
connect, 60s request). Optional fields in SiteConfig that override global sensible defaults (5s connect, 60s request). Optional fields in SiteConfig
defaults when specified. See ADR-015. that override global defaults when specified.
- **Cross-references**: ADR-015, ADR-017 - **Cross-references**: ADR-015, ADR-017
### OQ-08: Should the `/health` path use a less common endpoint to avoid upstream collision?
- **Origin**: Implementation review finding W5, [proxy.md](proxy.md)
- **Status**: open
- **Priority**: medium
- **Resolution**: None yet. The proxy currently intercepts `GET /health` on all
hosts before host-based routing, which means any upstream application that
uses `/health` for its own health checks will have those requests silently
intercepted. Options: (1) Use a less common path like `/__health` or
`/healthz`; (2) Only intercept `/health` when the Host header doesn't match
any known site (fallthrough); (3) Make the health check path configurable
via `StaticConfig`. Option 1 is simplest for Phase 1. Option 3 is most
flexible long-term. The architecture spec (proxy.md, ADR-013) currently
specifies `/health` as a top-level route regardless of Host.
- **Cross-references**: ADR-013
### OQ-09: How should `upstream_connect_timeout_secs` be enforced?
- **Origin**: Implementation review finding W4, ADR-015, ADR-017
- **Status**: open
- **Priority**: medium
- **Resolution**: None yet. The architecture (ADR-015, ADR-017) specifies a
5-second default connect timeout separate from the request timeout, and
`SiteConfig` includes `upstream_connect_timeout_secs`. However, the
implementation only applies `upstream_request_timeout_secs` as a blanket
timeout covering the entire exchange. The hyper client handles TCP connect
internally, making a two-phase timeout harder to implement without custom
connect logic. Need to decide: (1) implement a two-phase timeout using
`tokio::time::timeout` for connect phase then request phase; (2) configure
the hyper client's `connect_timeout` parameter; or (3) accept the current
behavior for Phase 1 and add connect timeout enforcement in Phase 2.
- **Cross-references**: ADR-015, ADR-017
## Configuration
### OQ-10: Should ACME contact email be a required config field?
- **Origin**: Implementation review finding C2, [tls.md](tls.md), [config.md](config.md)
- **Status**: open
- **Priority**: high
- **Resolution**: None yet. Let's Encrypt requires a contact email for production
certificate requests. The current architecture spec does not include an
`acme_contact` field in `TlsConfig` or `ListenerConfig`. Without it, ACME
registration with Let's Encrypt production will fail. Options: (1) Add a
required `acme_contact` field to the TLS config within each `[[listeners]]`
entry that uses ACME mode; (2) Add a global `acme_contact` field shared
across all ACME listeners. Per-listener is more flexible but adds config
noise. Global is simpler for typical deployments. Need to update config.md
and tls.md.
- **Cross-references**: ADR-004
### OQ-11: How should `X-Forwarded-Proto` be derived per-listener?
- **Origin**: Implementation review finding W14, [proxy.md](proxy.md)
- **Status**: open
- **Priority**: medium
- **Resolution**: None yet. The architecture spec (proxy.md) states
`X-Forwarded-Proto` should be "determined by which listener port received the
request" — `https` for requests on the listener's `https_port`, `http` for
requests on the listener's `http_port`. The implementation hardcodes
`is_https: true` in `ProxyState`. For a TLS-terminating reverse proxy this
is correct (all TLS connections arrive on the HTTPS port), but the HTTP
redirect listener should set `X-Forwarded-Proto: https` since it redirects to
HTTPS. Need to clarify: (1) The HTTPS listener always sets `X-Forwarded-Proto:
https` (correct, since it terminates TLS); (2) The HTTP redirect listener
sends a 301 redirect and does NOT proxy, so `X-Forwarded-Proto` on the
redirect response is not applicable. The hardcoded behavior is correct but
should be documented.
- **Cross-references**: ADR-021
## Operations
### OQ-12: Should request access logging be mandatory or optional?
- **Origin**: Implementation review finding W13, [operations.md](operations.md)
- **Status**: open
- **Priority**: high
- **Resolution**: None yet. The architecture spec (operations.md) defines an
access log format (`REQUEST client_ip=... host=... method=... path=...
status=... upstream=... duration_ms=...`) and a `log_request!` macro, but
the implementation does not emit access logs. Without request-level logging,
the proxy is operationally blind — there is no observability into traffic,
response codes, or upstream latency. This also blocks fail2ban integration
for access-log-based jails. The question is whether to: (1) Make access
logging mandatory (always-on at `info` level); (2) Make it configurable
(e.g., `access_log` boolean in `LoggingConfig`); or (3) Tie it to the
existing `log_file_path` setting. The architecture spec implies it's always
on.
- **Cross-references**: ADR-007

View File

@@ -107,7 +107,10 @@ All logs use `tracing` with structured fields. The proxy outputs two types of
log entries: log entries:
1. **Access logs**: Every proxied request is logged at `info` level with 1. **Access logs**: Every proxied request is logged at `info` level with
structured fields. structured fields. Access logging is **always-on** — it is the primary
observability mechanism for the proxy and is required for fail2ban
integration. There is no configuration option to disable access logging
(see OQ-12).
``` ```
REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45 REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
@@ -293,7 +296,9 @@ On SIGTERM or SIGINT, the proxy performs a graceful shutdown:
2. **Close idle keep-alive connections** — Send `Connection: close` on any idle 2. **Close idle keep-alive connections** — Send `Connection: close` on any idle
connections in the keep-alive pool. connections in the keep-alive pool.
3. **Wait for in-flight requests** — Up to `shutdown_timeout_secs` (default: 30) 3. **Wait for in-flight requests** — Up to `shutdown_timeout_secs` (default: 30)
for active requests to complete. for active requests to complete. Server tasks are joined (not aborted) so
that in-flight requests can drain normally. Only after the timeout expires
are remaining tasks aborted.
4. **Force-close remaining connections** — After the timeout, any remaining 4. **Force-close remaining connections** — After the timeout, any remaining
connections are forcefully closed via TCP RST. connections are forcefully closed via TCP RST.
5. **Cancel background tasks** — ACME renewal tasks, rate limiter eviction task, 5. **Cancel background tasks** — ACME renewal tasks, rate limiter eviction task,
@@ -477,6 +482,7 @@ mode = "acme"
acme_domains = ["git.example.com"] acme_domains = ["git.example.com"]
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache" acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
acme_directory = "production" acme_directory = "production"
acme_contact = "mailto:admin@example.com"
[[listeners.sites]] [[listeners.sites]]
host = "git.example.com" host = "git.example.com"

View File

@@ -95,6 +95,12 @@ The `/health` path is a special case: it matches regardless of the `Host`
header and is evaluated before host-based routing. A `GET /health` request on header and is evaluated before host-based routing. A `GET /health` request on
any hostname returns `200 OK` with an empty body. any hostname returns `200 OK` with an empty body.
**Note**: This means any upstream application that uses `/health` for its own
health checks will have those requests silently intercepted by the proxy and
will never reach the upstream. If this is a concern, the health check path
should be changed to a less common path (e.g., `/__health` or `/healthz`) or
made configurable. See OQ-08.
### 2. Proxy Header Injection ### 2. Proxy Header Injection
Headers are injected before forwarding. The proxy is an **edge proxy** — it Headers are injected before forwarding. The proxy is an **edge proxy** — it
@@ -107,7 +113,7 @@ existing `X-Forwarded-For` headers from the client cannot be trusted.
| `Host` | Original request `Host` header | Preserved as-is | | `Host` | Original request `Host` header | Preserved as-is |
| `X-Real-IP` | `ConnectInfo<SocketAddr>` remote IP | Set to client's IP address | | `X-Real-IP` | `ConnectInfo<SocketAddr>` remote IP | Set to client's IP address |
| `X-Forwarded-For` | `ConnectInfo<SocketAddr>` remote IP | **Replaced**, not appended. The proxy is the edge proxy — there are no trusted proxies upstream, so existing `X-Forwarded-For` values from the client cannot be trusted. | | `X-Forwarded-For` | `ConnectInfo<SocketAddr>` remote IP | **Replaced**, not appended. The proxy is the edge proxy — there are no trusted proxies upstream, so existing `X-Forwarded-For` values from the client cannot be trusted. |
| `X-Forwarded-Proto` | Determined by which listener port received the request | `https` for requests on the listener's `https_port`, `http` for requests on the listener's `http_port` | | `X-Forwarded-Proto` | Determined by which listener port received the request | `https` for requests on the listener's `https_port`, `http` for requests on the listener's `http_port`. Note: since the TLS-terminating listener only receives HTTPS connections, this is always `"https"` in practice. The HTTP redirect listener sends a 301 redirect rather than proxying, so `X-Forwarded-Proto` is not set there. See OQ-11. |
**ConnectInfo propagation**: `ConnectInfo<SocketAddr>` is populated by **ConnectInfo propagation**: `ConnectInfo<SocketAddr>` is populated by
extracting `TcpStream::peer_addr()` before wrapping the connection in extracting `TcpStream::peer_addr()` before wrapping the connection in

View File

@@ -77,16 +77,19 @@ no deploy hooks.
listed in that listener's `acme_domains`. Let's Encrypt will issue a listed in that listener's `acme_domains`. Let's Encrypt will issue a
certificate covering those domains (a single SAN certificate or a certificate covering those domains (a single SAN certificate or a
single-domain certificate, depending on how many domains are listed). single-domain certificate, depending on how many domains are listed).
3. The ACME state machine runs as a background tokio task per listener, 3. The `acme_contact` field provides a contact email address (as a `mailto:`
URI) required by Let's Encrypt for production certificate requests. Without
a contact email, Let's Encrypt production API returns a 400-level error.
4. The ACME state machine runs as a background tokio task per listener,
handling: handling:
- Account registration with Let's Encrypt - Account registration with Let's Encrypt
- Certificate ordering - Certificate ordering
- TLS-ALPN-01 challenge (or HTTP-01 challenge) - TLS-ALPN-01 challenge (or HTTP-01 challenge)
- Certificate issuance - Certificate issuance
- Certificate renewal (automatic, ~30 days before expiry) - Certificate renewal (automatic, ~30 days before expiry)
4. `ResolvesServerCertAcme` is a rustls `ResolvesServerCert` implementation 5. `ResolvesServerCertAcme` is a rustls `ResolvesServerCert` implementation
that automatically serves the ACME-provisioned certificate. that automatically serves the ACME-provisioned certificate.
5. When a new certificate is issued, the resolver updates atomically — no 6. When a new certificate is issued, the resolver updates atomically — no
restart or signal handling needed. restart or signal handling needed.
**Configuration (within a `[[listeners]]` entry):** **Configuration (within a `[[listeners]]` entry):**
@@ -100,6 +103,7 @@ mode = "acme"
acme_domains = ["git.alk.dev", "alk.dev"] acme_domains = ["git.alk.dev", "alk.dev"]
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache" acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
acme_directory = "production" # or "staging" for testing acme_directory = "production" # or "staging" for testing
acme_contact = "mailto:admin@alk.dev" # Required for Let's Encrypt production
``` ```
**Cache directory:** The `DirCache` from rustls-acme persists ACME account data, **Cache directory:** The `DirCache` from rustls-acme persists ACME account data,