Triage implementation review findings and update architecture specs
Analyzed 29 findings from the implementation review (002-implementation-review.md) and identified 8 architecture-level concerns requiring spec changes: Architecture gaps addressed: - C2: Added acme_contact field to config.md, tls.md, and operations.md. Let's Encrypt requires a contact email for production; the spec was missing this required field. - C4: Added StaticConfig drift tracking requirement to config.md reload section. ConfigReloadHandle must update its stored StaticConfig after each successful reload to prevent stale warnings. - W1: Updated shutdown sequence in operations.md to specify that server tasks should be joined (not aborted) during the drain window. - W5: Added health check path collision note to proxy.md. - W13: Clarified that access logging is always-on in operations.md. - W14: Updated X-Forwarded-Proto description in proxy.md to clarify that it is always 'https' since the HTTP listener redirects rather than proxies. New open questions added: - OQ-08: Should /health use a less common path to avoid upstream collision? - OQ-09: How should upstream_connect_timeout_secs be enforced? - OQ-10: Should ACME contact email be a required config field? - OQ-11: How should X-Forwarded-Proto be derived per-listener? - OQ-12: Should request access logging be mandatory or optional? The remaining 21 findings are implementation-level bugs, code quality issues, or Phase 2 improvements that don't require architecture spec changes.
This commit is contained in:
@@ -67,6 +67,11 @@ See [open-questions.md](open-questions.md) for the full tracker.
|
|||||||
| ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) |
|
| ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) |
|
||||||
| ~~OQ-06~~ | ~~Should upstream timeouts be configurable per-site?~~ | ~~low~~ | **resolved** (ADR-015) |
|
| ~~OQ-06~~ | ~~Should upstream timeouts be configurable per-site?~~ | ~~low~~ | **resolved** (ADR-015) |
|
||||||
| ~~OQ-07~~ | ~~Should per-site TLS overrides be supported for mixed ACME/manual domains?~~ | ~~low~~ | **resolved** (ADR-019) |
|
| ~~OQ-07~~ | ~~Should per-site TLS overrides be supported for mixed ACME/manual domains?~~ | ~~low~~ | **resolved** (ADR-019) |
|
||||||
|
| OQ-08 | Should the `/health` path use a less common endpoint to avoid upstream collision? | medium | open |
|
||||||
|
| OQ-09 | How should `upstream_connect_timeout_secs` be enforced? | medium | open |
|
||||||
|
| OQ-10 | Should ACME contact email be a required config field? | high | open |
|
||||||
|
| OQ-11 | How should `X-Forwarded-Proto` be derived per-listener? | medium | open |
|
||||||
|
| OQ-12 | Should request access logging be mandatory or optional? | high | open |
|
||||||
|
|
||||||
## Document Lifecycle
|
## Document Lifecycle
|
||||||
|
|
||||||
|
|||||||
@@ -116,6 +116,7 @@ will be handled via signal-based or built-in rotation.
|
|||||||
| `tls.acme_domains` | `Vec<String>` | Domains for ACME SAN certificate (ACME mode only) |
|
| `tls.acme_domains` | `Vec<String>` | Domains for ACME SAN certificate (ACME mode only) |
|
||||||
| `tls.acme_cache_dir` | `String` | ACME state cache directory |
|
| `tls.acme_cache_dir` | `String` | ACME state cache directory |
|
||||||
| `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory |
|
| `tls.acme_directory` | `"production"` or `"staging"` | Let's Encrypt directory |
|
||||||
|
| `tls.acme_contact` | `String` | Contact email for ACME registration (e.g., `"mailto:admin@example.com"`). Required for production; Let's Encrypt rejects registrations without a contact email. See OQ-10. |
|
||||||
| `tls.cert_path` | `String` | Certificate file path (manual mode only) |
|
| `tls.cert_path` | `String` | Certificate file path (manual mode only) |
|
||||||
| `tls.key_path` | `String` | Private key file path (manual mode only) |
|
| `tls.key_path` | `String` | Private key file path (manual mode only) |
|
||||||
|
|
||||||
@@ -174,6 +175,7 @@ Phase 2.
|
|||||||
| `listeners[].http_port` | `u16` | `80` | No |
|
| `listeners[].http_port` | `u16` | `80` | No |
|
||||||
| `listeners[].https_port` | `u16` | `443` | No |
|
| `listeners[].https_port` | `u16` | `443` | No |
|
||||||
| `listeners[].tls.acme_directory` | `String` | `"production"` | No |
|
| `listeners[].tls.acme_directory` | `String` | `"production"` | No |
|
||||||
|
| `listeners[].tls.acme_contact` | `String` | — | Yes (ACME mode only) |
|
||||||
| `sites[].upstream_scheme` | `String` | `"http"` | No |
|
| `sites[].upstream_scheme` | `String` | `"http"` | No |
|
||||||
| `sites[].upstream_connect_timeout_secs` | `u64` | `5` | No |
|
| `sites[].upstream_connect_timeout_secs` | `u64` | `5` | No |
|
||||||
| `sites[].upstream_request_timeout_secs` | `u64` | `60` | No |
|
| `sites[].upstream_request_timeout_secs` | `u64` | `60` | No |
|
||||||
@@ -236,6 +238,13 @@ effect. This gives operators early feedback about config drift.
|
|||||||
Only the DynamicConfig portion is swapped via ArcSwap. StaticConfig changes
|
Only the DynamicConfig portion is swapped via ArcSwap. StaticConfig changes
|
||||||
require a process restart to take effect.
|
require a process restart to take effect.
|
||||||
|
|
||||||
|
**Important**: The `ConfigReloadHandle` must track the last-known StaticConfig
|
||||||
|
so that it can correctly detect changes on subsequent reloads. After each
|
||||||
|
successful reload, the stored StaticConfig is updated with the new value (via
|
||||||
|
`ArcSwap<StaticConfig>` or similar interior mutability). This prevents stale
|
||||||
|
warnings: if the same static config change is present on two consecutive
|
||||||
|
reloads, the operator should see the warning only once, not on every reload.
|
||||||
|
|
||||||
### Reload Serialization
|
### Reload Serialization
|
||||||
|
|
||||||
Reload operations are serialized using a `tokio::sync::Mutex` on the reload
|
Reload operations are serialized using a `tokio::sync::Mutex` on the reload
|
||||||
@@ -290,6 +299,7 @@ mode = "acme"
|
|||||||
acme_domains = ["git.alk.dev"]
|
acme_domains = ["git.alk.dev"]
|
||||||
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache-git"
|
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache-git"
|
||||||
acme_directory = "production"
|
acme_directory = "production"
|
||||||
|
acme_contact = "mailto:admin@alk.dev"
|
||||||
|
|
||||||
[[listeners.sites]]
|
[[listeners.sites]]
|
||||||
host = "git.alk.dev"
|
host = "git.alk.dev"
|
||||||
@@ -347,6 +357,7 @@ mode = "acme"
|
|||||||
acme_domains = ["git.alk.dev", "alk.dev"]
|
acme_domains = ["git.alk.dev", "alk.dev"]
|
||||||
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
|
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
|
||||||
acme_directory = "production"
|
acme_directory = "production"
|
||||||
|
acme_contact = "mailto:admin@alk.dev"
|
||||||
|
|
||||||
[[listeners.sites]]
|
[[listeners.sites]]
|
||||||
host = "git.alk.dev"
|
host = "git.alk.dev"
|
||||||
@@ -391,6 +402,9 @@ On startup, the config is validated:
|
|||||||
`10.0.0.5` (missing port). The `upstream_scheme` field handles the protocol.
|
`10.0.0.5` (missing port). The `upstream_scheme` field handles the protocol.
|
||||||
18. `upstream_scheme` values are case-sensitive: only `"http"` or `"https"`
|
18. `upstream_scheme` values are case-sensitive: only `"http"` or `"https"`
|
||||||
(lowercase). Default is `"http"`.
|
(lowercase). Default is `"http"`.
|
||||||
|
19. In ACME mode, `tls.acme_contact` must be a valid `mailto:` URI
|
||||||
|
(e.g., `"mailto:admin@example.com"`). Let's Encrypt requires a contact
|
||||||
|
email for production certificate requests.
|
||||||
|
|
||||||
On SIGHUP reload, the same validation applies. If the new config fails
|
On SIGHUP reload, the same validation applies. If the new config fails
|
||||||
validation, the reload is rejected and the old config remains active. An error
|
validation, the reload is rejected and the old config remains active. An error
|
||||||
|
|||||||
@@ -87,7 +87,97 @@ last_updated: 2026-06-11
|
|||||||
- **Origin**: [proxy.md](proxy.md)
|
- **Origin**: [proxy.md](proxy.md)
|
||||||
- **Status**: resolved
|
- **Status**: resolved
|
||||||
- **Priority**: low
|
- **Priority**: low
|
||||||
- **Resolution**: Yes. Per-site upstream timeouts with sensible defaults (5s
|
- **Resolution**: Resolved by ADR-015. Per-site upstream timeout overrides with
|
||||||
connect, 60s request). Optional fields in SiteConfig that override global
|
sensible defaults (5s connect, 60s request). Optional fields in SiteConfig
|
||||||
defaults when specified. See ADR-015.
|
that override global defaults when specified.
|
||||||
- **Cross-references**: ADR-015, ADR-017
|
- **Cross-references**: ADR-015, ADR-017
|
||||||
|
|
||||||
|
### OQ-08: Should the `/health` path use a less common endpoint to avoid upstream collision?
|
||||||
|
|
||||||
|
- **Origin**: Implementation review finding W5, [proxy.md](proxy.md)
|
||||||
|
- **Status**: open
|
||||||
|
- **Priority**: medium
|
||||||
|
- **Resolution**: None yet. The proxy currently intercepts `GET /health` on all
|
||||||
|
hosts before host-based routing, which means any upstream application that
|
||||||
|
uses `/health` for its own health checks will have those requests silently
|
||||||
|
intercepted. Options: (1) Use a less common path like `/__health` or
|
||||||
|
`/healthz`; (2) Only intercept `/health` when the Host header doesn't match
|
||||||
|
any known site (fallthrough); (3) Make the health check path configurable
|
||||||
|
via `StaticConfig`. Option 1 is simplest for Phase 1. Option 3 is most
|
||||||
|
flexible long-term. The architecture spec (proxy.md, ADR-013) currently
|
||||||
|
specifies `/health` as a top-level route regardless of Host.
|
||||||
|
- **Cross-references**: ADR-013
|
||||||
|
|
||||||
|
### OQ-09: How should `upstream_connect_timeout_secs` be enforced?
|
||||||
|
|
||||||
|
- **Origin**: Implementation review finding W4, ADR-015, ADR-017
|
||||||
|
- **Status**: open
|
||||||
|
- **Priority**: medium
|
||||||
|
- **Resolution**: None yet. The architecture (ADR-015, ADR-017) specifies a
|
||||||
|
5-second default connect timeout separate from the request timeout, and
|
||||||
|
`SiteConfig` includes `upstream_connect_timeout_secs`. However, the
|
||||||
|
implementation only applies `upstream_request_timeout_secs` as a blanket
|
||||||
|
timeout covering the entire exchange. The hyper client handles TCP connect
|
||||||
|
internally, making a two-phase timeout harder to implement without custom
|
||||||
|
connect logic. Need to decide: (1) implement a two-phase timeout using
|
||||||
|
`tokio::time::timeout` for connect phase then request phase; (2) configure
|
||||||
|
the hyper client's `connect_timeout` parameter; or (3) accept the current
|
||||||
|
behavior for Phase 1 and add connect timeout enforcement in Phase 2.
|
||||||
|
- **Cross-references**: ADR-015, ADR-017
|
||||||
|
|
||||||
|
## Configuration
|
||||||
|
|
||||||
|
### OQ-10: Should ACME contact email be a required config field?
|
||||||
|
|
||||||
|
- **Origin**: Implementation review finding C2, [tls.md](tls.md), [config.md](config.md)
|
||||||
|
- **Status**: open
|
||||||
|
- **Priority**: high
|
||||||
|
- **Resolution**: None yet. Let's Encrypt requires a contact email for production
|
||||||
|
certificate requests. The current architecture spec does not include an
|
||||||
|
`acme_contact` field in `TlsConfig` or `ListenerConfig`. Without it, ACME
|
||||||
|
registration with Let's Encrypt production will fail. Options: (1) Add a
|
||||||
|
required `acme_contact` field to the TLS config within each `[[listeners]]`
|
||||||
|
entry that uses ACME mode; (2) Add a global `acme_contact` field shared
|
||||||
|
across all ACME listeners. Per-listener is more flexible but adds config
|
||||||
|
noise. Global is simpler for typical deployments. Need to update config.md
|
||||||
|
and tls.md.
|
||||||
|
- **Cross-references**: ADR-004
|
||||||
|
|
||||||
|
### OQ-11: How should `X-Forwarded-Proto` be derived per-listener?
|
||||||
|
|
||||||
|
- **Origin**: Implementation review finding W14, [proxy.md](proxy.md)
|
||||||
|
- **Status**: open
|
||||||
|
- **Priority**: medium
|
||||||
|
- **Resolution**: None yet. The architecture spec (proxy.md) states
|
||||||
|
`X-Forwarded-Proto` should be "determined by which listener port received the
|
||||||
|
request" — `https` for requests on the listener's `https_port`, `http` for
|
||||||
|
requests on the listener's `http_port`. The implementation hardcodes
|
||||||
|
`is_https: true` in `ProxyState`. For a TLS-terminating reverse proxy this
|
||||||
|
is correct (all TLS connections arrive on the HTTPS port), but the HTTP
|
||||||
|
redirect listener should set `X-Forwarded-Proto: https` since it redirects to
|
||||||
|
HTTPS. Need to clarify: (1) The HTTPS listener always sets `X-Forwarded-Proto:
|
||||||
|
https` (correct, since it terminates TLS); (2) The HTTP redirect listener
|
||||||
|
sends a 301 redirect and does NOT proxy, so `X-Forwarded-Proto` on the
|
||||||
|
redirect response is not applicable. The hardcoded behavior is correct but
|
||||||
|
should be documented.
|
||||||
|
- **Cross-references**: ADR-021
|
||||||
|
|
||||||
|
## Operations
|
||||||
|
|
||||||
|
### OQ-12: Should request access logging be mandatory or optional?
|
||||||
|
|
||||||
|
- **Origin**: Implementation review finding W13, [operations.md](operations.md)
|
||||||
|
- **Status**: open
|
||||||
|
- **Priority**: high
|
||||||
|
- **Resolution**: None yet. The architecture spec (operations.md) defines an
|
||||||
|
access log format (`REQUEST client_ip=... host=... method=... path=...
|
||||||
|
status=... upstream=... duration_ms=...`) and a `log_request!` macro, but
|
||||||
|
the implementation does not emit access logs. Without request-level logging,
|
||||||
|
the proxy is operationally blind — there is no observability into traffic,
|
||||||
|
response codes, or upstream latency. This also blocks fail2ban integration
|
||||||
|
for access-log-based jails. The question is whether to: (1) Make access
|
||||||
|
logging mandatory (always-on at `info` level); (2) Make it configurable
|
||||||
|
(e.g., `access_log` boolean in `LoggingConfig`); or (3) Tie it to the
|
||||||
|
existing `log_file_path` setting. The architecture spec implies it's always
|
||||||
|
on.
|
||||||
|
- **Cross-references**: ADR-007
|
||||||
@@ -107,7 +107,10 @@ All logs use `tracing` with structured fields. The proxy outputs two types of
|
|||||||
log entries:
|
log entries:
|
||||||
|
|
||||||
1. **Access logs**: Every proxied request is logged at `info` level with
|
1. **Access logs**: Every proxied request is logged at `info` level with
|
||||||
structured fields.
|
structured fields. Access logging is **always-on** — it is the primary
|
||||||
|
observability mechanism for the proxy and is required for fail2ban
|
||||||
|
integration. There is no configuration option to disable access logging
|
||||||
|
(see OQ-12).
|
||||||
|
|
||||||
```
|
```
|
||||||
REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
|
REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
|
||||||
@@ -293,7 +296,9 @@ On SIGTERM or SIGINT, the proxy performs a graceful shutdown:
|
|||||||
2. **Close idle keep-alive connections** — Send `Connection: close` on any idle
|
2. **Close idle keep-alive connections** — Send `Connection: close` on any idle
|
||||||
connections in the keep-alive pool.
|
connections in the keep-alive pool.
|
||||||
3. **Wait for in-flight requests** — Up to `shutdown_timeout_secs` (default: 30)
|
3. **Wait for in-flight requests** — Up to `shutdown_timeout_secs` (default: 30)
|
||||||
for active requests to complete.
|
for active requests to complete. Server tasks are joined (not aborted) so
|
||||||
|
that in-flight requests can drain normally. Only after the timeout expires
|
||||||
|
are remaining tasks aborted.
|
||||||
4. **Force-close remaining connections** — After the timeout, any remaining
|
4. **Force-close remaining connections** — After the timeout, any remaining
|
||||||
connections are forcefully closed via TCP RST.
|
connections are forcefully closed via TCP RST.
|
||||||
5. **Cancel background tasks** — ACME renewal tasks, rate limiter eviction task,
|
5. **Cancel background tasks** — ACME renewal tasks, rate limiter eviction task,
|
||||||
@@ -477,6 +482,7 @@ mode = "acme"
|
|||||||
acme_domains = ["git.example.com"]
|
acme_domains = ["git.example.com"]
|
||||||
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
|
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
|
||||||
acme_directory = "production"
|
acme_directory = "production"
|
||||||
|
acme_contact = "mailto:admin@example.com"
|
||||||
|
|
||||||
[[listeners.sites]]
|
[[listeners.sites]]
|
||||||
host = "git.example.com"
|
host = "git.example.com"
|
||||||
|
|||||||
@@ -95,6 +95,12 @@ The `/health` path is a special case: it matches regardless of the `Host`
|
|||||||
header and is evaluated before host-based routing. A `GET /health` request on
|
header and is evaluated before host-based routing. A `GET /health` request on
|
||||||
any hostname returns `200 OK` with an empty body.
|
any hostname returns `200 OK` with an empty body.
|
||||||
|
|
||||||
|
**Note**: This means any upstream application that uses `/health` for its own
|
||||||
|
health checks will have those requests silently intercepted by the proxy and
|
||||||
|
will never reach the upstream. If this is a concern, the health check path
|
||||||
|
should be changed to a less common path (e.g., `/__health` or `/healthz`) or
|
||||||
|
made configurable. See OQ-08.
|
||||||
|
|
||||||
### 2. Proxy Header Injection
|
### 2. Proxy Header Injection
|
||||||
|
|
||||||
Headers are injected before forwarding. The proxy is an **edge proxy** — it
|
Headers are injected before forwarding. The proxy is an **edge proxy** — it
|
||||||
@@ -107,7 +113,7 @@ existing `X-Forwarded-For` headers from the client cannot be trusted.
|
|||||||
| `Host` | Original request `Host` header | Preserved as-is |
|
| `Host` | Original request `Host` header | Preserved as-is |
|
||||||
| `X-Real-IP` | `ConnectInfo<SocketAddr>` remote IP | Set to client's IP address |
|
| `X-Real-IP` | `ConnectInfo<SocketAddr>` remote IP | Set to client's IP address |
|
||||||
| `X-Forwarded-For` | `ConnectInfo<SocketAddr>` remote IP | **Replaced**, not appended. The proxy is the edge proxy — there are no trusted proxies upstream, so existing `X-Forwarded-For` values from the client cannot be trusted. |
|
| `X-Forwarded-For` | `ConnectInfo<SocketAddr>` remote IP | **Replaced**, not appended. The proxy is the edge proxy — there are no trusted proxies upstream, so existing `X-Forwarded-For` values from the client cannot be trusted. |
|
||||||
| `X-Forwarded-Proto` | Determined by which listener port received the request | `https` for requests on the listener's `https_port`, `http` for requests on the listener's `http_port` |
|
| `X-Forwarded-Proto` | Determined by which listener port received the request | `https` for requests on the listener's `https_port`, `http` for requests on the listener's `http_port`. Note: since the TLS-terminating listener only receives HTTPS connections, this is always `"https"` in practice. The HTTP redirect listener sends a 301 redirect rather than proxying, so `X-Forwarded-Proto` is not set there. See OQ-11. |
|
||||||
|
|
||||||
**ConnectInfo propagation**: `ConnectInfo<SocketAddr>` is populated by
|
**ConnectInfo propagation**: `ConnectInfo<SocketAddr>` is populated by
|
||||||
extracting `TcpStream::peer_addr()` before wrapping the connection in
|
extracting `TcpStream::peer_addr()` before wrapping the connection in
|
||||||
|
|||||||
@@ -77,16 +77,19 @@ no deploy hooks.
|
|||||||
listed in that listener's `acme_domains`. Let's Encrypt will issue a
|
listed in that listener's `acme_domains`. Let's Encrypt will issue a
|
||||||
certificate covering those domains (a single SAN certificate or a
|
certificate covering those domains (a single SAN certificate or a
|
||||||
single-domain certificate, depending on how many domains are listed).
|
single-domain certificate, depending on how many domains are listed).
|
||||||
3. The ACME state machine runs as a background tokio task per listener,
|
3. The `acme_contact` field provides a contact email address (as a `mailto:`
|
||||||
|
URI) required by Let's Encrypt for production certificate requests. Without
|
||||||
|
a contact email, Let's Encrypt production API returns a 400-level error.
|
||||||
|
4. The ACME state machine runs as a background tokio task per listener,
|
||||||
handling:
|
handling:
|
||||||
- Account registration with Let's Encrypt
|
- Account registration with Let's Encrypt
|
||||||
- Certificate ordering
|
- Certificate ordering
|
||||||
- TLS-ALPN-01 challenge (or HTTP-01 challenge)
|
- TLS-ALPN-01 challenge (or HTTP-01 challenge)
|
||||||
- Certificate issuance
|
- Certificate issuance
|
||||||
- Certificate renewal (automatic, ~30 days before expiry)
|
- Certificate renewal (automatic, ~30 days before expiry)
|
||||||
4. `ResolvesServerCertAcme` is a rustls `ResolvesServerCert` implementation
|
5. `ResolvesServerCertAcme` is a rustls `ResolvesServerCert` implementation
|
||||||
that automatically serves the ACME-provisioned certificate.
|
that automatically serves the ACME-provisioned certificate.
|
||||||
5. When a new certificate is issued, the resolver updates atomically — no
|
6. When a new certificate is issued, the resolver updates atomically — no
|
||||||
restart or signal handling needed.
|
restart or signal handling needed.
|
||||||
|
|
||||||
**Configuration (within a `[[listeners]]` entry):**
|
**Configuration (within a `[[listeners]]` entry):**
|
||||||
@@ -100,6 +103,7 @@ mode = "acme"
|
|||||||
acme_domains = ["git.alk.dev", "alk.dev"]
|
acme_domains = ["git.alk.dev", "alk.dev"]
|
||||||
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
|
acme_cache_dir = "/var/lib/reverse-proxy/acme-cache"
|
||||||
acme_directory = "production" # or "staging" for testing
|
acme_directory = "production" # or "staging" for testing
|
||||||
|
acme_contact = "mailto:admin@alk.dev" # Required for Let's Encrypt production
|
||||||
```
|
```
|
||||||
|
|
||||||
**Cache directory:** The `DirCache` from rustls-acme persists ACME account data,
|
**Cache directory:** The `DirCache` from rustls-acme persists ACME account data,
|
||||||
|
|||||||
Reference in New Issue
Block a user