Architecture updates based on gaps discovered during live deployment testing:
- ADR-023: HTTP/2 client-facing support via ALPN-based protocol detection.
The spec previously said HTTP/2 was out of scope, but the deployment
revealed that modern browsers negotiate HTTP/2 via ALPN. The proxy now
correctly detects the negotiated ALPN protocol and uses the appropriate
HTTP server builder (http2::Builder for h2, auto::Builder for http/1.1).
Upstream connections remain HTTP/1.1. Host resolution now falls back to
URI host for HTTP/2 :authority pseudo-headers.
- ADR-024: ANSI-disabled logging. All tracing-subscriber layers now use
with_ansi(false) to prevent ANSI escape codes in log output, which broke
fail2ban regex matching in Docker deployments. Also documents the fail2ban
regex anchor fix (^RATE_LIMIT → RATE_LIMIT).
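
The ALPN-based selection from ADR-023 can be sketched roughly as below.
This is a minimal illustration, not the code in server.rs: ServerProtocol
and select_protocol are hypothetical names, and the input is assumed to
be the negotiated ALPN value as an optional byte slice (the shape that
rustls's alpn_protocol() returns).

```rust
/// Which HTTP server builder a connection should be served with.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq, Eq)]
enum ServerProtocol {
    /// Serve with hyper's http2::Builder.
    Http2,
    /// Serve with auto::Builder (HTTP/1.1 path).
    Http1,
}

/// Chooses the protocol from the ALPN value negotiated during the TLS
/// handshake. `None` means no ALPN protocol was negotiated.
fn select_protocol(negotiated_alpn: Option<&[u8]>) -> ServerProtocol {
    match negotiated_alpn {
        Some(b"h2") => ServerProtocol::Http2,
        // "http/1.1", an unknown value, or no ALPN at all: use the H1 path.
        _ => ServerProtocol::Http1,
    }
}
```

Treating everything other than h2 as HTTP/1.1 is an assumption of this
sketch; it keeps clients that send no ALPN extension working.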
Bug fixes found by architecture review:
- Fix missing ALPN protocols in manual TLS mode. build_manual_server_config
and build_multi_domain_server_config did not set alpn_protocols, meaning
manual TLS mode could not support HTTP/2. Added h2 and http/1.1 ALPN
entries to both functions (acme-tls/1 only in ACME mode).
- Fix missing with_ansi(false) in JSON log format. The init_json function
with file output did not disable ANSI on stdout or file layers, which would
break fail2ban in production JSON logging mode.
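
The per-mode ALPN lists behind the first fix could look something like
the sketch below. TlsModeSketch and alpn_protocols_for are hypothetical
names, and the ordering of the entries is an assumption; the only facts
taken from this change are that both modes advertise h2 and http/1.1 and
that acme-tls/1 appears only in ACME mode. The Vec<Vec<u8>> shape matches
the alpn_protocols field on rustls's ServerConfig.

```rust
/// Hypothetical stand-in for the project's TLS mode enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TlsModeSketch {
    Acme,
    Manual,
}

/// Builds the ALPN protocol list to advertise for a given TLS mode.
fn alpn_protocols_for(mode: TlsModeSketch) -> Vec<Vec<u8>> {
    // Both modes advertise HTTP/2 and HTTP/1.1 to clients.
    let mut protocols = vec![b"h2".to_vec(), b"http/1.1".to_vec()];
    if mode == TlsModeSketch::Acme {
        // Only ACME mode needs the TLS-ALPN-01 challenge protocol.
        protocols.push(b"acme-tls/1".to_vec());
    }
    protocols
}
```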
Other spec updates:
- All document statuses updated from draft to reviewed
- proxy.md: documented Server header removal, upstream HTTPS client,
two-phase timeout enforcement, HTTP/2 host resolution, connect timeout
- tls.md: documented ALPN configuration differing by mode (ACME vs manual)
- overview.md: added HTTP/2 client-facing support to scope, updated crate
deps (hyper-rustls, rustls-native-certs, hyper-util), clarified out-of-scope
- config.md: fixed http_port type (u16→u32) to match implementation, added
ANSI-disabled note for LoggingConfig
- operations.md: documented ANSI-disabled logging, fail2ban regex anchor
- open-questions.md: updated OQ-09 resolution (connect timeout fully
implemented), OQ-10 (C2 bug is fixed)
- Add with_ansi(false) to all tracing_subscriber fmt layers so log
output (both stdout and file) is plain text without escape codes.
This is critical for Docker deployments and fail2ban log parsing.
- Remove ^ anchor from fail2ban failregex since log lines have a
timestamp/level prefix before RATE_LIMIT.
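
To illustrate why the anchor had to go, the sketch below emulates the two
patterns with plain string checks. SAMPLE_LINE is a hypothetical log line
whose layout (timestamp, level, then message) is assumed for the example;
it is not copied from real output.

```rust
/// Hypothetical log line: the prefix layout is an assumption.
const SAMPLE_LINE: &str = "2024-01-01T00:00:00Z  WARN RATE_LIMIT ip=203.0.113.7";

/// Emulates the anchored pattern `^RATE_LIMIT`: matches only at line start.
fn matches_anchored(line: &str) -> bool {
    line.starts_with("RATE_LIMIT")
}

/// Emulates the unanchored pattern `RATE_LIMIT`: matches anywhere in the line.
fn matches_unanchored(line: &str) -> bool {
    line.contains("RATE_LIMIT")
}
```

With any prefix before RATE_LIMIT, the anchored check fails and the
unanchored one succeeds.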
Two changes to properly support HTTP/2 clients:
1. server.rs: Detect ALPN protocol after TLS handshake and use
hyper::server::conn::http2::Builder for H2 connections instead
of auto::Builder, which failed to detect HTTP/2 over TLS.
The auto::Builder's ReadVersion mechanism doesn't work reliably
with tokio-rustls TlsStreams. For H1 connections, continue using
auto::Builder with upgrade support.
2. handler.rs: Fall back to URI host when Host header is missing.
In HTTP/2, the host is conveyed via :authority pseudo-header which
hyper represents as the URI host, not a Host header.
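
The host fallback in handler.rs amounts to something like the following
sketch. resolve_host is a hypothetical helper that takes the two
candidate values as plain strings; the real handler presumably reads them
from the request's headers and URI.

```rust
/// Picks the request host: the Host header when present (HTTP/1.1),
/// otherwise the URI host, which is where an HTTP/2 :authority
/// pseudo-header ends up.
fn resolve_host(host_header: Option<&str>, uri_host: Option<&str>) -> Option<String> {
    host_header.or(uri_host).map(str::to_owned)
}
```

Preferring the Host header keeps HTTP/1.1 behaviour unchanged; the URI
host is consulted only when the header is absent.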
All #[allow(dead_code)] annotations on now-used items have been removed
(acceptor.rs, acme.rs, config.rs, static_config.rs). #[non_exhaustive]
added to TlsMode, ProxyError, AdminSocketError, and ValidationError
with wildcard match arms in main.rs for the non-exhaustive enums.
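
As a small illustration of the wildcard arms, consider the sketch below.
The variant names and the uses_acme helper are assumptions made for the
example. Note that #[non_exhaustive] only forces a wildcard arm on
matches in other crates (such as a binary's main.rs matching on a library
enum); inside the defining crate the compiler still sees every variant.

```rust
/// Sketch of a non-exhaustive enum; the variant names are assumptions.
#[non_exhaustive]
#[derive(Debug, Clone, Copy)]
enum TlsMode {
    Acme,
    Manual,
}

/// Hypothetical helper showing the wildcard-arm style.
fn uses_acme(mode: TlsMode) -> bool {
    match mode {
        TlsMode::Acme => true,
        // Covers Manual and any variant added later.
        _ => false,
    }
}
```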
Implement two-phase timeout in proxy_handler:
- Inner timeout uses per-site upstream_connect_timeout_secs (default 5s)
for the connect + first-byte phase
- Outer timeout uses upstream_request_timeout_secs (default 60s) for the
full request/response cycle
- Set connect_timeout on HttpConnector for both HTTP and HTTPS clients
(default 5s) to enforce TCP-level connect timeouts
- Use wrap_connector for HTTPS client to apply connect_timeout on the
underlying HttpConnector
- Add Ok(Err(_)) handler for connect timeout returning 504 Gateway Timeout
- Replace handle.abort() for HTTPS server tasks with timeout-based join,
allowing in-flight requests to drain before forceful shutdown
- Add shutdown_rx to start_admin_socket with tokio::select! for clean
accept loop exit and Unix socket file cleanup on shutdown
- Add shutdown_rx to start_eviction_task with tokio::select! for
cancellable eviction loop
- Add shutdown channel to spawn_acme_state for cancellable ACME state
machine via tokio::select!
- Pass Arc<GracefulShutdown> through setup_tls to ACME state machine
- Move GracefulShutdown creation before admin socket and TLS setup
- Update integration test for new start_eviction_task signature
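
One way to read the Ok(Err(_)) arm from the two-phase timeout change is
as the middle layer of two nested timeouts. The sketch below models only
the result shape and the status mapping, with no async runtime involved.
Elapsed and UpstreamError are stand-in types, and the nesting order
(outer request timeout wrapped around the inner connect timeout) is an
assumption about how proxy_handler is structured.

```rust
/// Stand-in for a timer-elapsed error (for example tokio's `Elapsed`).
#[derive(Debug)]
struct Elapsed;

/// Stand-in for an upstream connection or protocol error.
#[derive(Debug)]
struct UpstreamError;

/// Result of nesting two timeouts around one upstream call:
/// outer (full request) -> inner (connect + first byte) -> upstream result.
type TwoPhase<T> = Result<Result<Result<T, UpstreamError>, Elapsed>, Elapsed>;

/// Maps a failed outcome to an HTTP status; `None` means the upstream
/// response is forwarded as-is.
fn error_status<T>(outcome: &TwoPhase<T>) -> Option<u16> {
    match outcome {
        // Outer timeout: the whole request/response cycle took too long.
        Err(_) => Some(504),
        // Inner timeout: connect + first-byte phase took too long.
        Ok(Err(_)) => Some(504),
        // Upstream failed without timing out.
        Ok(Ok(Err(_))) => Some(502),
        Ok(Ok(Ok(_))) => None,
    }
}
```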
The main code changes were already committed (3f2550f), but test config
TOML strings in cli.rs, admin/socket.rs, shutdown.rs, and
integration_test.rs still needed the new acme_contact field to pass
validation rule 19.
Add log_request! calls for every proxied request (success, 4xx/5xx from
upstream, 502/504 errors) and log_upstream_error! calls for upstream
connection failures and timeouts. Duration is tracked from request entry
to response using std::time::Instant.
- Remove health_handler and /health early return from proxy_handler
- Remove /health route from proxy_router (now just fallback)
- Remove is_https field from ProxyState struct
- Remove is_https parameter from inject_proxy_headers, hardcode https
- Add comment explaining why X-Forwarded-Proto is always https
- Remove health_path_returns_200 and health_with_unknown_host tests
- Update all inject_proxy_headers test calls to remove is_https param
- Remove inject_proxy_headers_sets_x_forwarded_proto_http test
Extract strip_port_from_host into a shared utils module and update
normalize_host to strip brackets from IPv6 addresses correctly
([::1]:443 -> ::1) instead of using split(':').next(), which cuts the
host at the first colon.
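
A minimal sketch of bracket-aware port stripping follows. It is not the
code in the utils module; the handling of malformed input and of bare
(unbracketed) IPv6 literals is an assumption made for the example.

```rust
/// Removes a trailing `:port` and any IPv6 brackets from a host string.
fn strip_port_from_host(host: &str) -> &str {
    if let Some(rest) = host.strip_prefix('[') {
        // Bracketed IPv6 literal: keep what is inside the brackets.
        return match rest.find(']') {
            Some(end) => &rest[..end],
            // No closing bracket: leave malformed input untouched.
            None => host,
        };
    }
    // Exactly one colon means host:port; several colons mean a bare
    // IPv6 literal, which carries no port to strip.
    match host.rfind(':') {
        Some(idx) if host.matches(':').count() == 1 => &host[..idx],
        _ => host,
    }
}
```

For comparison, "[::1]:443".split(':').next() yields "[", which is the
behaviour this change replaces.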
Change ConfigReloadHandle.static_config from StaticConfig to ArcSwap<StaticConfig>
so that after each reload, the stored static config is updated with the new value.
This prevents repeated stale warnings about the same static config fields on
every reload.
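
The drift-tracking idea can be sketched with the standard library alone.
The block below swaps in RwLock<Arc<StaticConfig>> for ArcSwap so that it
needs no external crate; the https_port field and the bool return value
of reload are hypothetical, chosen only to make the behaviour observable.

```rust
use std::sync::{Arc, RwLock};

/// Hypothetical static config with a single field for illustration.
#[derive(Debug, Clone, PartialEq, Eq)]
struct StaticConfig {
    https_port: u16,
}

struct ConfigReloadHandle {
    // std stand-in for ArcSwap<StaticConfig>.
    static_config: RwLock<Arc<StaticConfig>>,
}

impl ConfigReloadHandle {
    fn new(initial: StaticConfig) -> Self {
        Self { static_config: RwLock::new(Arc::new(initial)) }
    }

    /// Returns true when the static section differs from the stored one
    /// (the case that would produce a warning), then stores the new value
    /// so the same difference is not reported again on the next reload.
    fn reload(&self, new_static: StaticConfig) -> bool {
        let mut stored = self.static_config.write().expect("lock poisoned");
        let changed = **stored != new_static;
        *stored = Arc::new(new_static);
        changed
    }
}
```

Without the store step, every later reload would compare against the
original value and report the same drift again.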
The init_creates_log_directory_and_file test called init() which sets a
global tracing subscriber. When tests run in parallel, other tests may
have already set the subscriber, causing init() to return an error and
the test to fail. Now the test tolerates the 'already set' error while
still asserting the log file is created.
Change http_port type from u16 to u32 to allow out-of-range values to be
caught by validation. Add HttpPortInvalid error variant and validation check
for http_port > 65535. Add test for http_port=65536 producing HttpPortInvalid.
http_port=0 (disabled) remains valid per existing test.
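
The new check can be pictured as follows. This is a sketch under
assumptions: the real HttpPortInvalid variant may carry different data,
and validate_http_port is a hypothetical free function standing in for
the rule inside the validator.

```rust
/// Sketch of the validation error; the u32 payload is an assumption.
#[derive(Debug, PartialEq, Eq)]
enum ValidationError {
    HttpPortInvalid(u32),
}

/// Rejects ports above the 16-bit range. Zero stays valid because it
/// means the HTTP listener is disabled.
fn validate_http_port(http_port: u32) -> Result<(), ValidationError> {
    if http_port > u32::from(u16::MAX) {
        Err(ValidationError::HttpPortInvalid(http_port))
    } else {
        Ok(())
    }
}
```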
Resolve OQ-08 through OQ-12 after reviewing implementation findings:
- OQ-08: Remove /health route from the main HTTPS listener entirely.
Health checking belongs on port 9900 and admin socket only, not on
the public-facing proxy. This eliminates upstream collision problems
and special-case routing logic. (ADR-022)
- OQ-09: Not an architectural unknown — ADR-015 already decided on a
separate connect timeout. The implementation gap is a known issue.
- OQ-10: Not an open question — acme_contact is already specified as
required in config.md. The empty contact list is bug C2.
- OQ-11: Hardcoded is_https=true is correct for a TLS-terminating
proxy: the HTTP listener redirects rather than proxies, so the code
only needs an explanatory comment.
- OQ-12: Access logging is already specified as mandatory/always-on in
operations.md. Missing log_request! calls are bug W13.
Updated docs: proxy.md, operations.md, overview.md, config.md,
open-questions.md, README.md, ADR-013. Created ADR-022.
Analyzed 29 findings from the implementation review (002-implementation-review.md)
and identified 8 architecture-level concerns requiring spec changes:
Architecture gaps addressed:
- C2: Added acme_contact field to config.md, tls.md, and operations.md.
Let's Encrypt requires a contact email for production; the spec was missing
this required field.
- C4: Added StaticConfig drift tracking requirement to config.md reload
section. ConfigReloadHandle must update its stored StaticConfig after each
successful reload to prevent stale warnings.
- W1: Updated shutdown sequence in operations.md to specify that server tasks
should be joined (not aborted) during the drain window.
- W5: Added health check path collision note to proxy.md.
- W13: Clarified that access logging is always-on in operations.md.
- W14: Updated X-Forwarded-Proto description in proxy.md to clarify that it
is always 'https' since the HTTP listener redirects rather than proxies.
New open questions added:
- OQ-08: Should /health use a less common path to avoid upstream collision?
- OQ-09: How should upstream_connect_timeout_secs be enforced?
- OQ-10: Should ACME contact email be a required config field?
- OQ-11: How should X-Forwarded-Proto be derived per-listener?
- OQ-12: Should request access logging be mandatory or optional?
The remaining 21 findings are implementation-level bugs, code quality issues,
or Phase 2 improvements that don't require architecture spec changes.
W13: No request access logging - log_request! macro defined but never called
W14: is_https hardcoded to true on ProxyState - X-Forwarded-Proto always https
S9: Rate limiting silently bypassed when no client IP found
S10: Integration test TOML has [[listeners.listeners.sites]] typo
S11: No Server response header added by proxy (upstream's is stripped)
- Replace determine_if_https() with ProxyState.is_https field so X-Forwarded-Proto
reflects the listener's protocol instead of guessing from the Host header
- Return ProxyError::BadGateway with host/upstream context for non-connect upstream
errors instead of bare StatusCode::BAD_GATEWAY
- Implement InFlightCounter with RAII guard for tracking in-flight connections
- Add drain_in_flight() to wait for connections to complete on shutdown, with
configurable timeout before forcing exit
- Mark review/core-components and review/integration-readiness as complete
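
The InFlightCounter and drain_in_flight() items above follow the usual
RAII-guard pattern, sketched below with blocking std primitives. The real
implementation presumably runs on the async runtime; the track and
in_flight method names, the polling interval, and the bool return value
are assumptions of this sketch.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Shared count of connections currently being served.
#[derive(Clone, Default)]
struct InFlightCounter {
    active: Arc<AtomicUsize>,
}

/// RAII guard: decrements the counter when dropped.
struct InFlightGuard {
    active: Arc<AtomicUsize>,
}

impl InFlightCounter {
    /// Registers one connection; it stays counted until the guard drops.
    fn track(&self) -> InFlightGuard {
        self.active.fetch_add(1, Ordering::SeqCst);
        InFlightGuard { active: Arc::clone(&self.active) }
    }

    fn in_flight(&self) -> usize {
        self.active.load(Ordering::SeqCst)
    }

    /// Polls until no connections remain or the timeout expires.
    /// Returns true if everything drained in time.
    fn drain_in_flight(&self, timeout: Duration) -> bool {
        let deadline = Instant::now() + timeout;
        while self.in_flight() > 0 {
            if Instant::now() >= deadline {
                return false;
            }
            thread::sleep(Duration::from_millis(5));
        }
        true
    }
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        self.active.fetch_sub(1, Ordering::SeqCst);
    }
}
```

Because the decrement lives in Drop, a connection is uncounted even when
its handler returns early or panics.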