Resolve 5 open questions, add 7 ADRs for previously undocumented decisions

Resolve open questions:
- OQ-01: Restrict cipher suites to match nginx scope (4 ECDHE-AES-GCM
  suites for TLS 1.2 + all TLS 1.3 suites) — ADR-012
- OQ-03: Health check on separate local port (default 9900, localhost
  only) — ADR-013
- OQ-04: Add Unix domain socket admin API for config reload alongside
  SIGHUP, with structured success/failure responses — ADR-014
- OQ-06: Per-site upstream timeouts with defaults (5s connect, 60s
  request), overridable in SiteConfig — ADR-015

Document previously undocumented decisions flagged by architecture review:
- ADR-016: Explicit bind address requirement (reject 0.0.0.0)
- ADR-017: Upstream connection defaults (HTTP/1.1, no redirects, pooling)
- ADR-018: 100 MB body size limit (matches nginx, Gitea compatibility)

OQ-07 (per-site TLS overrides) remains open for future consideration.

Spec updates:
- config.md: add health_check_port, admin_socket_path, per-site timeout
  fields, update TOML example and validation rules
- proxy.md: reference ADR-015/017/018 for timeouts, connection defaults,
  and body limit decisions
- tls.md: replace OQ-01 cipher suite section with ADR-012 decision
- operations.md: add local health check port section, admin socket reload
- overview.md: update Phase 1 scope with new features, add ADR references
- open-questions.md: resolve OQ-01/03/04/06, keep OQ-07 open
This commit is contained in:
2026-06-11 09:07:36 +00:00
parent 7efc142406
commit 9a2352e61c
14 changed files with 613 additions and 89 deletions

View File

@@ -0,0 +1,83 @@
# ADR-012: Restrict Cipher Suites to Match nginx Scope
## Status
Accepted
## Context
Our current nginx configuration explicitly restricts cipher suites to four
ECDHE-AES-GCM suites:
```
ECDHE-ECDSA-AES128-GCM-SHA256
ECDHE-RSA-AES128-GCM-SHA256
ECDHE-ECDSA-AES256-GCM-SHA384
ECDHE-RSA-AES256-GCM-SHA384
```
rustls 0.23 with the `aws_lc_rs` crypto provider defaults to a conservative
set that excludes all weak ciphers (no SHA-1, no 3DES, no RC4, no CBC-mode
suites, no RSA key exchange). The rustls defaults include these four suites
plus TLS 1.3 suites (which nginx also allows via `TLSv1.3`).
The question was whether to accept the wider rustls defaults or restrict to
the same scope as our current nginx configuration (see OQ-01).
## Decision
Explicitly restrict cipher suites to match the same scope as our current nginx
configuration: the four ECDHE-AES-GCM suites for TLS 1.2, plus all TLS 1.3
suites. This is slightly more restrictive than rustls defaults but preserves
compatibility with all modern clients.
The restricted set in rustls terms:
**TLS 1.2 (explicitly selected):**
- `TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256`
- `TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256`
- `TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384`
- `TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384`
**TLS 1.3 (all default suites):**
- `TLS_AES_128_GCM_SHA256`
- `TLS_AES_256_GCM_SHA384`
- `TLS_CHACHA20_POLY1305_SHA256`
This is configured by building a `CryptoProvider` with a custom
`cipher_suite` list and passing it to `ServerConfig::builder_with_provider()`.
## Rationale
- Maintains behavioral parity with our current nginx configuration during
migration — no client that worked with nginx should see different TLS
behavior
- The four TLS 1.2 suites are the same ones nginx allows, providing a known
security baseline
- TLS 1.3 suites are all modern and secure — restricting them provides no
meaningful security benefit and reduces compatibility
- rustls defaults include ChaCha20-Poly1305 suites for TLS 1.2 which nginx does
not; these are cryptographically sound but represent a wider scope than our
current nginx config allows
- Explicit configuration means the cipher list is documented and auditable
- If compatibility issues arise, expanding the list is straightforward; the
reverse (restricting after deployment) risks breaking existing clients
## Consequences
**Positive:**
- Behavioral parity with current nginx TLS configuration
- Explicitly auditable cipher list
- No client that currently works will see different TLS behavior
- Matches security review expectations
**Negative:**
- Slightly more restrictive than rustls defaults — excludes ChaCha20-Poly1305
for TLS 1.2 and AES-CCM suites (rarely used)
- Must update the cipher list when deprecating TLS 1.2 in the future
- Custom `CryptoProvider` construction is slightly more code than using defaults
## References
- [tls.md](../tls.md)
- OQ-01 (now resolved)

View File

@@ -0,0 +1,69 @@
# ADR-013: Health Check on Separate Local Port
## Status
Accepted
## Context
The health check endpoint (`/health`) needs to be accessible for monitoring
without requiring TLS. Currently the design places it on the main HTTPS
listener, which means:
1. TLS handshake must succeed for the health check to respond
2. External monitoring tools need to handle TLS
3. A TLS configuration error would make the health check unreachable, creating
a false-negative monitoring signal
Three options were considered (see OQ-03):
1. **Main HTTPS listener only**: Simplest, but TLS config errors make health
checks unreachable
2. **Separate unencrypted port on localhost**: Simple, works with standard
monitoring tools, but health checks bypass TLS
3. **Admin port with its own listener**: Most flexible but adds complexity
## Decision
Add a configurable health check port that binds to `127.0.0.1` only (localhost),
serving `/health` over plain HTTP. This is a separate listener from the main
HTTP and HTTPS listeners.
The port is configurable via `health_check_port` in StaticConfig. Setting it
to `0` (default) disables the separate health check listener, and `/health`
remains available on the main HTTPS listener as a fallback.
## Rationale
- A local-only health check port is the standard pattern for reverse proxies
and service meshes (envoy, haproxy, k8s health probes all use this pattern)
- Health checks should work even when TLS is misconfigured — that's the whole
point of monitoring
- Binding to `127.0.0.1` only means the health check is not exposed to the
internet — only local monitoring tools (systemd, scripts, load balancers on
the same host) can reach it
- Configurable port allows different deployment scenarios (some monitoring runs
on different ports)
- Disabling via `health_check_port = 0` keeps the main HTTPS `/health` endpoint
available for cases where a separate port isn't needed
- When this project is folded into alknet, the health check will use alknet's
existing patterns, making the separate port unnecessary in that context
## Consequences
**Positive:**
- Health checks work even when TLS is misconfigured
- Standard pattern that monitoring tools expect
- Not exposed to the internet (localhost only)
- Configurable — can be disabled if not needed
- systemd can use it for `NotifyAccess` readiness checks
**Negative:**
- Additional listener to manage (minimal complexity)
- Two health check endpoints exist when the separate port is enabled (the
local one and the HTTPS one) — monitoring should prefer the local one
## References
- [operations.md](../operations.md)
- OQ-03 (now resolved)

View File

@@ -0,0 +1,76 @@
# ADR-014: Unix Domain Socket Config Reload API
## Status
Accepted
## Context
The proxy supports config reload via SIGHUP (ADR-009). SIGHUP is simple and
well-understood, but has limitations:
1. No feedback — the sender doesn't know if the reload succeeded or failed
2. No structured input — you can only signal "reload", not specify which parts
to reload or pass validation context
3. Requires process signal permissions — not all deployment tools can send
signals
A Unix domain socket API would allow programmatic config reload with
success/failure status, enabling integration with CI/CD pipelines, admin
tools, and automated configuration management.
## Decision
Add a Unix domain socket API for config reload alongside SIGHUP. The socket
accepts commands and returns structured responses.
The socket path is configurable via `admin_socket_path` in StaticConfig
(default: `/run/reverse-proxy/admin.sock`). Setting it to empty string
disables the admin socket.
Initial commands:
- `reload` — Re-read config file, validate, and swap DynamicConfig. Returns
`{"status": "ok"}` or `{"status": "error", "message": "..."}`.
- `status` — Return basic process info (uptime, config load time, site count).
Returns `{"status": "ok", "uptime_secs": 1234, "sites": 2}`.
Future commands (not in Phase 1, but the protocol supports extension):
- `metrics` — Return Prometheus-compatible metrics
- `shutdown` — Graceful shutdown command
## Rationale
- Providing reload feedback is operationally valuable — CI/CD pipelines can
verify config changes before proceeding
- The implementation cost is low — a Unix domain socket listener is ~50 lines
of tokio code, and the command protocol is simple
- SIGHUP is retained as a fallback for environments where socket access is
inconvenient
- This pattern will integrate naturally with alknet's admin interface when the
projects merge
- Unix domain sockets are filesystem-permission-based, providing access control
without additional authentication
- The socket path is configurable, allowing deployment-specific paths
## Consequences
**Positive:**
- Config reload with success/failure feedback
- Programmatic integration with CI/CD and admin tools
- Structured response format enables automation
- SIGHUP still works as fallback
- Natural path to future admin commands
**Negative:**
- Additional listener and command parsing logic (~100-150 lines)
- Socket file management (cleanup on startup, stale socket detection)
- One more config option (`admin_socket_path`)
## References
- [operations.md](../operations.md)
- [config.md](../config.md)
- ADR-009 (signal handling — SIGHUP retained as fallback)
- OQ-04 (now resolved)

View File

@@ -0,0 +1,69 @@
# ADR-015: Per-Site Upstream Timeouts with Defaults
## Status
Accepted
## Context
The proxy forwards requests to upstream services. Connection and request
timeouts affect reliability — too short and legitimate slow responses fail,
too long and the proxy is vulnerable to resource exhaustion from stalled
connections.
Phase 1 initially specified global timeout defaults (5s connect, 60s request)
for all upstreams. However, different upstream services have different latency
profiles:
- Gitea (git.alk.dev): Git push operations can be slow for large repos; 60s
may be insufficient for clone/push operations with large pack files
- Deno/Fresh (alk.dev): Fast responses expected; 60s is generous
Global timeouts don't accommodate these differences. Per-site timeout
configuration allows tuning for each upstream without affecting others.
## Decision
Add optional per-site upstream timeout configuration to `SiteConfig`. When not
specified, sensible defaults are used.
**SiteConfig additions:**
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `upstream_connect_timeout_secs` | `u64` | `5` | TCP connection timeout in seconds |
| `upstream_request_timeout_secs` | `u64` | `60` | Full request timeout in seconds |
These are part of `DynamicConfig` (hot-reloadable via ArcSwap) since they
affect per-request behavior and should not require a restart to change.
## Rationale
- Different upstreams genuinely have different latency profiles — Gitea pushes
with large pack files need more time than a fast static site
- Defaults of 5s connect and 60s request match common reverse proxy conventions
(nginx defaults: 60s, haproxy defaults: 30s connect, 60s server)
- Making these per-site rather than global allows tuning without side effects
- Per-site overrides in DynamicConfig means timeout changes don't require
restarts
- The defaults are reasonable for most services; explicit configuration is only
needed for outliers
## Consequences
**Positive:**
- Each upstream can be tuned independently
- Defaults work for most cases — explicit configuration is optional
- Hot-reloadable (part of DynamicConfig)
- Consistent with how other reverse proxies handle timeouts
**Negative:**
- Two more fields per site in config (mitigated by sensible defaults)
- Per-site timeout means the proxy must look up per-request config for each
upstream connection (already required for routing, so no additional overhead)
## References
- [proxy.md](../proxy.md)
- [config.md](../config.md)
- OQ-06 (now resolved)

View File

@@ -0,0 +1,53 @@
# ADR-016: Explicit Bind Address Requirement
## Status
Accepted
## Context
The proxy's `bind_addr` configuration field specifies the IP address to bind
to. The default for most network services is `0.0.0.0` (bind all interfaces),
which means the service is accessible on every network interface, including
public-facing ones.
For a reverse proxy that terminates TLS and handles security-sensitive traffic,
binding to all interfaces is risky. The proxy should only listen on the
intended interface(s) — typically a specific public IP for a single-server
deployment.
## Decision
The `bind_addr` field must be an explicit IP address. `0.0.0.0` is rejected
during config validation. The proxy will not start if `bind_addr` is `0.0.0.0`.
## Rationale
- A reverse proxy terminating TLS is a security boundary — it should not be
accidentally exposed on unintended interfaces
- Single-server deployments have a specific public IP; binding to it
deliberately is the correct operational practice
- `0.0.0.0` is often a default in example configurations and can be deployed
without the operator realizing the service is accessible on all interfaces
- Rejecting it at validation time prevents this common mistake
- If a deployment genuinely needs to bind all interfaces, `0.0.0.0` can be
overridden with an explicit flag, but this should be a deliberate choice
- This matches the principle of explicit over implicit for security-sensitive
configuration
## Consequences
**Positive:**
- Prevents accidental exposure on unintended network interfaces
- Forces operators to be deliberate about which interface the proxy serves
- Config validation catches the mistake before deployment
**Negative:**
- Not suitable for deployments that genuinely need to bind all interfaces
(mitigated by explicit override if needed in the future)
- Slightly more configuration required (operator must know their public IP)
## References
- [config.md](../config.md)
- [overview.md](../overview.md)

View File

@@ -0,0 +1,66 @@
# ADR-017: Upstream Connection Defaults
## Status
Accepted
## Context
The proxy forwards requests to upstream services using a shared hyper `Client`.
The client's configuration affects all upstream connections: timeouts, redirect
handling, and connection pooling. These are significant design choices that
affect production behavior and should be traceable.
## Decision
Configure the hyper `Client` with the following defaults:
| Setting | Value | Rationale |
|---------|-------|-----------|
| Connect timeout | 5 seconds | Prevents hanging on unreachable upstreams |
| Request timeout | 60 seconds (default) | Per-site override available (ADR-015) |
| Redirect following | Disabled | Proxies should not follow redirects — they pass response status to the client |
| Connection pooling | Enabled (hyper default) | Reuses connections to the same upstream for performance |
| HTTP version | HTTP/1.1 only | Upstream connections use HTTP/1.1; HTTP/2 proxying is out of scope |
The connect timeout of 5 seconds applies to establishing a TCP connection to
the upstream. The request timeout of 60 seconds applies to the full
request/response cycle and can be overridden per-site (ADR-015).
## Rationale
- **5s connect timeout**: Matches common reverse proxy conventions. Long
enough for remote upstreams with network latency, short enough to fail fast
on unreachable services.
- **60s request timeout**: Generous default that accommodates Gitea push
operations with large pack files. Per-site overrides allow tuning for faster
upstreams.
- **No redirect following**: Standard reverse proxy behavior. If an upstream
returns a 3xx, the proxy streams it to the client. Following redirects would
break the client's expectation and could create security issues (redirect
loops, credential forwarding).
- **Connection pooling**: Hyper's default connection reuse improves performance
for repeated requests to the same upstream without additional configuration.
- **HTTP/1.1 only**: The proxy communicates with upstreams over plain HTTP
(loopback). HTTP/2 proxying is explicitly out of scope (see overview.md).
Upstreams needing HTTP/2+ run their own servers (e.g., api.alk.dev).
## Consequences
**Positive:**
- Explicit, documented defaults that match reverse proxy conventions
- Per-site timeout overrides available for specific needs
- No redirect following prevents common proxy pitfalls
- Connection pooling provides good performance by default
**Negative:**
- 60s default may be too long for fast upstreams (mitigated by per-site
overrides in ADR-015)
- HTTP/1.1 only for upstream means no HTTP/2 multiplexing to backends (out of
scope per project design)
## References
- [proxy.md](../proxy.md)
- ADR-015 (per-site upstream timeouts)
- ADR-002 (custom proxy handler)

View File

@@ -0,0 +1,53 @@
# ADR-018: Request Body Size Limit
## Status
Accepted
## Context
The proxy enforces a maximum request body size to protect against resource
exhaustion attacks. The default limit must balance security with usability.
Gitea push operations can involve large Git pack files. The current nginx
configuration uses `client_max_body_size 100m`, and Gitea's documentation
recommends allowing up to 100 MB for push operations.
## Decision
Set the default request body size limit to 100 MB (104,857,600 bytes),
matching our current nginx configuration. The limit is global in Phase 1
(configurable via `body.limit_bytes` in DynamicConfig).
## Rationale
- 100 MB matches the current nginx `client_max_body_size 100m`, ensuring
behavioral parity during migration
- Gitea push operations with large repositories regularly exceed 50 MB
- 100 MB is large enough for any legitimate Git operation while still
providing protection against resource exhaustion (a 100 MB body is not
enough to exhaust memory on modern servers, but prevents unbounded uploads)
- The limit is configurable — operators can reduce it for deployments that
don't need large uploads
- In Phase 2, per-site limits will allow different limits for different
upstreams (e.g., a lower limit for alk.dev, the current limit for
git.alk.dev)
## Consequences
**Positive:**
- Behavioral parity with current nginx configuration
- Gitea push operations work without configuration changes
- Configurable for deployments with different needs
**Negative:**
- 100 MB is a generous default — some deployments may want a lower limit
(mitigated by configurability)
- Global limit means all sites share the same maximum (mitigated by Phase 2
per-site limits)
## References
- [proxy.md](../proxy.md)
- [config.md](../config.md)
- nginx `client_max_body_size` documentation