Resolve 5 open questions, add 7 ADRs for previously undocumented decisions
Resolve open questions: - OQ-01: Restrict cipher suites to match nginx scope (4 ECDHE-AES-GCM suites for TLS 1.2 + all TLS 1.3 suites) — ADR-012 - OQ-03: Health check on separate local port (default 9900, localhost only) — ADR-013 - OQ-04: Add Unix domain socket admin API for config reload alongside SIGHUP, with structured success/failure responses — ADR-014 - OQ-06: Per-site upstream timeouts with defaults (5s connect, 60s request), overridable in SiteConfig — ADR-015 Document previously undocumented decisions flagged by architecture review: - ADR-016: Explicit bind address requirement (reject 0.0.0.0) - ADR-017: Upstream connection defaults (HTTP/1.1, no redirects, pooling) - ADR-018: 100 MB body size limit (matches nginx, Gitea compatibility) OQ-07 (per-site TLS overrides) remains open for future consideration. Spec updates: - config.md: add health_check_port, admin_socket_path, per-site timeout fields, update TOML example and validation rules - proxy.md: reference ADR-015/017/018 for timeouts, connection defaults, and body limit decisions - tls.md: replace OQ-01 cipher suite section with ADR-012 decision - operations.md: add local health check port section, admin socket reload - overview.md: update Phase 1 scope with new features, add ADR references - open-questions.md: resolve OQ-01/03/04/06, keep OQ-07 open
This commit is contained in:
83
docs/architecture/decisions/012-cipher-suite-restriction.md
Normal file
83
docs/architecture/decisions/012-cipher-suite-restriction.md
Normal file
@@ -0,0 +1,83 @@
|
||||
# ADR-012: Restrict Cipher Suites to Match nginx Scope
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
Our current nginx configuration explicitly restricts cipher suites to four
|
||||
ECDHE-AES-GCM suites:
|
||||
|
||||
```
|
||||
ECDHE-ECDSA-AES128-GCM-SHA256
|
||||
ECDHE-RSA-AES128-GCM-SHA256
|
||||
ECDHE-ECDSA-AES256-GCM-SHA384
|
||||
ECDHE-RSA-AES256-GCM-SHA384
|
||||
```
|
||||
|
||||
rustls 0.23 with the `aws_lc_rs` crypto provider defaults to a conservative
|
||||
set that excludes all weak ciphers (no SHA-1, no 3DES, no RC4, no CBC-mode
|
||||
suites, no RSA key exchange). The rustls defaults include these four suites
|
||||
plus TLS 1.3 suites (which nginx also allows via `TLSv1.3`).
|
||||
|
||||
The question was whether to accept the wider rustls defaults or restrict to
|
||||
the same scope as our current nginx configuration (see OQ-01).
|
||||
|
||||
## Decision
|
||||
|
||||
Explicitly restrict cipher suites to match the same scope as our current nginx
|
||||
configuration: the four ECDHE-AES-GCM suites for TLS 1.2, plus all TLS 1.3
|
||||
suites. This is slightly more restrictive than rustls defaults but preserves
|
||||
compatibility with all modern clients.
|
||||
|
||||
The restricted set in rustls terms:
|
||||
|
||||
**TLS 1.2 (explicitly selected):**
|
||||
- `TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256`
|
||||
- `TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256`
|
||||
- `TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384`
|
||||
- `TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384`
|
||||
|
||||
**TLS 1.3 (all default suites):**
|
||||
- `TLS_AES_128_GCM_SHA256`
|
||||
- `TLS_AES_256_GCM_SHA384`
|
||||
- `TLS_CHACHA20_POLY1305_SHA256`
|
||||
|
||||
This is configured by building a `CryptoProvider` with a custom
|
||||
`cipher_suite` list and passing it to `ServerConfig::builder_with_provider()`.
|
||||
|
||||
## Rationale
|
||||
|
||||
- Maintains behavioral parity with our current nginx configuration during
|
||||
migration — no client that worked with nginx should see different TLS
|
||||
behavior
|
||||
- The four TLS 1.2 suites are the same ones nginx allows, providing a known
|
||||
security baseline
|
||||
- TLS 1.3 suites are all modern and secure — restricting them provides no
|
||||
meaningful security benefit and reduces compatibility
|
||||
- rustls defaults include ChaCha20-Poly1305 suites for TLS 1.2 which nginx does
|
||||
not; these are cryptographically sound but represent a wider scope than our
|
||||
current nginx config allows
|
||||
- Explicit configuration means the cipher list is documented and auditable
|
||||
- If compatibility issues arise, expanding the list is straightforward; the
|
||||
reverse (restricting after deployment) risks breaking existing clients
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- Behavioral parity with current nginx TLS configuration
|
||||
- Explicitly auditable cipher list
|
||||
- No client that currently works will see different TLS behavior
|
||||
- Matches security review expectations
|
||||
|
||||
**Negative:**
|
||||
- Slightly more restrictive than rustls defaults — excludes ChaCha20-Poly1305
|
||||
for TLS 1.2 and AES-CCM suites (rarely used)
|
||||
- Must update the cipher list when deprecating TLS 1.2 in the future
|
||||
- Custom `CryptoProvider` construction is slightly more code than using defaults
|
||||
|
||||
## References
|
||||
|
||||
- [tls.md](../tls.md)
|
||||
- OQ-01 (now resolved)
|
||||
69
docs/architecture/decisions/013-health-check-port.md
Normal file
69
docs/architecture/decisions/013-health-check-port.md
Normal file
@@ -0,0 +1,69 @@
|
||||
# ADR-013: Health Check on Separate Local Port
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The health check endpoint (`/health`) needs to be accessible for monitoring
|
||||
without requiring TLS. Currently the design places it on the main HTTPS
|
||||
listener, which means:
|
||||
|
||||
1. TLS handshake must succeed for the health check to respond
|
||||
2. External monitoring tools need to handle TLS
|
||||
3. A TLS configuration error would make the health check unreachable, creating
|
||||
a false-negative monitoring signal
|
||||
|
||||
Three options were considered (see OQ-03):
|
||||
|
||||
1. **Main HTTPS listener only**: Simplest, but TLS config errors make health
|
||||
checks unreachable
|
||||
2. **Separate unencrypted port on localhost**: Simple, works with standard
|
||||
monitoring tools, but health checks bypass TLS
|
||||
3. **Admin port with its own listener**: Most flexible but adds complexity
|
||||
|
||||
## Decision
|
||||
|
||||
Add a configurable health check port that binds to `127.0.0.1` only (localhost),
|
||||
serving `/health` over plain HTTP. This is a separate listener from the main
|
||||
HTTP and HTTPS listeners.
|
||||
|
||||
The port is configurable via `health_check_port` in StaticConfig. Setting it
|
||||
to `0` (default) disables the separate health check listener, and `/health`
|
||||
remains available on the main HTTPS listener as a fallback.
|
||||
|
||||
## Rationale
|
||||
|
||||
- A local-only health check port is the standard pattern for reverse proxies
|
||||
and service meshes (envoy, haproxy, k8s health probes all use this pattern)
|
||||
- Health checks should work even when TLS is misconfigured — that's the whole
|
||||
point of monitoring
|
||||
- Binding to `127.0.0.1` only means the health check is not exposed to the
|
||||
internet — only local monitoring tools (systemd, scripts, load balancers on
|
||||
the same host) can reach it
|
||||
- Configurable port allows different deployment scenarios (some monitoring runs
|
||||
on different ports)
|
||||
- Disabling via `health_check_port = 0` keeps the main HTTPS `/health` endpoint
|
||||
available for cases where a separate port isn't needed
|
||||
- When this project is folded into alknet, the health check will use alknet's
|
||||
existing patterns, making the separate port unnecessary in that context
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- Health checks work even when TLS is misconfigured
|
||||
- Standard pattern that monitoring tools expect
|
||||
- Not exposed to the internet (localhost only)
|
||||
- Configurable — can be disabled if not needed
|
||||
- systemd can use it for `NotifyAccess` readiness checks
|
||||
|
||||
**Negative:**
|
||||
- Additional listener to manage (minimal complexity)
|
||||
- Two health check endpoints exist when the separate port is enabled (the
|
||||
local one and the HTTPS one) — monitoring should prefer the local one
|
||||
|
||||
## References
|
||||
|
||||
- [operations.md](../operations.md)
|
||||
- OQ-03 (now resolved)
|
||||
76
docs/architecture/decisions/014-unix-socket-reload.md
Normal file
76
docs/architecture/decisions/014-unix-socket-reload.md
Normal file
@@ -0,0 +1,76 @@
|
||||
# ADR-014: Unix Domain Socket Config Reload API
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The proxy supports config reload via SIGHUP (ADR-009). SIGHUP is simple and
|
||||
well-understood, but has limitations:
|
||||
|
||||
1. No feedback — the sender doesn't know if the reload succeeded or failed
|
||||
2. No structured input — you can only signal "reload", not specify which parts
|
||||
to reload or pass validation context
|
||||
3. Requires process signal permissions — not all deployment tools can send
|
||||
signals
|
||||
|
||||
A Unix domain socket API would allow programmatic config reload with
|
||||
success/failure status, enabling integration with CI/CD pipelines, admin
|
||||
tools, and automated configuration management.
|
||||
|
||||
## Decision
|
||||
|
||||
Add a Unix domain socket API for config reload alongside SIGHUP. The socket
|
||||
accepts commands and returns structured responses.
|
||||
|
||||
The socket path is configurable via `admin_socket_path` in StaticConfig
|
||||
(default: `/run/reverse-proxy/admin.sock`). Setting it to empty string
|
||||
disables the admin socket.
|
||||
|
||||
Initial commands:
|
||||
|
||||
- `reload` — Re-read config file, validate, and swap DynamicConfig. Returns
|
||||
`{"status": "ok"}` or `{"status": "error", "message": "..."}`.
|
||||
- `status` — Return basic process info (uptime, config load time, site count).
|
||||
Returns `{"status": "ok", "uptime_secs": 1234, "sites": 2}`.
|
||||
|
||||
Future commands (not in Phase 1, but the protocol supports extension):
|
||||
|
||||
- `metrics` — Return Prometheus-compatible metrics
|
||||
- `shutdown` — Graceful shutdown command
|
||||
|
||||
## Rationale
|
||||
|
||||
- Providing reload feedback is operationally valuable — CI/CD pipelines can
|
||||
verify config changes before proceeding
|
||||
- The implementation cost is low — a Unix domain socket listener is ~50 lines
|
||||
of tokio code, and the command protocol is simple
|
||||
- SIGHUP is retained as a fallback for environments where socket access is
|
||||
inconvenient
|
||||
- This pattern will integrate naturally with alknet's admin interface when the
|
||||
projects merge
|
||||
- Unix domain sockets are filesystem-permission-based, providing access control
|
||||
without additional authentication
|
||||
- The socket path is configurable, allowing deployment-specific paths
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- Config reload with success/failure feedback
|
||||
- Programmatic integration with CI/CD and admin tools
|
||||
- Structured response format enables automation
|
||||
- SIGHUP still works as fallback
|
||||
- Natural path to future admin commands
|
||||
|
||||
**Negative:**
|
||||
- Additional listener and command parsing logic (~100-150 lines)
|
||||
- Socket file management (cleanup on startup, stale socket detection)
|
||||
- One more config option (`admin_socket_path`)
|
||||
|
||||
## References
|
||||
|
||||
- [operations.md](../operations.md)
|
||||
- [config.md](../config.md)
|
||||
- ADR-009 (signal handling — SIGHUP retained as fallback)
|
||||
- OQ-04 (now resolved)
|
||||
69
docs/architecture/decisions/015-per-site-timeouts.md
Normal file
69
docs/architecture/decisions/015-per-site-timeouts.md
Normal file
@@ -0,0 +1,69 @@
|
||||
# ADR-015: Per-Site Upstream Timeouts with Defaults
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The proxy forwards requests to upstream services. Connection and request
|
||||
timeouts affect reliability — too short and legitimate slow responses fail,
|
||||
too long and the proxy is vulnerable to resource exhaustion from stalled
|
||||
connections.
|
||||
|
||||
Phase 1 initially specified global timeout defaults (5s connect, 60s request)
|
||||
for all upstreams. However, different upstream services have different latency
|
||||
profiles:
|
||||
|
||||
- Gitea (git.alk.dev): Git push operations can be slow for large repos; 60s
|
||||
may be insufficient for clone/push operations with large pack files
|
||||
- Deno/Fresh (alk.dev): Fast responses expected; 60s is generous
|
||||
|
||||
Global timeouts don't accommodate these differences. Per-site timeout
|
||||
configuration allows tuning for each upstream without affecting others.
|
||||
|
||||
## Decision
|
||||
|
||||
Add optional per-site upstream timeout configuration to `SiteConfig`. When not
|
||||
specified, sensible defaults are used.
|
||||
|
||||
**SiteConfig additions:**
|
||||
|
||||
| Field | Type | Default | Description |
|
||||
|-------|------|---------|-------------|
|
||||
| `upstream_connect_timeout_secs` | `u64` | `5` | TCP connection timeout in seconds |
|
||||
| `upstream_request_timeout_secs` | `u64` | `60` | Full request timeout in seconds |
|
||||
|
||||
These are part of `DynamicConfig` (hot-reloadable via ArcSwap) since they
|
||||
affect per-request behavior and should not require a restart to change.
|
||||
|
||||
## Rationale
|
||||
|
||||
- Different upstreams genuinely have different latency profiles — Gitea pushes
|
||||
with large pack files need more time than a fast static site
|
||||
- Defaults of 5s connect and 60s request match common reverse proxy conventions
|
||||
(nginx defaults: 60s, haproxy defaults: 30s connect, 60s server)
|
||||
- Making these per-site rather than global allows tuning without side effects
|
||||
- Per-site overrides in DynamicConfig means timeout changes don't require
|
||||
restarts
|
||||
- The defaults are reasonable for most services; explicit configuration is only
|
||||
needed for outliers
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- Each upstream can be tuned independently
|
||||
- Defaults work for most cases — explicit configuration is optional
|
||||
- Hot-reloadable (part of DynamicConfig)
|
||||
- Consistent with how other reverse proxies handle timeouts
|
||||
|
||||
**Negative:**
|
||||
- Two more fields per site in config (mitigated by sensible defaults)
|
||||
- Per-site timeout means the proxy must look up per-request config for each
|
||||
upstream connection (already required for routing, so no additional overhead)
|
||||
|
||||
## References
|
||||
|
||||
- [proxy.md](../proxy.md)
|
||||
- [config.md](../config.md)
|
||||
- OQ-06 (now resolved)
|
||||
53
docs/architecture/decisions/016-explicit-bind-address.md
Normal file
53
docs/architecture/decisions/016-explicit-bind-address.md
Normal file
@@ -0,0 +1,53 @@
|
||||
# ADR-016: Explicit Bind Address Requirement
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The proxy's `bind_addr` configuration field specifies the IP address to bind
|
||||
to. The default for most network services is `0.0.0.0` (bind all interfaces),
|
||||
which means the service is accessible on every network interface, including
|
||||
public-facing ones.
|
||||
|
||||
For a reverse proxy that terminates TLS and handles security-sensitive traffic,
|
||||
binding to all interfaces is risky. The proxy should only listen on the
|
||||
intended interface(s) — typically a specific public IP for a single-server
|
||||
deployment.
|
||||
|
||||
## Decision
|
||||
|
||||
The `bind_addr` field must be an explicit IP address. `0.0.0.0` is rejected
|
||||
during config validation. The proxy will not start if `bind_addr` is `0.0.0.0`.
|
||||
|
||||
## Rationale
|
||||
|
||||
- A reverse proxy terminating TLS is a security boundary — it should not be
|
||||
accidentally exposed on unintended interfaces
|
||||
- Single-server deployments have a specific public IP; binding to it
|
||||
deliberately is the correct operational practice
|
||||
- `0.0.0.0` is often a default in example configurations and can be deployed
|
||||
without the operator realizing the service is accessible on all interfaces
|
||||
- Rejecting it at validation time prevents this common mistake
|
||||
- If a deployment genuinely needs to bind all interfaces, `0.0.0.0` can be
|
||||
overridden with an explicit flag, but this should be a deliberate choice
|
||||
- This matches the principle of explicit over implicit for security-sensitive
|
||||
configuration
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- Prevents accidental exposure on unintended network interfaces
|
||||
- Forces operators to be deliberate about which interface the proxy serves
|
||||
- Config validation catches the mistake before deployment
|
||||
|
||||
**Negative:**
|
||||
- Not suitable for deployments that genuinely need to bind all interfaces
|
||||
(mitigated by explicit override if needed in the future)
|
||||
- Slightly more configuration required (operator must know their public IP)
|
||||
|
||||
## References
|
||||
|
||||
- [config.md](../config.md)
|
||||
- [overview.md](../overview.md)
|
||||
@@ -0,0 +1,66 @@
|
||||
# ADR-017: Upstream Connection Defaults
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The proxy forwards requests to upstream services using a shared hyper `Client`.
|
||||
The client's configuration affects all upstream connections: timeouts, redirect
|
||||
handling, and connection pooling. These are significant design choices that
|
||||
affect production behavior and should be traceable.
|
||||
|
||||
## Decision
|
||||
|
||||
Configure the hyper `Client` with the following defaults:
|
||||
|
||||
| Setting | Value | Rationale |
|
||||
|---------|-------|-----------|
|
||||
| Connect timeout | 5 seconds | Prevents hanging on unreachable upstreams |
|
||||
| Request timeout | 60 seconds (default) | Per-site override available (ADR-015) |
|
||||
| Redirect following | Disabled | Proxies should not follow redirects — they pass response status to the client |
|
||||
| Connection pooling | Enabled (hyper default) | Reuses connections to the same upstream for performance |
|
||||
| HTTP version | HTTP/1.1 only | Upstream connections use HTTP/1.1; HTTP/2 proxying is out of scope |
|
||||
|
||||
The connect timeout of 5 seconds applies to establishing a TCP connection to
|
||||
the upstream. The request timeout of 60 seconds applies to the full
|
||||
request/response cycle and can be overridden per-site (ADR-015).
|
||||
|
||||
## Rationale
|
||||
|
||||
- **5s connect timeout**: Matches common reverse proxy conventions. Long
|
||||
enough for remote upstreams with network latency, short enough to fail fast
|
||||
on unreachable services.
|
||||
- **60s request timeout**: Generous default that accommodates Gitea push
|
||||
operations with large pack files. Per-site overrides allow tuning for faster
|
||||
upstreams.
|
||||
- **No redirect following**: Standard reverse proxy behavior. If an upstream
|
||||
returns a 3xx, the proxy streams it to the client. Following redirects would
|
||||
break the client's expectation and could create security issues (redirect
|
||||
loops, credential forwarding).
|
||||
- **Connection pooling**: Hyper's default connection reuse improves performance
|
||||
for repeated requests to the same upstream without additional configuration.
|
||||
- **HTTP/1.1 only**: The proxy communicates with upstreams over plain HTTP
|
||||
(loopback). HTTP/2 proxying is explicitly out of scope (see overview.md).
|
||||
Upstreams needing HTTP/2+ run their own servers (e.g., api.alk.dev).
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- Explicit, documented defaults that match reverse proxy conventions
|
||||
- Per-site timeout overrides available for specific needs
|
||||
- No redirect following prevents common proxy pitfalls
|
||||
- Connection pooling provides good performance by default
|
||||
|
||||
**Negative:**
|
||||
- 60s default may be too long for fast upstreams (mitigated by per-site
|
||||
overrides in ADR-015)
|
||||
- HTTP/1.1 only for upstream means no HTTP/2 multiplexing to backends (out of
|
||||
scope per project design)
|
||||
|
||||
## References
|
||||
|
||||
- [proxy.md](../proxy.md)
|
||||
- ADR-015 (per-site upstream timeouts)
|
||||
- ADR-002 (custom proxy handler)
|
||||
53
docs/architecture/decisions/018-body-size-limit.md
Normal file
53
docs/architecture/decisions/018-body-size-limit.md
Normal file
@@ -0,0 +1,53 @@
|
||||
# ADR-018: Request Body Size Limit
|
||||
|
||||
## Status
|
||||
|
||||
Accepted
|
||||
|
||||
## Context
|
||||
|
||||
The proxy enforces a maximum request body size to protect against resource
|
||||
exhaustion attacks. The default limit must balance security with usability.
|
||||
|
||||
Gitea push operations can involve large Git pack files. The current nginx
|
||||
configuration uses `client_max_body_size 100m`, and Gitea's documentation
|
||||
recommends allowing up to 100 MB for push operations.
|
||||
|
||||
## Decision
|
||||
|
||||
Set the default request body size limit to 100 MB (104,857,600 bytes),
|
||||
matching our current nginx configuration. The limit is global in Phase 1
|
||||
(configurable via `body.limit_bytes` in DynamicConfig).
|
||||
|
||||
## Rationale
|
||||
|
||||
- 100 MB matches the current nginx `client_max_body_size 100m`, ensuring
|
||||
behavioral parity during migration
|
||||
- Gitea push operations with large repositories regularly exceed 50 MB
|
||||
- 100 MB is large enough for any legitimate Git operation while still
|
||||
providing protection against resource exhaustion (a 100 MB body is not
|
||||
enough to exhaust memory on modern servers, but prevents unbounded uploads)
|
||||
- The limit is configurable — operators can reduce it for deployments that
|
||||
don't need large uploads
|
||||
- In Phase 2, per-site limits will allow different limits for different
|
||||
upstreams (e.g., a lower limit for alk.dev, the current limit for
|
||||
git.alk.dev)
|
||||
|
||||
## Consequences
|
||||
|
||||
**Positive:**
|
||||
- Behavioral parity with current nginx configuration
|
||||
- Gitea push operations work without configuration changes
|
||||
- Configurable for deployments with different needs
|
||||
|
||||
**Negative:**
|
||||
- 100 MB is a generous default — some deployments may want a lower limit
|
||||
(mitigated by configurability)
|
||||
- Global limit means all sites share the same maximum (mitigated by Phase 2
|
||||
per-site limits)
|
||||
|
||||
## References
|
||||
|
||||
- [proxy.md](../proxy.md)
|
||||
- [config.md](../config.md)
|
||||
- nginx `client_max_body_size` documentation
|
||||
Reference in New Issue
Block a user