Resolve 5 open questions, add 7 ADRs for previously undocumented decisions

Resolve open questions:
- OQ-01: Restrict cipher suites to match nginx scope (4 ECDHE-AES-GCM
  suites for TLS 1.2 + all TLS 1.3 suites) — ADR-012
- OQ-03: Health check on separate local port (default 9900, localhost
  only) — ADR-013
- OQ-04: Add Unix domain socket admin API for config reload alongside
  SIGHUP, with structured success/failure responses — ADR-014
- OQ-06: Per-site upstream timeouts with defaults (5s connect, 60s
  request), overridable in SiteConfig — ADR-015

Document previously undocumented decisions flagged by architecture review:
- ADR-016: Explicit bind address requirement (reject 0.0.0.0)
- ADR-017: Upstream connection defaults (HTTP/1.1, no redirects, pooling)
- ADR-018: 100 MB body size limit (matches nginx, Gitea compatibility)

OQ-07 (per-site TLS overrides) remains open for future consideration.

Spec updates:
- config.md: add health_check_port, admin_socket_path, per-site timeout
  fields, update TOML example and validation rules
- proxy.md: reference ADR-015/017/018 for timeouts, connection defaults,
  and body limit decisions
- tls.md: replace OQ-01 cipher suite section with ADR-012 decision
- operations.md: add local health check port section, admin socket reload
- overview.md: update Phase 1 scope with new features, add ADR references
- open-questions.md: resolve OQ-01/03/04/06, keep OQ-07 open
This commit is contained in:
2026-06-11 09:07:36 +00:00
parent 7efc142406
commit 9a2352e61c
14 changed files with 613 additions and 89 deletions

View File

@@ -43,6 +43,13 @@ certificate via ACME.
| [009](decisions/009-signal-handling.md) | Signal Handling Strategy | Accepted |
| [010](decisions/010-multi-site-phase1.md) | Multi-Site Support in Phase 1 | Accepted |
| [011](decisions/011-multi-domain-tls.md) | Multi-Domain TLS Configuration | Accepted |
| [012](decisions/012-cipher-suite-restriction.md) | Restrict Cipher Suites to nginx Scope | Accepted |
| [013](decisions/013-health-check-port.md) | Health Check on Separate Local Port | Accepted |
| [014](decisions/014-unix-socket-reload.md) | Unix Domain Socket Config Reload API | Accepted |
| [015](decisions/015-per-site-timeouts.md) | Per-Site Upstream Timeouts with Defaults | Accepted |
| [016](decisions/016-explicit-bind-address.md) | Explicit Bind Address Requirement | Accepted |
| [017](decisions/017-upstream-connection-defaults.md) | Upstream Connection Defaults | Accepted |
| [018](decisions/018-body-size-limit.md) | Request Body Size Limit | Accepted |
## Open Questions
@@ -50,12 +57,12 @@ See [open-questions.md](open-questions.md) for the full tracker.
| OQ | Question | Priority | Status |
|----|----------|----------|--------|
| OQ-01 | Should cipher suites be restricted beyond rustls defaults? | medium | open |
| ~~OQ-01~~ | ~~Should cipher suites be restricted beyond rustls defaults?~~ | ~~medium~~ | **resolved** (ADR-012) |
| ~~OQ-02~~ | ~~What log format should fail2ban consume?~~ | ~~high~~ | **resolved** (ADR-007) |
| OQ-03 | Should the health check endpoint be on a separate port? | low | open |
| OQ-04 | Config reload: SIGHUP only or also Unix socket API? | low | open |
| ~~OQ-03~~ | ~~Should the health check endpoint be on a separate port?~~ | ~~low~~ | **resolved** (ADR-013) |
| ~~OQ-04~~ | ~~Config reload: SIGHUP only or also Unix socket API?~~ | ~~low~~ | **resolved** (ADR-014) |
| ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) |
| OQ-06 | Should upstream timeouts be configurable per-site? | low | open |
| ~~OQ-06~~ | ~~Should upstream timeouts be configurable per-site?~~ | ~~low~~ | **resolved** (ADR-015) |
| OQ-07 | Should per-site TLS overrides be supported for mixed ACME/manual domains? | low | open |
## Document Lifecycle
@@ -63,6 +70,6 @@ See [open-questions.md](open-questions.md) for the full tracker.
| Status | Meaning | Transitions |
|--------|---------|-------------|
| `draft` | Under active development. May change significantly. | → `reviewed` when open questions are resolved |
| `reviewed` | Architecture is final. Implementation may begin. | → `stable` when implementation is complete |
| `reviewed` | Architecture is final. Implementation may begin. Changes require review. | → `stable` when implementation is complete |
| `stable` | Locked. Changes require review and may warrant an ADR. | → `deprecated` when superseded |
| `deprecated` | Superseded. Kept for reference. | Removed when no longer referenced |

View File

@@ -38,7 +38,9 @@ config.toml
│ bind_addr │ │ sites[] │
│ http_port │ │ rate_limit │
│ https_port │ │ body_limit │
tls.mode │ │ proxy_headers │
health_check_port │ │ proxy_headers │
│ admin_socket_path │ │ │
│ tls.mode │ │ ← ArcSwap → │
│ tls.acme_domains │ │ │
│ tls.cert_path │ │ ← ArcSwap → │
│ tls.key_path │ │ ConfigReloadHandle │
@@ -59,9 +61,11 @@ Immutable after startup. Changes require a process restart.
| Field | Type | Description |
|-------|------|-------------|
| `bind_addr` | `String` | IP address to bind to (must be explicit, no `0.0.0.0`) |
| `bind_addr` | `String` | IP address to bind to (must be explicit, no `0.0.0.0`; see ADR-016) |
| `http_port` | `u16` | Port for HTTP→HTTPS redirect (default: `80`; set to `0` to disable) |
| `https_port` | `u16` | Port for TLS listener (default: `443`) |
| `health_check_port` | `u16` | Port for local health check endpoint (default: `9900`; set to `0` to disable; see ADR-013) |
| `admin_socket_path` | `String` | Unix domain socket path for admin API (default: `/run/reverse-proxy/admin.sock`; empty string to disable; see ADR-014) |
| `tls.mode` | `"acme"` or `"manual"` | Certificate provisioning mode |
| `tls.acme_domains` | `Vec<String>` | Domains for ACME SAN certificate (ACME mode only) |
| `tls.acme_cache_dir` | `String` | ACME state cache directory |
@@ -95,6 +99,8 @@ connections immediately.
| `host` | `String` | Hostname to match (e.g., `"git.alk.dev"`) |
| `upstream` | `String` | Upstream address (e.g., `"127.0.0.1:3000"`) |
| `upstream_scheme` | `"http"` or `"https"` | Protocol for upstream connection (default: `"http"`) |
| `upstream_connect_timeout_secs` | `u64` | TCP connect timeout in seconds (default: `5`; see ADR-015, ADR-017) |
| `upstream_request_timeout_secs` | `u64` | Full request timeout in seconds (default: `60`; see ADR-015, ADR-017) | |
**Why these are dynamic:** See ADR-008 for the rationale. Site definitions
and rate limits are per-request concerns that should not require restarting
@@ -120,17 +126,23 @@ behind this split.
### Reload Trigger
The initial implementation uses SIGHUP as the reload trigger. When the process
receives SIGHUP:
Config reload is triggered by two mechanisms:
1. **SIGHUP**: Re-reads the config file, validates, and swaps DynamicConfig if
valid. Simple and well-understood, but provides no feedback on success or
failure.
2. **Admin socket**: The `reload` command via the admin Unix domain socket
performs the same action as SIGHUP but returns a structured response
indicating success or failure with an error message. See ADR-014 for
details.
Both mechanisms converge on the same code path:
1. Re-read the config file from disk
2. Deserialize into `DynamicConfig`
3. Validate (check upstream reachability is optional)
4. Call `ConfigReloadHandle::reload(new_config)`
Future implementations could add a Unix domain socket API or HTTP endpoint for
config reload, but SIGHUP is sufficient for Phase 1.
## TOML Config Format
```toml
@@ -140,6 +152,8 @@ config reload, but SIGHUP is sufficient for Phase 1.
bind_addr = "203.0.113.10" # Replace with actual bind address
http_port = 80
https_port = 443
health_check_port = 9900 # Local health check (0 to disable)
admin_socket_path = "/run/reverse-proxy/admin.sock" # Empty string to disable
[server.tls]
mode = "acme" # "acme" or "manual"
@@ -167,6 +181,8 @@ limit_bytes = 104857600 # 100 MB
host = "git.alk.dev"
upstream = "127.0.0.1:3000"
upstream_scheme = "http"
# upstream_connect_timeout_secs = 5 # Default: 5s
# upstream_request_timeout_secs = 60 # Default: 60s
[[sites]]
host = "alk.dev"
@@ -205,13 +221,17 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
| [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config split | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap |
| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains |
| [013](decisions/013-health-check-port.md) | Health check on separate local port | Localhost-only HTTP health check, configurable port |
| [014](decisions/014-unix-socket-reload.md) | Unix domain socket config reload API | Programmatic reload with success/failure feedback |
| [015](decisions/015-per-site-timeouts.md) | Per-site upstream timeouts with defaults | 5s connect / 60s request defaults, per-site overrides |
| [016](decisions/016-explicit-bind-address.md) | Explicit bind address required | Rejects `0.0.0.0` to prevent accidental exposure |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-04**: Should config reload support a Unix domain socket API in addition
to SIGHUP? (open)
- ~~**OQ-04**: Should config reload support a Unix domain socket API in addition
to SIGHUP?~~ (resolved — ADR-014: Unix domain socket admin API added)
- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual
domains? (open)

View File

@@ -0,0 +1,83 @@
# ADR-012: Restrict Cipher Suites to Match nginx Scope
## Status
Accepted
## Context
Our current nginx configuration explicitly restricts cipher suites to four
ECDHE-AES-GCM suites:
```
ECDHE-ECDSA-AES128-GCM-SHA256
ECDHE-RSA-AES128-GCM-SHA256
ECDHE-ECDSA-AES256-GCM-SHA384
ECDHE-RSA-AES256-GCM-SHA384
```
rustls 0.23 with the `aws_lc_rs` crypto provider defaults to a conservative
set that excludes all weak ciphers (no SHA-1, no 3DES, no RC4, no CBC-mode
suites, no RSA key exchange). The rustls defaults include these four suites
plus TLS 1.3 suites (which nginx also allows via `TLSv1.3`).
The question was whether to accept the wider rustls defaults or restrict to
the same scope as our current nginx configuration (see OQ-01).
## Decision
Explicitly restrict cipher suites to match the same scope as our current nginx
configuration: the four ECDHE-AES-GCM suites for TLS 1.2, plus all TLS 1.3
suites. This is slightly more restrictive than rustls defaults but preserves
compatibility with all modern clients.
The restricted set in rustls terms:
**TLS 1.2 (explicitly selected):**
- `TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256`
- `TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256`
- `TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384`
- `TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384`
**TLS 1.3 (all default suites):**
- `TLS_AES_128_GCM_SHA256`
- `TLS_AES_256_GCM_SHA384`
- `TLS_CHACHA20_POLY1305_SHA256`
This is configured by building a `CryptoProvider` with a custom
`cipher_suite` list and passing it to `ServerConfig::builder_with_provider()`.
## Rationale
- Maintains behavioral parity with our current nginx configuration during
migration — no client that worked with nginx should see different TLS
behavior
- The four TLS 1.2 suites are the same ones nginx allows, providing a known
security baseline
- TLS 1.3 suites are all modern and secure — restricting them provides no
meaningful security benefit and reduces compatibility
- rustls defaults include ChaCha20-Poly1305 suites for TLS 1.2 which nginx does
not; these are cryptographically sound but represent a wider scope than our
current nginx config allows
- Explicit configuration means the cipher list is documented and auditable
- If compatibility issues arise, expanding the list is straightforward; the
reverse (restricting after deployment) risks breaking existing clients
## Consequences
**Positive:**
- Behavioral parity with current nginx TLS configuration
- Explicitly auditable cipher list
- No client that currently works will see different TLS behavior
- Matches security review expectations
**Negative:**
- Slightly more restrictive than rustls defaults — excludes ChaCha20-Poly1305
for TLS 1.2 and AES-CCM suites (rarely used)
- Must update the cipher list when deprecating TLS 1.2 in the future
- Custom `CryptoProvider` construction is slightly more code than using defaults
## References
- [tls.md](../tls.md)
- OQ-01 (now resolved)

View File

@@ -0,0 +1,69 @@
# ADR-013: Health Check on Separate Local Port
## Status
Accepted
## Context
The health check endpoint (`/health`) needs to be accessible for monitoring
without requiring TLS. Currently the design places it on the main HTTPS
listener, which means:
1. TLS handshake must succeed for the health check to respond
2. External monitoring tools need to handle TLS
3. A TLS configuration error would make the health check unreachable, creating
a false-negative monitoring signal
Three options were considered (see OQ-03):
1. **Main HTTPS listener only**: Simplest, but TLS config errors make health
checks unreachable
2. **Separate unencrypted port on localhost**: Simple, works with standard
monitoring tools, but health checks bypass TLS
3. **Admin port with its own listener**: Most flexible but adds complexity
## Decision
Add a configurable health check port that binds to `127.0.0.1` only (localhost),
serving `/health` over plain HTTP. This is a separate listener from the main
HTTP and HTTPS listeners.
The port is configurable via `health_check_port` in StaticConfig. Setting it
to `0` (default) disables the separate health check listener, and `/health`
remains available on the main HTTPS listener as a fallback.
## Rationale
- A local-only health check port is the standard pattern for reverse proxies
and service meshes (envoy, haproxy, k8s health probes all use this pattern)
- Health checks should work even when TLS is misconfigured — that's the whole
point of monitoring
- Binding to `127.0.0.1` only means the health check is not exposed to the
internet — only local monitoring tools (systemd, scripts, load balancers on
the same host) can reach it
- Configurable port allows different deployment scenarios (some monitoring runs
on different ports)
- Disabling via `health_check_port = 0` keeps the main HTTPS `/health` endpoint
available for cases where a separate port isn't needed
- When this project is folded into alknet, the health check will use alknet's
existing patterns, making the separate port unnecessary in that context
## Consequences
**Positive:**
- Health checks work even when TLS is misconfigured
- Standard pattern that monitoring tools expect
- Not exposed to the internet (localhost only)
- Configurable — can be disabled if not needed
- systemd can use it for `NotifyAccess` readiness checks
**Negative:**
- Additional listener to manage (minimal complexity)
- Two health check endpoints exist when the separate port is enabled (the
local one and the HTTPS one) — monitoring should prefer the local one
## References
- [operations.md](../operations.md)
- OQ-03 (now resolved)

View File

@@ -0,0 +1,76 @@
# ADR-014: Unix Domain Socket Config Reload API
## Status
Accepted
## Context
The proxy supports config reload via SIGHUP (ADR-009). SIGHUP is simple and
well-understood, but has limitations:
1. No feedback — the sender doesn't know if the reload succeeded or failed
2. No structured input — you can only signal "reload", not specify which parts
to reload or pass validation context
3. Requires process signal permissions — not all deployment tools can send
signals
A Unix domain socket API would allow programmatic config reload with
success/failure status, enabling integration with CI/CD pipelines, admin
tools, and automated configuration management.
## Decision
Add a Unix domain socket API for config reload alongside SIGHUP. The socket
accepts commands and returns structured responses.
The socket path is configurable via `admin_socket_path` in StaticConfig
(default: `/run/reverse-proxy/admin.sock`). Setting it to empty string
disables the admin socket.
Initial commands:
- `reload` — Re-read config file, validate, and swap DynamicConfig. Returns
`{"status": "ok"}` or `{"status": "error", "message": "..."}`.
- `status` — Return basic process info (uptime, config load time, site count).
Returns `{"status": "ok", "uptime_secs": 1234, "sites": 2}`.
Future commands (not in Phase 1, but the protocol supports extension):
- `metrics` — Return Prometheus-compatible metrics
- `shutdown` — Graceful shutdown command
## Rationale
- Providing reload feedback is operationally valuable — CI/CD pipelines can
verify config changes before proceeding
- The implementation cost is low — a Unix domain socket listener is ~50 lines
of tokio code, and the command protocol is simple
- SIGHUP is retained as a fallback for environments where socket access is
inconvenient
- This pattern will integrate naturally with alknet's admin interface when the
projects merge
- Unix domain sockets are filesystem-permission-based, providing access control
without additional authentication
- The socket path is configurable, allowing deployment-specific paths
## Consequences
**Positive:**
- Config reload with success/failure feedback
- Programmatic integration with CI/CD and admin tools
- Structured response format enables automation
- SIGHUP still works as fallback
- Natural path to future admin commands
**Negative:**
- Additional listener and command parsing logic (~100-150 lines)
- Socket file management (cleanup on startup, stale socket detection)
- One more config option (`admin_socket_path`)
## References
- [operations.md](../operations.md)
- [config.md](../config.md)
- ADR-009 (signal handling — SIGHUP retained as fallback)
- OQ-04 (now resolved)

View File

@@ -0,0 +1,69 @@
# ADR-015: Per-Site Upstream Timeouts with Defaults
## Status
Accepted
## Context
The proxy forwards requests to upstream services. Connection and request
timeouts affect reliability — too short and legitimate slow responses fail,
too long and the proxy is vulnerable to resource exhaustion from stalled
connections.
Phase 1 initially specified global timeout defaults (5s connect, 60s request)
for all upstreams. However, different upstream services have different latency
profiles:
- Gitea (git.alk.dev): Git push operations can be slow for large repos; 60s
may be insufficient for clone/push operations with large pack files
- Deno/Fresh (alk.dev): Fast responses expected; 60s is generous
Global timeouts don't accommodate these differences. Per-site timeout
configuration allows tuning for each upstream without affecting others.
## Decision
Add optional per-site upstream timeout configuration to `SiteConfig`. When not
specified, sensible defaults are used.
**SiteConfig additions:**
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `upstream_connect_timeout_secs` | `u64` | `5` | TCP connection timeout in seconds |
| `upstream_request_timeout_secs` | `u64` | `60` | Full request timeout in seconds |
These are part of `DynamicConfig` (hot-reloadable via ArcSwap) since they
affect per-request behavior and should not require a restart to change.
## Rationale
- Different upstreams genuinely have different latency profiles — Gitea pushes
with large pack files need more time than a fast static site
- Defaults of 5s connect and 60s request match common reverse proxy conventions
(nginx defaults: 60s, haproxy defaults: 30s connect, 60s server)
- Making these per-site rather than global allows tuning without side effects
- Per-site overrides in DynamicConfig means timeout changes don't require
restarts
- The defaults are reasonable for most services; explicit configuration is only
needed for outliers
## Consequences
**Positive:**
- Each upstream can be tuned independently
- Defaults work for most cases — explicit configuration is optional
- Hot-reloadable (part of DynamicConfig)
- Consistent with how other reverse proxies handle timeouts
**Negative:**
- Two more fields per site in config (mitigated by sensible defaults)
- Per-site timeout means the proxy must look up per-request config for each
upstream connection (already required for routing, so no additional overhead)
## References
- [proxy.md](../proxy.md)
- [config.md](../config.md)
- OQ-06 (now resolved)

View File

@@ -0,0 +1,53 @@
# ADR-016: Explicit Bind Address Requirement
## Status
Accepted
## Context
The proxy's `bind_addr` configuration field specifies the IP address to bind
to. The default for most network services is `0.0.0.0` (bind all interfaces),
which means the service is accessible on every network interface, including
public-facing ones.
For a reverse proxy that terminates TLS and handles security-sensitive traffic,
binding to all interfaces is risky. The proxy should only listen on the
intended interface(s) — typically a specific public IP for a single-server
deployment.
## Decision
The `bind_addr` field must be an explicit IP address. `0.0.0.0` is rejected
during config validation. The proxy will not start if `bind_addr` is `0.0.0.0`.
## Rationale
- A reverse proxy terminating TLS is a security boundary — it should not be
accidentally exposed on unintended interfaces
- Single-server deployments have a specific public IP; binding to it
deliberately is the correct operational practice
- `0.0.0.0` is often a default in example configurations and can be deployed
without the operator realizing the service is accessible on all interfaces
- Rejecting it at validation time prevents this common mistake
- If a deployment genuinely needs to bind all interfaces, `0.0.0.0` can be
overridden with an explicit flag, but this should be a deliberate choice
- This matches the principle of explicit over implicit for security-sensitive
configuration
## Consequences
**Positive:**
- Prevents accidental exposure on unintended network interfaces
- Forces operators to be deliberate about which interface the proxy serves
- Config validation catches the mistake before deployment
**Negative:**
- Not suitable for deployments that genuinely need to bind all interfaces
(mitigated by explicit override if needed in the future)
- Slightly more configuration required (operator must know their public IP)
## References
- [config.md](../config.md)
- [overview.md](../overview.md)

View File

@@ -0,0 +1,66 @@
# ADR-017: Upstream Connection Defaults
## Status
Accepted
## Context
The proxy forwards requests to upstream services using a shared hyper `Client`.
The client's configuration affects all upstream connections: timeouts, redirect
handling, and connection pooling. These are significant design choices that
affect production behavior and should be traceable.
## Decision
Configure the hyper `Client` with the following defaults:
| Setting | Value | Rationale |
|---------|-------|-----------|
| Connect timeout | 5 seconds | Prevents hanging on unreachable upstreams |
| Request timeout | 60 seconds (default) | Per-site override available (ADR-015) |
| Redirect following | Disabled | Proxies should not follow redirects — they pass response status to the client |
| Connection pooling | Enabled (hyper default) | Reuses connections to the same upstream for performance |
| HTTP version | HTTP/1.1 only | Upstream connections use HTTP/1.1; HTTP/2 proxying is out of scope |
The connect timeout of 5 seconds applies to establishing a TCP connection to
the upstream. The request timeout of 60 seconds applies to the full
request/response cycle and can be overridden per-site (ADR-015).
## Rationale
- **5s connect timeout**: Matches common reverse proxy conventions. Long
enough for remote upstreams with network latency, short enough to fail fast
on unreachable services.
- **60s request timeout**: Generous default that accommodates Gitea push
operations with large pack files. Per-site overrides allow tuning for faster
upstreams.
- **No redirect following**: Standard reverse proxy behavior. If an upstream
returns a 3xx, the proxy streams it to the client. Following redirects would
break the client's expectation and could create security issues (redirect
loops, credential forwarding).
- **Connection pooling**: Hyper's default connection reuse improves performance
for repeated requests to the same upstream without additional configuration.
- **HTTP/1.1 only**: The proxy communicates with upstreams over plain HTTP
(loopback). HTTP/2 proxying is explicitly out of scope (see overview.md).
Upstreams needing HTTP/2+ run their own servers (e.g., api.alk.dev).
## Consequences
**Positive:**
- Explicit, documented defaults that match reverse proxy conventions
- Per-site timeout overrides available for specific needs
- No redirect following prevents common proxy pitfalls
- Connection pooling provides good performance by default
**Negative:**
- 60s default may be too long for fast upstreams (mitigated by per-site
overrides in ADR-015)
- HTTP/1.1 only for upstream means no HTTP/2 multiplexing to backends (out of
scope per project design)
## References
- [proxy.md](../proxy.md)
- ADR-015 (per-site upstream timeouts)
- ADR-002 (custom proxy handler)

View File

@@ -0,0 +1,53 @@
# ADR-018: Request Body Size Limit
## Status
Accepted
## Context
The proxy enforces a maximum request body size to protect against resource
exhaustion attacks. The default limit must balance security with usability.
Gitea push operations can involve large Git pack files. The current nginx
configuration uses `client_max_body_size 100m`, and Gitea's documentation
recommends allowing up to 100 MB for push operations.
## Decision
Set the default request body size limit to 100 MB (104,857,600 bytes),
matching our current nginx configuration. The limit is global in Phase 1
(configurable via `body.limit_bytes` in DynamicConfig).
## Rationale
- 100 MB matches the current nginx `client_max_body_size 100m`, ensuring
behavioral parity during migration
- Gitea push operations with large repositories regularly exceed 50 MB
- 100 MB is large enough for any legitimate Git operation while still
providing protection against resource exhaustion (a 100 MB body is not
enough to exhaust memory on modern servers, but prevents unbounded uploads)
- The limit is configurable — operators can reduce it for deployments that
don't need large uploads
- In Phase 2, per-site limits will allow different limits for different
upstreams (e.g., a lower limit for alk.dev, the current limit for
git.alk.dev)
## Consequences
**Positive:**
- Behavioral parity with current nginx configuration
- Gitea push operations work without configuration changes
- Configurable for deployments with different needs
**Negative:**
- 100 MB is a generous default — some deployments may want a lower limit
(mitigated by configurability)
- Global limit means all sites share the same maximum (mitigated by Phase 2
per-site limits)
## References
- [proxy.md](../proxy.md)
- [config.md](../config.md)
- nginx `client_max_body_size` documentation

View File

@@ -7,19 +7,15 @@ last_updated: 2026-06-11
## TLS
### OQ-01: Should cipher suites be restricted beyond rustls defaults?
### ~~OQ-01: Should cipher suites be restricted beyond rustls defaults?~~
- **Origin**: [tls.md](tls.md)
- **Status**: open
- **Status**: resolved
- **Priority**: medium
- **Context**: Our current nginx config explicitly restricts cipher suites to
four ECDHE-AES-GCM suites. rustls 0.23 with `aws_lc_rs` defaults to a
conservative set that excludes all weak ciphers (no SHA-1, no 3DES, no RC4,
no CBC-mode suites, no RSA key exchange). The defaults include TLS 1.3 suites
which nginx also allows. Restricting further would reduce compatibility with
older clients; not restricting means accepting a wider (but still safe) set
than the current nginx config.
- **Cross-references**: ADR-005
- **Resolution**: Restrict cipher suites to match the nginx scope: four
ECDHE-AES-GCM suites for TLS 1.2 plus all TLS 1.3 suites. This provides
behavioral parity during migration. See ADR-012.
- **Cross-references**: ADR-005, ADR-012
### ~~OQ-02: What log format should fail2ban consume?~~
@@ -47,31 +43,28 @@ last_updated: 2026-06-11
## Logging and Monitoring
### OQ-03: Should the health check endpoint be on a separate port?
### ~~OQ-03: Should the health check endpoint be on a separate port?~~
- **Origin**: [operations.md](operations.md)
- **Status**: open
- **Status**: resolved
- **Priority**: low
- **Context**: Currently the health check is on the main HTTPS listener at
`/health`. Alternatives: (a) separate unencrypted port for health checks
(simpler for load balancers but less secure), (b) admin port with its own
listener (more complex but isolates operational traffic), (c) on the main
listener (simplest, proposed approach). For a single-server deployment behind
no external load balancer, the main listener is fine.
- **Cross-references**: None
- **Resolution**: Add a configurable local health check port (default: 9900)
bound to `127.0.0.1` only. Health checks work even when TLS is misconfigured.
The main HTTPS `/health` endpoint remains available as a fallback. See
ADR-013.
- **Cross-references**: ADR-013
## Configuration
### OQ-04: Should config reload support a Unix domain socket API in addition to SIGHUP?
### ~~OQ-04: Should config reload support a Unix domain socket API in addition to SIGHUP?~~
- **Origin**: [config.md](config.md)
- **Status**: open
- **Status**: resolved
- **Priority**: low
- **Context**: Phase 1 uses SIGHUP for config reload, which is simple and proven.
A Unix domain socket API would allow programmatic reload (e.g., from an admin
tool or CI/CD pipeline) and could return success/failure status. This adds
complexity and is not needed for Phase 1.
- **Cross-references**: None
- **Resolution**: Yes. Add a Unix domain socket admin API alongside SIGHUP.
The socket accepts a `reload` command and returns structured success/failure
responses. SIGHUP is retained as a fallback. See ADR-014.
- **Cross-references**: ADR-014
## Deployment
@@ -84,17 +77,16 @@ last_updated: 2026-06-11
explicit IP address (not `0.0.0.0`). Multi-address binding is not needed for
this single-server deployment. If needed in the future, `bind_addr` could be
extended to an array. See config.md for the `bind_addr` field.
- **Cross-references**: None
- **Cross-references**: ADR-016
## Proxy
### OQ-06: Should upstream timeouts be configurable per-site?
### ~~OQ-06: Should upstream timeouts be configurable per-site?~~
- **Origin**: [proxy.md](proxy.md)
- **Status**: open
- **Status**: resolved
- **Priority**: low
- **Context**: Phase 1 uses global defaults (5s connect timeout, 60s request
timeout) for all upstream connections. Per-site timeout configuration would
allow tuning for different upstream services (e.g., a slow database-backed
API vs. a fast static site). Not needed for Phase 1 with a single upstream.
- **Cross-references**: None
- **Resolution**: Yes. Per-site upstream timeouts with sensible defaults (5s
connect, 60s request). Optional fields in SiteConfig that override global
defaults when specified. See ADR-015.
- **Cross-references**: ADR-015, ADR-017

View File

@@ -109,23 +109,30 @@ Configurable via `log_level` in StaticConfig.
## Health Check
### Endpoint
### Local Health Check Port
The primary health check endpoint is served on a separate local port (default:
9900), bound to `127.0.0.1` only. This ensures health checks work even when TLS
is misconfigured. See ADR-013 for the rationale.
```
GET /health → 200 OK (empty body)
GET http://127.0.0.1:9900/health → 200 OK (empty body)
```
The health check endpoint is accessible on the main HTTPS listener. It returns
200 if the process is alive and serving requests.
The port is configurable via `health_check_port` in StaticConfig. Setting it
to `0` disables the separate health check listener.
**Limitation**: Since `/health` is served over TLS, it cannot detect TLS
configuration errors that prevent the TLS handshake from completing. External
monitoring should also check TCP connectivity to port 443 independently.
### HTTPS Health Check (Fallback)
When the local health check port is enabled, `/health` is also available on the
main HTTPS listener for cases where TLS-level health verification is desired.
External monitoring should prefer the local health check for liveness checks
and can use the HTTPS endpoint for TLS verification.
### What It Checks
- Process is running and the tokio runtime is responsive
- TLS listener is accepting connections
- TLS listener is accepting connections (HTTPS endpoint only)
- Config is loaded (StaticConfig and DynamicConfig are initialized)
It does **not** check upstream reachability. The health check answers "is the
@@ -180,13 +187,22 @@ The proxy handles three signals via `signal-hook` (see [ADR-009](decisions/009-s
- **SIGTERM / SIGINT**: Graceful shutdown. Stop accepting new connections, wait
for in-flight requests to complete (up to a configurable timeout), then exit.
- **SIGHUP**: Config reload. Re-read the config file, validate, and swap
DynamicConfig if valid.
DynamicConfig if valid. No feedback on success or failure.
- **Admin socket reload**: Send `reload` command via the Unix domain socket
(default: `/run/reverse-proxy/admin.sock`). Returns structured response
indicating success or failure. See ADR-014 for details.
### SIGHUP for Config Reload
SIGHUP triggers config reload (see [config.md](config.md) for details). The
process does not exit on SIGHUP.
### Admin Socket for Config Reload
The admin Unix domain socket provides programmatic config reload with feedback.
This is useful for CI/CD pipelines and automation tools. See ADR-014 for the
command protocol.
### Timeout
In-flight requests have a configurable shutdown timeout (default: 30 seconds).
@@ -242,10 +258,13 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
| [006](decisions/006-rate-limiting-approach.md) | Token bucket rate limiting | In-memory per-IP token bucket matching nginx burst semantics |
| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
| [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP |
| [013](decisions/013-health-check-port.md) | Health check on separate local port | Localhost-only HTTP health check, configurable port |
| [014](decisions/014-unix-socket-reload.md) | Unix domain socket config reload API | Programmatic reload with success/failure feedback |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-03**: Should the health check endpoint be on a separate port? (open)
- ~~**OQ-03**: Should the health check endpoint be on a separate port?~~ (resolved
— ADR-013: separate local port, default 9900, localhost only)

View File

@@ -40,22 +40,24 @@ details.
- **Phase 1**: Multi-site reverse proxy with TLS termination
- TLS termination with ACME (Let's Encrypt) multi-domain certificate management
- Manual certificate paths as fallback mode
- Cipher suite restriction matching nginx scope (ECDHE-AES-GCM + TLS 1.3)
- HTTP → HTTPS redirect
- Host-based routing to multiple upstream services
- Reverse proxy to Gitea at `127.0.0.1:3000` (git.alk.dev)
- Reverse proxy to Deno/Fresh container for alk.dev (simple pass-through)
- Proxy header injection (Host, X-Real-IP, X-Forwarded-For, X-Forwarded-Proto)
- Per-site upstream timeouts with sensible defaults (5s connect, 60s request)
- Request rate limiting with fail2ban-compatible logging (global per-IP)
- 100 MB body size limit (global)
- Configurable bind address (no `0.0.0.0` default)
- Health check endpoint
- Configurable bind address (must be explicit, no `0.0.0.0`)
- Local health check endpoint on separate port (default: 9900, localhost only)
- Unix domain socket admin API for config reload with feedback
- Graceful shutdown (SIGTERM handling)
- Systemd unit file
- Dual licensing: MIT OR Apache-2.0
- **Phase 2**: Operational hardening
- Per-site rate limits and body limits
- Per-site upstream timeouts
- Metrics endpoint (Prometheus-compatible)
- Connection limits and timeouts
- Log rotation
@@ -63,7 +65,6 @@ details.
- **Phase 3**: Future enhancements
- Wildcard subdomain support
- Per-site TLS overrides (manual certs for specific domains)
- Unix domain socket config reload API
### Out of Scope
@@ -168,12 +169,19 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
| [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP |
| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release; avoids config migration later |
| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme |
| [012](decisions/012-cipher-suite-restriction.md) | Restrict cipher suites | Match nginx scope: ECDHE-AES-GCM for TLS 1.2, all TLS 1.3 |
| [013](decisions/013-health-check-port.md) | Health check on separate local port | Localhost-only HTTP health check, configurable port |
| [014](decisions/014-unix-socket-reload.md) | Unix domain socket config reload API | Programmatic reload with success/failure feedback |
| [015](decisions/015-per-site-timeouts.md) | Per-site upstream timeouts with defaults | 5s connect / 60s request defaults, per-site overrides |
| [016](decisions/016-explicit-bind-address.md) | Explicit bind address required | Rejects `0.0.0.0` to prevent accidental exposure |
| [017](decisions/017-upstream-connection-defaults.md) | Upstream connection defaults | HTTP/1.1, no redirects, connection pooling |
| [018](decisions/018-body-size-limit.md) | Request body size limit | 100 MB default matching nginx, Gitea push compatibility |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
- **OQ-03**: Should the health check endpoint be on a separate port? (open)
- ~~**OQ-01**: Should cipher suites be restricted beyond rustls defaults?~~ (resolved — ADR-012)
- ~~**OQ-03**: Should the health check endpoint be on a separate port?~~ (resolved — ADR-013)
- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual domains? (open)

View File

@@ -109,12 +109,15 @@ The proxy handler constructs a new request to the upstream:
5. Stream the response back to the client
The hyper Client is created once at startup and shared via axum's `State`. It
must be configured with:
must be configured with (see ADR-017 for rationale):
- Connection pooling (hyper default behavior)
- Connect timeout: 5 seconds
- Request timeout: 60 seconds
- HTTP/1.1 only for upstream connections (HTTP/2 proxying is out of scope)
- No redirect following (proxies should not follow redirects)
Per-site timeout overrides are available via `upstream_connect_timeout_secs`
and `upstream_request_timeout_secs` in `SiteConfig` (see ADR-015). When not
specified, defaults of 5s connect and 60s request are used.
### 4. Error Handling
| Upstream Condition | Response | Notes |
@@ -147,10 +150,11 @@ between the proxy and backend services on loopback is unnecessary.
## Body Size Limit
axum's `DefaultBodyLimit` layer sets the maximum request body size. For
compatibility with Gitea's push operations (large pack files), this defaults
to 100 MB. In Phase 1, the body limit is a global setting; Phase 2 may add
per-site body limits.
axum's `DefaultBodyLimit` layer sets the maximum request body size. The default
of 100 MB (104,857,600 bytes) matches our current nginx configuration and
accommodates Gitea's push operations with large pack files (see ADR-018). In
Phase 1, the body limit is a global setting; Phase 2 may add per-site body
limits.
## Design Decisions
@@ -161,11 +165,14 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
| [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | One upstream per domain — simpler than a general proxy library |
| [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban |
| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
| [015](decisions/015-per-site-timeouts.md) | Per-site upstream timeouts with defaults | 5s connect / 60s request defaults, per-site overrides |
| [017](decisions/017-upstream-connection-defaults.md) | Upstream connection defaults | HTTP/1.1, no redirects, connection pooling |
| [018](decisions/018-body-size-limit.md) | Request body size limit | 100 MB default matching nginx, Gitea push compatibility |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-06**: Should upstream timeouts be configurable per-site? (open — Phase 1
uses global defaults of 5s connect, 60s request)
- ~~**OQ-06**: Should upstream timeouts be configurable per-site?~~ (resolved —
ADR-015: per-site timeout overrides with defaults)

View File

@@ -115,25 +115,26 @@ regression if defaults change in future rustls releases.
### Cipher Suites
rustls 0.23 with the `aws_lc_rs` crypto provider defaults to a conservative
cipher suite selection that excludes all weak ciphers (no SHA-1, no 3DES, no
RC4, no CBC-mode suites, no RSA key exchange).
Cipher suites are explicitly restricted to match the scope of our current nginx
configuration. See ADR-012 for the full rationale.
The current nginx config explicitly restricts to:
**TLS 1.2 (explicitly selected):**
```
ECDHE-ECDSA-AES128-GCM-SHA256
ECDHE-RSA-AES128-GCM-SHA256
ECDHE-ECDSA-AES256-GCM-SHA384
ECDHE-RSA-AES256-GCM-SHA384
```
- `TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256`
- `TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256`
- `TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384`
- `TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384`
rustls's defaults include these plus TLS 1.3 suites (which nginx's config
also allows via `TLSv1.3`). The default rustls cipher list is a strict subset
of what browsers accept.
**TLS 1.3 (all default suites):**
See [open-questions.md](open-questions.md) OQ-01 for whether to further
restrict cipher suites beyond rustls defaults.
- `TLS_AES_128_GCM_SHA256`
- `TLS_AES_256_GCM_SHA384`
- `TLS_CHACHA20_POLY1305_SHA256`
This is configured by building a `CryptoProvider` with a custom `cipher_suite`
list and passing it to `ServerConfig::builder_with_provider()`. The cipher
list matches our current nginx configuration's scope, providing behavioral
parity during migration.
### ServerConfig Construction
@@ -223,12 +224,13 @@ All design decisions are documented as ADRs in [decisions/](decisions/).
| [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly | Full control over TLS config and ACME resolver integration |
| [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release |
| [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme |
| [012](decisions/012-cipher-suite-restriction.md) | Restrict cipher suites | Match nginx scope: four ECDHE-AES-GCM suites for TLS 1.2, all TLS 1.3 suites |
## Open Questions
Open questions are tracked in [open-questions.md](open-questions.md). Key
questions affecting this document:
- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open)
- ~~**OQ-01**: Should cipher suites be restricted beyond rustls defaults?~~ (resolved — ADR-012: restrict to nginx scope)
- **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual
domains? (open)