diff --git a/docs/architecture/README.md b/docs/architecture/README.md index effd806..c99a76c 100644 --- a/docs/architecture/README.md +++ b/docs/architecture/README.md @@ -43,6 +43,13 @@ certificate via ACME. | [009](decisions/009-signal-handling.md) | Signal Handling Strategy | Accepted | | [010](decisions/010-multi-site-phase1.md) | Multi-Site Support in Phase 1 | Accepted | | [011](decisions/011-multi-domain-tls.md) | Multi-Domain TLS Configuration | Accepted | +| [012](decisions/012-cipher-suite-restriction.md) | Restrict Cipher Suites to nginx Scope | Accepted | +| [013](decisions/013-health-check-port.md) | Health Check on Separate Local Port | Accepted | +| [014](decisions/014-unix-socket-reload.md) | Unix Domain Socket Config Reload API | Accepted | +| [015](decisions/015-per-site-timeouts.md) | Per-Site Upstream Timeouts with Defaults | Accepted | +| [016](decisions/016-explicit-bind-address.md) | Explicit Bind Address Requirement | Accepted | +| [017](decisions/017-upstream-connection-defaults.md) | Upstream Connection Defaults | Accepted | +| [018](decisions/018-body-size-limit.md) | Request Body Size Limit | Accepted | ## Open Questions @@ -50,12 +57,12 @@ See [open-questions.md](open-questions.md) for the full tracker. | OQ | Question | Priority | Status | |----|----------|----------|--------| -| OQ-01 | Should cipher suites be restricted beyond rustls defaults? | medium | open | +| ~~OQ-01~~ | ~~Should cipher suites be restricted beyond rustls defaults?~~ | ~~medium~~ | **resolved** (ADR-012) | | ~~OQ-02~~ | ~~What log format should fail2ban consume?~~ | ~~high~~ | **resolved** (ADR-007) | -| OQ-03 | Should the health check endpoint be on a separate port? | low | open | -| OQ-04 | Config reload: SIGHUP only or also Unix socket API? | low | open | +| ~~OQ-03~~ | ~~Should the health check endpoint be on a separate port?~~ | ~~low~~ | **resolved** (ADR-013) | +| ~~OQ-04~~ | ~~Config reload: SIGHUP only or also Unix socket API?~~ | ~~low~~ | **resolved** (ADR-014) | | ~~OQ-05~~ | ~~Should the proxy bind to multiple addresses?~~ | ~~low~~ | **resolved** (single bind_addr sufficient) | -| OQ-06 | Should upstream timeouts be configurable per-site? | low | open | +| ~~OQ-06~~ | ~~Should upstream timeouts be configurable per-site?~~ | ~~low~~ | **resolved** (ADR-015) | | OQ-07 | Should per-site TLS overrides be supported for mixed ACME/manual domains? | low | open | ## Document Lifecycle @@ -63,6 +70,6 @@ See [open-questions.md](open-questions.md) for the full tracker. | Status | Meaning | Transitions | |--------|---------|-------------| | `draft` | Under active development. May change significantly. | → `reviewed` when open questions are resolved | -| `reviewed` | Architecture is final. Implementation may begin. | → `stable` when implementation is complete | +| `reviewed` | Architecture is final. Implementation may begin. Changes require review. | → `stable` when implementation is complete | | `stable` | Locked. Changes require review and may warrant an ADR. | → `deprecated` when superseded | | `deprecated` | Superseded. Kept for reference. | Removed when no longer referenced | \ No newline at end of file diff --git a/docs/architecture/config.md b/docs/architecture/config.md index 4609a3d..6d3d4c0 100644 --- a/docs/architecture/config.md +++ b/docs/architecture/config.md @@ -38,7 +38,9 @@ config.toml │ bind_addr │ │ sites[] │ │ http_port │ │ rate_limit │ │ https_port │ │ body_limit │ -│ tls.mode │ │ proxy_headers │ +│ health_check_port │ │ proxy_headers │ +│ admin_socket_path │ │ │ +│ tls.mode │ │ ← ArcSwap → │ │ tls.acme_domains │ │ │ │ tls.cert_path │ │ ← ArcSwap → │ │ tls.key_path │ │ ConfigReloadHandle │ @@ -59,9 +61,11 @@ Immutable after startup. Changes require a process restart. | Field | Type | Description | |-------|------|-------------| -| `bind_addr` | `String` | IP address to bind to (must be explicit, no `0.0.0.0`) | +| `bind_addr` | `String` | IP address to bind to (must be explicit, no `0.0.0.0`; see ADR-016) | | `http_port` | `u16` | Port for HTTP→HTTPS redirect (default: `80`; set to `0` to disable) | | `https_port` | `u16` | Port for TLS listener (default: `443`) | +| `health_check_port` | `u16` | Port for local health check endpoint (default: `9900`; set to `0` to disable; see ADR-013) | +| `admin_socket_path` | `String` | Unix domain socket path for admin API (default: `/run/reverse-proxy/admin.sock`; empty string to disable; see ADR-014) | | `tls.mode` | `"acme"` or `"manual"` | Certificate provisioning mode | | `tls.acme_domains` | `Vec` | Domains for ACME SAN certificate (ACME mode only) | | `tls.acme_cache_dir` | `String` | ACME state cache directory | @@ -95,6 +99,8 @@ connections immediately. | `host` | `String` | Hostname to match (e.g., `"git.alk.dev"`) | | `upstream` | `String` | Upstream address (e.g., `"127.0.0.1:3000"`) | | `upstream_scheme` | `"http"` or `"https"` | Protocol for upstream connection (default: `"http"`) | +| `upstream_connect_timeout_secs` | `u64` | TCP connect timeout in seconds (default: `5`; see ADR-015, ADR-017) | +| `upstream_request_timeout_secs` | `u64` | Full request timeout in seconds (default: `60`; see ADR-015, ADR-017) | | **Why these are dynamic:** See ADR-008 for the rationale. Site definitions and rate limits are per-request concerns that should not require restarting @@ -120,17 +126,23 @@ behind this split. ### Reload Trigger -The initial implementation uses SIGHUP as the reload trigger. When the process -receives SIGHUP: +Config reload is triggered by two mechanisms: +1. **SIGHUP**: Re-reads the config file, validates, and swaps DynamicConfig if + valid. Simple and well-understood, but provides no feedback on success or + failure. + +2. **Admin socket**: The `reload` command via the admin Unix domain socket + performs the same action as SIGHUP but returns a structured response + indicating success or failure with an error message. See ADR-014 for + details. + +Both mechanisms converge on the same code path: 1. Re-read the config file from disk 2. Deserialize into `DynamicConfig` 3. Validate (check upstream reachability is optional) 4. Call `ConfigReloadHandle::reload(new_config)` -Future implementations could add a Unix domain socket API or HTTP endpoint for -config reload, but SIGHUP is sufficient for Phase 1. - ## TOML Config Format ```toml @@ -140,6 +152,8 @@ config reload, but SIGHUP is sufficient for Phase 1. bind_addr = "203.0.113.10" # Replace with actual bind address http_port = 80 https_port = 443 +health_check_port = 9900 # Local health check (0 to disable) +admin_socket_path = "/run/reverse-proxy/admin.sock" # Empty string to disable [server.tls] mode = "acme" # "acme" or "manual" @@ -167,6 +181,8 @@ limit_bytes = 104857600 # 100 MB host = "git.alk.dev" upstream = "127.0.0.1:3000" upstream_scheme = "http" +# upstream_connect_timeout_secs = 5 # Default: 5s +# upstream_request_timeout_secs = 60 # Default: 60s [[sites]] host = "alk.dev" @@ -205,13 +221,17 @@ All design decisions are documented as ADRs in [decisions/](decisions/). | [008](decisions/008-static-dynamic-config-split.md) | Static/dynamic config split | Immutable StaticConfig, hot-reloadable DynamicConfig via ArcSwap | | [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release | | [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains | +| [013](decisions/013-health-check-port.md) | Health check on separate local port | Localhost-only HTTP health check, configurable port | +| [014](decisions/014-unix-socket-reload.md) | Unix domain socket config reload API | Programmatic reload with success/failure feedback | +| [015](decisions/015-per-site-timeouts.md) | Per-site upstream timeouts with defaults | 5s connect / 60s request defaults, per-site overrides | +| [016](decisions/016-explicit-bind-address.md) | Explicit bind address required | Rejects `0.0.0.0` to prevent accidental exposure | ## Open Questions Open questions are tracked in [open-questions.md](open-questions.md). Key questions affecting this document: -- **OQ-04**: Should config reload support a Unix domain socket API in addition - to SIGHUP? (open) +- ~~**OQ-04**: Should config reload support a Unix domain socket API in addition + to SIGHUP?~~ (resolved — ADR-014: Unix domain socket admin API added) - **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual domains? (open) \ No newline at end of file diff --git a/docs/architecture/decisions/012-cipher-suite-restriction.md b/docs/architecture/decisions/012-cipher-suite-restriction.md new file mode 100644 index 0000000..2e1e784 --- /dev/null +++ b/docs/architecture/decisions/012-cipher-suite-restriction.md @@ -0,0 +1,83 @@ +# ADR-012: Restrict Cipher Suites to Match nginx Scope + +## Status + +Accepted + +## Context + +Our current nginx configuration explicitly restricts cipher suites to four +ECDHE-AES-GCM suites: + +``` +ECDHE-ECDSA-AES128-GCM-SHA256 +ECDHE-RSA-AES128-GCM-SHA256 +ECDHE-ECDSA-AES256-GCM-SHA384 +ECDHE-RSA-AES256-GCM-SHA384 +``` + +rustls 0.23 with the `aws_lc_rs` crypto provider defaults to a conservative +set that excludes all weak ciphers (no SHA-1, no 3DES, no RC4, no CBC-mode +suites, no RSA key exchange). The rustls defaults include these four suites +plus TLS 1.3 suites (which nginx also allows via `TLSv1.3`). + +The question was whether to accept the wider rustls defaults or restrict to +the same scope as our current nginx configuration (see OQ-01). + +## Decision + +Explicitly restrict cipher suites to match the same scope as our current nginx +configuration: the four ECDHE-AES-GCM suites for TLS 1.2, plus all TLS 1.3 +suites. This is slightly more restrictive than rustls defaults but preserves +compatibility with all modern clients. + +The restricted set in rustls terms: + +**TLS 1.2 (explicitly selected):** +- `TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256` +- `TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256` +- `TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384` +- `TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384` + +**TLS 1.3 (all default suites):** +- `TLS_AES_128_GCM_SHA256` +- `TLS_AES_256_GCM_SHA384` +- `TLS_CHACHA20_POLY1305_SHA256` + +This is configured by building a `CryptoProvider` with a custom +`cipher_suite` list and passing it to `ServerConfig::builder_with_provider()`. + +## Rationale + +- Maintains behavioral parity with our current nginx configuration during + migration — no client that worked with nginx should see different TLS + behavior +- The four TLS 1.2 suites are the same ones nginx allows, providing a known + security baseline +- TLS 1.3 suites are all modern and secure — restricting them provides no + meaningful security benefit and reduces compatibility +- rustls defaults include ChaCha20-Poly1305 suites for TLS 1.2 which nginx does + not; these are cryptographically sound but represent a wider scope than our + current nginx config allows +- Explicit configuration means the cipher list is documented and auditable +- If compatibility issues arise, expanding the list is straightforward; the + reverse (restricting after deployment) risks breaking existing clients + +## Consequences + +**Positive:** +- Behavioral parity with current nginx TLS configuration +- Explicitly auditable cipher list +- No client that currently works will see different TLS behavior +- Matches security review expectations + +**Negative:** +- Slightly more restrictive than rustls defaults — excludes ChaCha20-Poly1305 + for TLS 1.2 and AES-CCM suites (rarely used) +- Must update the cipher list when deprecating TLS 1.2 in the future +- Custom `CryptoProvider` construction is slightly more code than using defaults + +## References + +- [tls.md](../tls.md) +- OQ-01 (now resolved) \ No newline at end of file diff --git a/docs/architecture/decisions/013-health-check-port.md b/docs/architecture/decisions/013-health-check-port.md new file mode 100644 index 0000000..83124d1 --- /dev/null +++ b/docs/architecture/decisions/013-health-check-port.md @@ -0,0 +1,69 @@ +# ADR-013: Health Check on Separate Local Port + +## Status + +Accepted + +## Context + +The health check endpoint (`/health`) needs to be accessible for monitoring +without requiring TLS. Currently the design places it on the main HTTPS +listener, which means: + +1. TLS handshake must succeed for the health check to respond +2. External monitoring tools need to handle TLS +3. A TLS configuration error would make the health check unreachable, creating + a false-negative monitoring signal + +Three options were considered (see OQ-03): + +1. **Main HTTPS listener only**: Simplest, but TLS config errors make health + checks unreachable +2. **Separate unencrypted port on localhost**: Simple, works with standard + monitoring tools, but health checks bypass TLS +3. **Admin port with its own listener**: Most flexible but adds complexity + +## Decision + +Add a configurable health check port that binds to `127.0.0.1` only (localhost), +serving `/health` over plain HTTP. This is a separate listener from the main +HTTP and HTTPS listeners. + +The port is configurable via `health_check_port` in StaticConfig. Setting it +to `0` (default) disables the separate health check listener, and `/health` +remains available on the main HTTPS listener as a fallback. + +## Rationale + +- A local-only health check port is the standard pattern for reverse proxies + and service meshes (envoy, haproxy, k8s health probes all use this pattern) +- Health checks should work even when TLS is misconfigured — that's the whole + point of monitoring +- Binding to `127.0.0.1` only means the health check is not exposed to the + internet — only local monitoring tools (systemd, scripts, load balancers on + the same host) can reach it +- Configurable port allows different deployment scenarios (some monitoring runs + on different ports) +- Disabling via `health_check_port = 0` keeps the main HTTPS `/health` endpoint + available for cases where a separate port isn't needed +- When this project is folded into alknet, the health check will use alknet's + existing patterns, making the separate port unnecessary in that context + +## Consequences + +**Positive:** +- Health checks work even when TLS is misconfigured +- Standard pattern that monitoring tools expect +- Not exposed to the internet (localhost only) +- Configurable — can be disabled if not needed +- systemd can use it for `NotifyAccess` readiness checks + +**Negative:** +- Additional listener to manage (minimal complexity) +- Two health check endpoints exist when the separate port is enabled (the + local one and the HTTPS one) — monitoring should prefer the local one + +## References + +- [operations.md](../operations.md) +- OQ-03 (now resolved) \ No newline at end of file diff --git a/docs/architecture/decisions/014-unix-socket-reload.md b/docs/architecture/decisions/014-unix-socket-reload.md new file mode 100644 index 0000000..21abe15 --- /dev/null +++ b/docs/architecture/decisions/014-unix-socket-reload.md @@ -0,0 +1,76 @@ +# ADR-014: Unix Domain Socket Config Reload API + +## Status + +Accepted + +## Context + +The proxy supports config reload via SIGHUP (ADR-009). SIGHUP is simple and +well-understood, but has limitations: + +1. No feedback — the sender doesn't know if the reload succeeded or failed +2. No structured input — you can only signal "reload", not specify which parts + to reload or pass validation context +3. Requires process signal permissions — not all deployment tools can send + signals + +A Unix domain socket API would allow programmatic config reload with +success/failure status, enabling integration with CI/CD pipelines, admin +tools, and automated configuration management. + +## Decision + +Add a Unix domain socket API for config reload alongside SIGHUP. The socket +accepts commands and returns structured responses. + +The socket path is configurable via `admin_socket_path` in StaticConfig +(default: `/run/reverse-proxy/admin.sock`). Setting it to empty string +disables the admin socket. + +Initial commands: + +- `reload` — Re-read config file, validate, and swap DynamicConfig. Returns + `{"status": "ok"}` or `{"status": "error", "message": "..."}`. +- `status` — Return basic process info (uptime, config load time, site count). + Returns `{"status": "ok", "uptime_secs": 1234, "sites": 2}`. + +Future commands (not in Phase 1, but the protocol supports extension): + +- `metrics` — Return Prometheus-compatible metrics +- `shutdown` — Graceful shutdown command + +## Rationale + +- Providing reload feedback is operationally valuable — CI/CD pipelines can + verify config changes before proceeding +- The implementation cost is low — a Unix domain socket listener is ~50 lines + of tokio code, and the command protocol is simple +- SIGHUP is retained as a fallback for environments where socket access is + inconvenient +- This pattern will integrate naturally with alknet's admin interface when the + projects merge +- Unix domain sockets are filesystem-permission-based, providing access control + without additional authentication +- The socket path is configurable, allowing deployment-specific paths + +## Consequences + +**Positive:** +- Config reload with success/failure feedback +- Programmatic integration with CI/CD and admin tools +- Structured response format enables automation +- SIGHUP still works as fallback +- Natural path to future admin commands + +**Negative:** +- Additional listener and command parsing logic (~100-150 lines) +- Socket file management (cleanup on startup, stale socket detection) +- One more config option (`admin_socket_path`) + +## References + +- [operations.md](../operations.md) +- [config.md](../config.md) +- ADR-009 (signal handling — SIGHUP retained as fallback) +- OQ-04 (now resolved) \ No newline at end of file diff --git a/docs/architecture/decisions/015-per-site-timeouts.md b/docs/architecture/decisions/015-per-site-timeouts.md new file mode 100644 index 0000000..80cb94e --- /dev/null +++ b/docs/architecture/decisions/015-per-site-timeouts.md @@ -0,0 +1,69 @@ +# ADR-015: Per-Site Upstream Timeouts with Defaults + +## Status + +Accepted + +## Context + +The proxy forwards requests to upstream services. Connection and request +timeouts affect reliability — too short and legitimate slow responses fail, +too long and the proxy is vulnerable to resource exhaustion from stalled +connections. + +Phase 1 initially specified global timeout defaults (5s connect, 60s request) +for all upstreams. However, different upstream services have different latency +profiles: + +- Gitea (git.alk.dev): Git push operations can be slow for large repos; 60s + may be insufficient for clone/push operations with large pack files +- Deno/Fresh (alk.dev): Fast responses expected; 60s is generous + +Global timeouts don't accommodate these differences. Per-site timeout +configuration allows tuning for each upstream without affecting others. + +## Decision + +Add optional per-site upstream timeout configuration to `SiteConfig`. When not +specified, sensible defaults are used. + +**SiteConfig additions:** + +| Field | Type | Default | Description | +|-------|------|---------|-------------| +| `upstream_connect_timeout_secs` | `u64` | `5` | TCP connection timeout in seconds | +| `upstream_request_timeout_secs` | `u64` | `60` | Full request timeout in seconds | + +These are part of `DynamicConfig` (hot-reloadable via ArcSwap) since they +affect per-request behavior and should not require a restart to change. + +## Rationale + +- Different upstreams genuinely have different latency profiles — Gitea pushes + with large pack files need more time than a fast static site +- Defaults of 5s connect and 60s request match common reverse proxy conventions + (nginx defaults: 60s, haproxy defaults: 30s connect, 60s server) +- Making these per-site rather than global allows tuning without side effects +- Per-site overrides in DynamicConfig means timeout changes don't require + restarts +- The defaults are reasonable for most services; explicit configuration is only + needed for outliers + +## Consequences + +**Positive:** +- Each upstream can be tuned independently +- Defaults work for most cases — explicit configuration is optional +- Hot-reloadable (part of DynamicConfig) +- Consistent with how other reverse proxies handle timeouts + +**Negative:** +- Two more fields per site in config (mitigated by sensible defaults) +- Per-site timeout means the proxy must look up per-request config for each + upstream connection (already required for routing, so no additional overhead) + +## References + +- [proxy.md](../proxy.md) +- [config.md](../config.md) +- OQ-06 (now resolved) \ No newline at end of file diff --git a/docs/architecture/decisions/016-explicit-bind-address.md b/docs/architecture/decisions/016-explicit-bind-address.md new file mode 100644 index 0000000..68cd21f --- /dev/null +++ b/docs/architecture/decisions/016-explicit-bind-address.md @@ -0,0 +1,53 @@ +# ADR-016: Explicit Bind Address Requirement + +## Status + +Accepted + +## Context + +The proxy's `bind_addr` configuration field specifies the IP address to bind +to. The default for most network services is `0.0.0.0` (bind all interfaces), +which means the service is accessible on every network interface, including +public-facing ones. + +For a reverse proxy that terminates TLS and handles security-sensitive traffic, +binding to all interfaces is risky. The proxy should only listen on the +intended interface(s) — typically a specific public IP for a single-server +deployment. + +## Decision + +The `bind_addr` field must be an explicit IP address. `0.0.0.0` is rejected +during config validation. The proxy will not start if `bind_addr` is `0.0.0.0`. + +## Rationale + +- A reverse proxy terminating TLS is a security boundary — it should not be + accidentally exposed on unintended interfaces +- Single-server deployments have a specific public IP; binding to it + deliberately is the correct operational practice +- `0.0.0.0` is often a default in example configurations and can be deployed + without the operator realizing the service is accessible on all interfaces +- Rejecting it at validation time prevents this common mistake +- If a deployment genuinely needs to bind all interfaces, `0.0.0.0` can be + overridden with an explicit flag, but this should be a deliberate choice +- This matches the principle of explicit over implicit for security-sensitive + configuration + +## Consequences + +**Positive:** +- Prevents accidental exposure on unintended network interfaces +- Forces operators to be deliberate about which interface the proxy serves +- Config validation catches the mistake before deployment + +**Negative:** +- Not suitable for deployments that genuinely need to bind all interfaces + (mitigated by explicit override if needed in the future) +- Slightly more configuration required (operator must know their public IP) + +## References + +- [config.md](../config.md) +- [overview.md](../overview.md) \ No newline at end of file diff --git a/docs/architecture/decisions/017-upstream-connection-defaults.md b/docs/architecture/decisions/017-upstream-connection-defaults.md new file mode 100644 index 0000000..44f74ea --- /dev/null +++ b/docs/architecture/decisions/017-upstream-connection-defaults.md @@ -0,0 +1,66 @@ +# ADR-017: Upstream Connection Defaults + +## Status + +Accepted + +## Context + +The proxy forwards requests to upstream services using a shared hyper `Client`. +The client's configuration affects all upstream connections: timeouts, redirect +handling, and connection pooling. These are significant design choices that +affect production behavior and should be traceable. + +## Decision + +Configure the hyper `Client` with the following defaults: + +| Setting | Value | Rationale | +|---------|-------|-----------| +| Connect timeout | 5 seconds | Prevents hanging on unreachable upstreams | +| Request timeout | 60 seconds (default) | Per-site override available (ADR-015) | +| Redirect following | Disabled | Proxies should not follow redirects — they pass response status to the client | +| Connection pooling | Enabled (hyper default) | Reuses connections to the same upstream for performance | +| HTTP version | HTTP/1.1 only | Upstream connections use HTTP/1.1; HTTP/2 proxying is out of scope | + +The connect timeout of 5 seconds applies to establishing a TCP connection to +the upstream. The request timeout of 60 seconds applies to the full +request/response cycle and can be overridden per-site (ADR-015). + +## Rationale + +- **5s connect timeout**: Matches common reverse proxy conventions. Long + enough for remote upstreams with network latency, short enough to fail fast + on unreachable services. +- **60s request timeout**: Generous default that accommodates Gitea push + operations with large pack files. Per-site overrides allow tuning for faster + upstreams. +- **No redirect following**: Standard reverse proxy behavior. If an upstream + returns a 3xx, the proxy streams it to the client. Following redirects would + break the client's expectation and could create security issues (redirect + loops, credential forwarding). +- **Connection pooling**: Hyper's default connection reuse improves performance + for repeated requests to the same upstream without additional configuration. +- **HTTP/1.1 only**: The proxy communicates with upstreams over plain HTTP + (loopback). HTTP/2 proxying is explicitly out of scope (see overview.md). + Upstreams needing HTTP/2+ run their own servers (e.g., api.alk.dev). + +## Consequences + +**Positive:** +- Explicit, documented defaults that match reverse proxy conventions +- Per-site timeout overrides available for specific needs +- No redirect following prevents common proxy pitfalls +- Connection pooling provides good performance by default + +**Negative:** +- 60s default may be too long for fast upstreams (mitigated by per-site + overrides in ADR-015) +- HTTP/1.1 only for upstream means no HTTP/2 multiplexing to backends (out of + scope per project design) + +## References + +- [proxy.md](../proxy.md) +- ADR-015 (per-site upstream timeouts) +- ADR-002 (custom proxy handler) \ No newline at end of file diff --git a/docs/architecture/decisions/018-body-size-limit.md b/docs/architecture/decisions/018-body-size-limit.md new file mode 100644 index 0000000..84e437b --- /dev/null +++ b/docs/architecture/decisions/018-body-size-limit.md @@ -0,0 +1,53 @@ +# ADR-018: Request Body Size Limit + +## Status + +Accepted + +## Context + +The proxy enforces a maximum request body size to protect against resource +exhaustion attacks. The default limit must balance security with usability. + +Gitea push operations can involve large Git pack files. The current nginx +configuration uses `client_max_body_size 100m`, and Gitea's documentation +recommends allowing up to 100 MB for push operations. + +## Decision + +Set the default request body size limit to 100 MB (104,857,600 bytes), +matching our current nginx configuration. The limit is global in Phase 1 +(configurable via `body.limit_bytes` in DynamicConfig). + +## Rationale + +- 100 MB matches the current nginx `client_max_body_size 100m`, ensuring + behavioral parity during migration +- Gitea push operations with large repositories regularly exceed 50 MB +- 100 MB is large enough for any legitimate Git operation while still + providing protection against resource exhaustion (a 100 MB body is not + enough to exhaust memory on modern servers, but prevents unbounded uploads) +- The limit is configurable — operators can reduce it for deployments that + don't need large uploads +- In Phase 2, per-site limits will allow different limits for different + upstreams (e.g., a lower limit for alk.dev, the current limit for + git.alk.dev) + +## Consequences + +**Positive:** +- Behavioral parity with current nginx configuration +- Gitea push operations work without configuration changes +- Configurable for deployments with different needs + +**Negative:** +- 100 MB is a generous default — some deployments may want a lower limit + (mitigated by configurability) +- Global limit means all sites share the same maximum (mitigated by Phase 2 + per-site limits) + +## References + +- [proxy.md](../proxy.md) +- [config.md](../config.md) +- nginx `client_max_body_size` documentation \ No newline at end of file diff --git a/docs/architecture/open-questions.md b/docs/architecture/open-questions.md index 6b6fc2a..90a7fa6 100644 --- a/docs/architecture/open-questions.md +++ b/docs/architecture/open-questions.md @@ -7,19 +7,15 @@ last_updated: 2026-06-11 ## TLS -### OQ-01: Should cipher suites be restricted beyond rustls defaults? +### ~~OQ-01: Should cipher suites be restricted beyond rustls defaults?~~ - **Origin**: [tls.md](tls.md) -- **Status**: open +- **Status**: resolved - **Priority**: medium -- **Context**: Our current nginx config explicitly restricts cipher suites to - four ECDHE-AES-GCM suites. rustls 0.23 with `aws_lc_rs` defaults to a - conservative set that excludes all weak ciphers (no SHA-1, no 3DES, no RC4, - no CBC-mode suites, no RSA key exchange). The defaults include TLS 1.3 suites - which nginx also allows. Restricting further would reduce compatibility with - older clients; not restricting means accepting a wider (but still safe) set - than the current nginx config. -- **Cross-references**: ADR-005 +- **Resolution**: Restrict cipher suites to match the nginx scope: four + ECDHE-AES-GCM suites for TLS 1.2 plus all TLS 1.3 suites. This provides + behavioral parity during migration. See ADR-012. +- **Cross-references**: ADR-005, ADR-012 ### ~~OQ-02: What log format should fail2ban consume?~~ @@ -47,31 +43,28 @@ last_updated: 2026-06-11 ## Logging and Monitoring -### OQ-03: Should the health check endpoint be on a separate port? +### ~~OQ-03: Should the health check endpoint be on a separate port?~~ - **Origin**: [operations.md](operations.md) -- **Status**: open +- **Status**: resolved - **Priority**: low -- **Context**: Currently the health check is on the main HTTPS listener at - `/health`. Alternatives: (a) separate unencrypted port for health checks - (simpler for load balancers but less secure), (b) admin port with its own - listener (more complex but isolates operational traffic), (c) on the main - listener (simplest, proposed approach). For a single-server deployment behind - no external load balancer, the main listener is fine. -- **Cross-references**: None +- **Resolution**: Add a configurable local health check port (default: 9900) + bound to `127.0.0.1` only. Health checks work even when TLS is misconfigured. + The main HTTPS `/health` endpoint remains available as a fallback. See + ADR-013. +- **Cross-references**: ADR-013 ## Configuration -### OQ-04: Should config reload support a Unix domain socket API in addition to SIGHUP? +### ~~OQ-04: Should config reload support a Unix domain socket API in addition to SIGHUP?~~ - **Origin**: [config.md](config.md) -- **Status**: open +- **Status**: resolved - **Priority**: low -- **Context**: Phase 1 uses SIGHUP for config reload, which is simple and proven. - A Unix domain socket API would allow programmatic reload (e.g., from an admin - tool or CI/CD pipeline) and could return success/failure status. This adds - complexity and is not needed for Phase 1. -- **Cross-references**: None +- **Resolution**: Yes. Add a Unix domain socket admin API alongside SIGHUP. + The socket accepts a `reload` command and returns structured success/failure + responses. SIGHUP is retained as a fallback. See ADR-014. +- **Cross-references**: ADR-014 ## Deployment @@ -84,17 +77,16 @@ last_updated: 2026-06-11 explicit IP address (not `0.0.0.0`). Multi-address binding is not needed for this single-server deployment. If needed in the future, `bind_addr` could be extended to an array. See config.md for the `bind_addr` field. -- **Cross-references**: None +- **Cross-references**: ADR-016 ## Proxy -### OQ-06: Should upstream timeouts be configurable per-site? +### ~~OQ-06: Should upstream timeouts be configurable per-site?~~ - **Origin**: [proxy.md](proxy.md) -- **Status**: open +- **Status**: resolved - **Priority**: low -- **Context**: Phase 1 uses global defaults (5s connect timeout, 60s request - timeout) for all upstream connections. Per-site timeout configuration would - allow tuning for different upstream services (e.g., a slow database-backed - API vs. a fast static site). Not needed for Phase 1 with a single upstream. -- **Cross-references**: None \ No newline at end of file +- **Resolution**: Yes. Per-site upstream timeouts with sensible defaults (5s + connect, 60s request). Optional fields in SiteConfig that override global + defaults when specified. See ADR-015. +- **Cross-references**: ADR-015, ADR-017 \ No newline at end of file diff --git a/docs/architecture/operations.md b/docs/architecture/operations.md index 625039b..17735a0 100644 --- a/docs/architecture/operations.md +++ b/docs/architecture/operations.md @@ -109,23 +109,30 @@ Configurable via `log_level` in StaticConfig. ## Health Check -### Endpoint +### Local Health Check Port + +The primary health check endpoint is served on a separate local port (default: +9900), bound to `127.0.0.1` only. This ensures health checks work even when TLS +is misconfigured. See ADR-013 for the rationale. ``` -GET /health → 200 OK (empty body) +GET http://127.0.0.1:9900/health → 200 OK (empty body) ``` -The health check endpoint is accessible on the main HTTPS listener. It returns -200 if the process is alive and serving requests. +The port is configurable via `health_check_port` in StaticConfig. Setting it +to `0` disables the separate health check listener. -**Limitation**: Since `/health` is served over TLS, it cannot detect TLS -configuration errors that prevent the TLS handshake from completing. External -monitoring should also check TCP connectivity to port 443 independently. +### HTTPS Health Check (Fallback) + +When the local health check port is enabled, `/health` is also available on the +main HTTPS listener for cases where TLS-level health verification is desired. +External monitoring should prefer the local health check for liveness checks +and can use the HTTPS endpoint for TLS verification. ### What It Checks - Process is running and the tokio runtime is responsive -- TLS listener is accepting connections +- TLS listener is accepting connections (HTTPS endpoint only) - Config is loaded (StaticConfig and DynamicConfig are initialized) It does **not** check upstream reachability. The health check answers "is the @@ -180,13 +187,22 @@ The proxy handles three signals via `signal-hook` (see [ADR-009](decisions/009-s - **SIGTERM / SIGINT**: Graceful shutdown. Stop accepting new connections, wait for in-flight requests to complete (up to a configurable timeout), then exit. - **SIGHUP**: Config reload. Re-read the config file, validate, and swap - DynamicConfig if valid. + DynamicConfig if valid. No feedback on success or failure. +- **Admin socket reload**: Send `reload` command via the Unix domain socket + (default: `/run/reverse-proxy/admin.sock`). Returns structured response + indicating success or failure. See ADR-014 for details. ### SIGHUP for Config Reload SIGHUP triggers config reload (see [config.md](config.md) for details). The process does not exit on SIGHUP. +### Admin Socket for Config Reload + +The admin Unix domain socket provides programmatic config reload with feedback. +This is useful for CI/CD pipelines and automation tools. See ADR-014 for the +command protocol. + ### Timeout In-flight requests have a configurable shutdown timeout (default: 30 seconds). @@ -242,10 +258,13 @@ All design decisions are documented as ADRs in [decisions/](decisions/). | [006](decisions/006-rate-limiting-approach.md) | Token bucket rate limiting | In-memory per-IP token bucket matching nginx burst semantics | | [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban | | [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP | +| [013](decisions/013-health-check-port.md) | Health check on separate local port | Localhost-only HTTP health check, configurable port | +| [014](decisions/014-unix-socket-reload.md) | Unix domain socket config reload API | Programmatic reload with success/failure feedback | ## Open Questions Open questions are tracked in [open-questions.md](open-questions.md). Key questions affecting this document: -- **OQ-03**: Should the health check endpoint be on a separate port? (open) \ No newline at end of file +- ~~**OQ-03**: Should the health check endpoint be on a separate port?~~ (resolved + — ADR-013: separate local port, default 9900, localhost only) \ No newline at end of file diff --git a/docs/architecture/overview.md b/docs/architecture/overview.md index ba70f33..316fe6a 100644 --- a/docs/architecture/overview.md +++ b/docs/architecture/overview.md @@ -40,22 +40,24 @@ details. - **Phase 1**: Multi-site reverse proxy with TLS termination - TLS termination with ACME (Let's Encrypt) multi-domain certificate management - Manual certificate paths as fallback mode + - Cipher suite restriction matching nginx scope (ECDHE-AES-GCM + TLS 1.3) - HTTP → HTTPS redirect - Host-based routing to multiple upstream services - Reverse proxy to Gitea at `127.0.0.1:3000` (git.alk.dev) - Reverse proxy to Deno/Fresh container for alk.dev (simple pass-through) - Proxy header injection (Host, X-Real-IP, X-Forwarded-For, X-Forwarded-Proto) + - Per-site upstream timeouts with sensible defaults (5s connect, 60s request) - Request rate limiting with fail2ban-compatible logging (global per-IP) - 100 MB body size limit (global) - - Configurable bind address (no `0.0.0.0` default) - - Health check endpoint + - Configurable bind address (must be explicit, no `0.0.0.0`) + - Local health check endpoint on separate port (default: 9900, localhost only) + - Unix domain socket admin API for config reload with feedback - Graceful shutdown (SIGTERM handling) - Systemd unit file - Dual licensing: MIT OR Apache-2.0 - **Phase 2**: Operational hardening - Per-site rate limits and body limits - - Per-site upstream timeouts - Metrics endpoint (Prometheus-compatible) - Connection limits and timeouts - Log rotation @@ -63,7 +65,6 @@ details. - **Phase 3**: Future enhancements - Wildcard subdomain support - Per-site TLS overrides (manual certs for specific domains) - - Unix domain socket config reload API ### Out of Scope @@ -168,12 +169,19 @@ All design decisions are documented as ADRs in [decisions/](decisions/). | [009](decisions/009-signal-handling.md) | Signal handling strategy | signal-hook for SIGTERM/SIGINT/SIGHUP | | [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release; avoids config migration later | | [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme | +| [012](decisions/012-cipher-suite-restriction.md) | Restrict cipher suites | Match nginx scope: ECDHE-AES-GCM for TLS 1.2, all TLS 1.3 | +| [013](decisions/013-health-check-port.md) | Health check on separate local port | Localhost-only HTTP health check, configurable port | +| [014](decisions/014-unix-socket-reload.md) | Unix domain socket config reload API | Programmatic reload with success/failure feedback | +| [015](decisions/015-per-site-timeouts.md) | Per-site upstream timeouts with defaults | 5s connect / 60s request defaults, per-site overrides | +| [016](decisions/016-explicit-bind-address.md) | Explicit bind address required | Rejects `0.0.0.0` to prevent accidental exposure | +| [017](decisions/017-upstream-connection-defaults.md) | Upstream connection defaults | HTTP/1.1, no redirects, connection pooling | +| [018](decisions/018-body-size-limit.md) | Request body size limit | 100 MB default matching nginx, Gitea push compatibility | ## Open Questions Open questions are tracked in [open-questions.md](open-questions.md). Key questions affecting this document: -- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open) -- **OQ-03**: Should the health check endpoint be on a separate port? (open) +- ~~**OQ-01**: Should cipher suites be restricted beyond rustls defaults?~~ (resolved — ADR-012) +- ~~**OQ-03**: Should the health check endpoint be on a separate port?~~ (resolved — ADR-013) - **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual domains? (open) \ No newline at end of file diff --git a/docs/architecture/proxy.md b/docs/architecture/proxy.md index 15bbae9..4369bc0 100644 --- a/docs/architecture/proxy.md +++ b/docs/architecture/proxy.md @@ -109,12 +109,15 @@ The proxy handler constructs a new request to the upstream: 5. Stream the response back to the client The hyper Client is created once at startup and shared via axum's `State`. It -must be configured with: +must be configured with (see ADR-017 for rationale): - Connection pooling (hyper default behavior) -- Connect timeout: 5 seconds -- Request timeout: 60 seconds +- HTTP/1.1 only for upstream connections (HTTP/2 proxying is out of scope) - No redirect following (proxies should not follow redirects) +Per-site timeout overrides are available via `upstream_connect_timeout_secs` +and `upstream_request_timeout_secs` in `SiteConfig` (see ADR-015). When not +specified, defaults of 5s connect and 60s request are used. + ### 4. Error Handling | Upstream Condition | Response | Notes | @@ -147,10 +150,11 @@ between the proxy and backend services on loopback is unnecessary. ## Body Size Limit -axum's `DefaultBodyLimit` layer sets the maximum request body size. For -compatibility with Gitea's push operations (large pack files), this defaults -to 100 MB. In Phase 1, the body limit is a global setting; Phase 2 may add -per-site body limits. +axum's `DefaultBodyLimit` layer sets the maximum request body size. The default +of 100 MB (104,857,600 bytes) matches our current nginx configuration and +accommodates Gitea's push operations with large pack files (see ADR-018). In +Phase 1, the body limit is a global setting; Phase 2 may add per-site body +limits. ## Design Decisions @@ -161,11 +165,14 @@ All design decisions are documented as ADRs in [decisions/](decisions/). | [002](decisions/002-custom-proxy-handler.md) | Custom proxy handler | One upstream per domain — simpler than a general proxy library | | [007](decisions/007-custom-log-format.md) | Custom structured log format | key=value pairs with RATE_LIMIT prefix for fail2ban | | [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release | +| [015](decisions/015-per-site-timeouts.md) | Per-site upstream timeouts with defaults | 5s connect / 60s request defaults, per-site overrides | +| [017](decisions/017-upstream-connection-defaults.md) | Upstream connection defaults | HTTP/1.1, no redirects, connection pooling | +| [018](decisions/018-body-size-limit.md) | Request body size limit | 100 MB default matching nginx, Gitea push compatibility | ## Open Questions Open questions are tracked in [open-questions.md](open-questions.md). Key questions affecting this document: -- **OQ-06**: Should upstream timeouts be configurable per-site? (open — Phase 1 - uses global defaults of 5s connect, 60s request) \ No newline at end of file +- ~~**OQ-06**: Should upstream timeouts be configurable per-site?~~ (resolved — + ADR-015: per-site timeout overrides with defaults) \ No newline at end of file diff --git a/docs/architecture/tls.md b/docs/architecture/tls.md index 3d97658..b1fc3da 100644 --- a/docs/architecture/tls.md +++ b/docs/architecture/tls.md @@ -115,25 +115,26 @@ regression if defaults change in future rustls releases. ### Cipher Suites -rustls 0.23 with the `aws_lc_rs` crypto provider defaults to a conservative -cipher suite selection that excludes all weak ciphers (no SHA-1, no 3DES, no -RC4, no CBC-mode suites, no RSA key exchange). +Cipher suites are explicitly restricted to match the scope of our current nginx +configuration. See ADR-012 for the full rationale. -The current nginx config explicitly restricts to: +**TLS 1.2 (explicitly selected):** -``` -ECDHE-ECDSA-AES128-GCM-SHA256 -ECDHE-RSA-AES128-GCM-SHA256 -ECDHE-ECDSA-AES256-GCM-SHA384 -ECDHE-RSA-AES256-GCM-SHA384 -``` +- `TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256` +- `TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256` +- `TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384` +- `TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384` -rustls's defaults include these plus TLS 1.3 suites (which nginx's config -also allows via `TLSv1.3`). The default rustls cipher list is a strict subset -of what browsers accept. +**TLS 1.3 (all default suites):** -See [open-questions.md](open-questions.md) OQ-01 for whether to further -restrict cipher suites beyond rustls defaults. +- `TLS_AES_128_GCM_SHA256` +- `TLS_AES_256_GCM_SHA384` +- `TLS_CHACHA20_POLY1305_SHA256` + +This is configured by building a `CryptoProvider` with a custom `cipher_suite` +list and passing it to `ServerConfig::builder_with_provider()`. The cipher +list matches our current nginx configuration's scope, providing behavioral +parity during migration. ### ServerConfig Construction @@ -223,12 +224,13 @@ All design decisions are documented as ADRs in [decisions/](decisions/). | [005](decisions/005-tokio-rustls-direct.md) | tokio-rustls directly | Full control over TLS config and ACME resolver integration | | [010](decisions/010-multi-site-phase1.md) | Multi-site in Phase 1 | Multiple domains from initial release | | [011](decisions/011-multi-domain-tls.md) | Multi-domain TLS config | Single SAN certificate covering all domains via rustls-acme | +| [012](decisions/012-cipher-suite-restriction.md) | Restrict cipher suites | Match nginx scope: four ECDHE-AES-GCM suites for TLS 1.2, all TLS 1.3 suites | ## Open Questions Open questions are tracked in [open-questions.md](open-questions.md). Key questions affecting this document: -- **OQ-01**: Should cipher suites be restricted beyond rustls defaults? (open) +- ~~**OQ-01**: Should cipher suites be restricted beyond rustls defaults?~~ (resolved — ADR-012: restrict to nginx scope) - **OQ-07**: Should per-site TLS overrides be supported for mixed ACME/manual domains? (open) \ No newline at end of file