Decompose architecture into 23 atomic tasks across 7 parallel generations

Task graph covers all Phase 1 concerns: config system, TLS termination, proxy handler, operations (rate limiting, logging, health check, admin socket, signals, shutdown, body size limit), deployment artifacts, and two review checkpoints. No circular dependencies. Critical path length of 7. Risk distribution: 3 high-risk (ACME, TLS listener setup, startup orchestration), 7 medium, 11 low, 2 trivial.
2026-06-11 11:21:10 +00:00
parent ceb59ad9b9
commit 309878c561
23 changed files with 1676 additions and 0 deletions
--- a/tasks/ops/admin-socket.md
+++ b/tasks/ops/admin-socket.md
@@ -0,0 +1,74 @@
+---
+id: ops/admin-socket
+name: Implement Unix domain socket admin API for config reload with feedback and status
+status: pending
+depends_on: [config/dynamic-config]
+scope: moderate
+risk: medium
+impact: component
+level: implementation
+---
+
+## Description
+
+Implement the Unix domain socket admin API for programmatic config reload with success/failure feedback. This is an alternative to SIGHUP that provides structured responses.
+
+### Protocol
+
+- **Connection lifecycle**: One command per connection. Client connects, sends one newline-terminated command, receives one newline-terminated JSON response, then the server closes the connection.
+- **Message framing**: Newline-delimited (`\n`). Responses end with `\n`.
+
+### Commands
+
+- `reload` — Re-read config file, validate, and swap DynamicConfig. Returns:
+  - Success: `{"status": "ok"}`
+  - Failure: `{"status": "error", "message": "..."}`
+- `status` — Return basic process info. Returns:
+  - `{"status": "ok", "uptime_secs": 1234, "sites": 2}`
+
+### Error Responses
+
+- Unrecognized commands: `{"status": "error", "message": "unknown command: <cmd>"}`
+- Invalid or empty input: `{"status": "error", "message": "invalid input"}`
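+
+A minimal sketch of how one received line could be mapped to the commands and error responses above. The names `AdminCommand`, `parse_command`, and `error_json` are hypothetical, and treating control characters as invalid input is an assumption: the spec does not define "invalid" beyond empty input.
+
+```rust
+/// Commands accepted on the admin socket (one per connection).
+#[derive(Debug, PartialEq)]
+enum AdminCommand {
+    Reload,
+    Status,
+}
+
+/// Build the JSON error body (without the trailing newline).
+fn error_json(message: &str) -> String {
+    // Escape backslashes and quotes so the message stays valid JSON.
+    let escaped = message.replace('\\', "\\\\").replace('"', "\\\"");
+    format!(r#"{{"status": "error", "message": "{escaped}"}}"#)
+}
+
+/// Parse one newline-terminated line into a command.
+/// On failure, `Err` carries the JSON error response body.
+fn parse_command(line: &str) -> Result<AdminCommand, String> {
+    let cmd = line.trim();
+    // Assumption: empty input or embedded control characters are "invalid input".
+    if cmd.is_empty() || cmd.chars().any(char::is_control) {
+        return Err(error_json("invalid input"));
+    }
+    match cmd {
+        "reload" => Ok(AdminCommand::Reload),
+        "status" => Ok(AdminCommand::Status),
+        other => Err(error_json(&format!("unknown command: {other}"))),
+    }
+}
+```
+
+The caller would append `\n` to whichever response it writes and then close the connection.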
+
+### Socket Lifecycle
+
+- Socket path from `admin_socket_path` config (default: `/run/reverse-proxy/admin.sock`)
+- Empty string disables the admin socket
+- If a socket file already exists at startup and nothing is listening on it (stale), remove it before binding
+- If the socket file exists and another process is listening, log a warning and disable the admin socket (but continue starting)
+
+### Concurrency
+
+- Multiple clients can connect simultaneously
+- Reload operations are serialized via the same `tokio::sync::Mutex` used by SIGHUP reload
+- If a reload is in progress, subsequent reload requests wait, then re-read the config file (getting the latest version)
+
+## Acceptance Criteria
+
+- [ ] Unix domain socket bound at `admin_socket_path`
+- [ ] `reload` command triggers config reload and returns structured JSON response
+- [ ] `status` command returns process uptime and site count
+- [ ] Unknown commands return `{"status": "error", "message": "unknown command: ..."}`
+- [ ] Empty/invalid input returns `{"status": "error", "message": "invalid input"}`
+- [ ] One command per connection, server closes connection after response
+- [ ] Stale socket file removed at startup
+- [ ] If socket file exists and is active (another process is listening), log a warning, disable the admin socket, and continue starting
+- [ ] `admin_socket_path = ""` disables admin socket
+- [ ] Reload operations serialized with same Mutex as SIGHUP reload
+- [ ] Integration test: connect to socket, send `reload`, receive JSON response
+- [ ] Integration test: connect to socket, send `status`, receive JSON response
+
+## References
+
+- docs/architecture/operations.md — admin socket section
+- docs/architecture/decisions/014-unix-socket-reload.md — admin socket rationale
+- docs/architecture/config.md — reload serialization
+
+## Notes
+
+> The admin socket and SIGHUP converge on the same reload code path. The only difference is that the admin socket returns a structured response while SIGHUP provides no feedback.
+
+## Summary
+
+> To be filled on completion
--- a/tasks/ops/body-size-limit.md
+++ b/tasks/ops/body-size-limit.md
@@ -0,0 +1,52 @@
+---
+id: ops/body-size-limit
+name: Implement global request body size limit with axum DefaultBodyLimit middleware
+status: pending
+depends_on: [config/dynamic-config]
+scope: single
+risk: trivial
+impact: isolated
+level: implementation
+---
+
+## Description
+
+Implement the global request body size limit using axum's `DefaultBodyLimit` middleware. The default limit is 100 MB (104,857,600 bytes, that is 100 × 1024 × 1024), matching the current nginx configuration and accommodating Gitea's push operations with large pack files (ADR-018).
+
+### Implementation
+
+- Set `DefaultBodyLimit::max(body_limit_bytes)` as axum middleware
+- `body_limit_bytes` comes from `DynamicConfig`, so it can be changed at runtime via config reload
+- When the limit is exceeded, axum returns `413 Payload Too Large` with `Payload Too Large` body
+- In Phase 1, the limit is global (not per-site)
+
+### Config Reload
+
+Since `body_limit_bytes` is in `DynamicConfig`, it updates on config reload. However, axum's `DefaultBodyLimit` is typically set as a layer at router construction time. The implementation needs to ensure the current limit is read from `DynamicConfig` on each request, not cached at router construction time.
+
+This may require a custom middleware that reads `DynamicConfig` via `ArcSwap` on each request, rather than relying solely on axum's `DefaultBodyLimit`.
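+
+A rough sketch of the "read the limit on every request" idea, using only the standard library. `RwLock<Arc<...>>` stands in for `ArcSwap` here purely so the example is self-contained; `SharedConfig` and `check_content_length` are hypothetical names, and checking only a declared `Content-Length` is an assumption (streamed bodies would also need their bytes counted).
+
+```rust
+use std::sync::{Arc, RwLock};
+
+/// Only the field relevant to this task is shown.
+struct DynamicConfig {
+    body_limit_bytes: u64,
+}
+
+/// Stand-in for an `ArcSwap<DynamicConfig>`: load a snapshot, or swap in a new one.
+struct SharedConfig(RwLock<Arc<DynamicConfig>>);
+
+impl SharedConfig {
+    fn new(cfg: DynamicConfig) -> Self {
+        Self(RwLock::new(Arc::new(cfg)))
+    }
+
+    /// Cheap snapshot of the current config.
+    fn load(&self) -> Arc<DynamicConfig> {
+        let guard = self.0.read().unwrap();
+        Arc::clone(&*guard)
+    }
+
+    /// Called by config reload.
+    fn store(&self, cfg: DynamicConfig) {
+        *self.0.write().unwrap() = Arc::new(cfg);
+    }
+}
+
+/// Per-request check: the limit is read at call time, never cached.
+/// Returns `Err(413)` when the declared length exceeds the current limit.
+fn check_content_length(config: &SharedConfig, content_length: u64) -> Result<(), u16> {
+    let current = config.load();
+    if content_length > current.body_limit_bytes {
+        Err(413)
+    } else {
+        Ok(())
+    }
+}
+```
+
+Because the snapshot is taken inside the check, a reload that calls `store` takes effect on the next request without rebuilding the router.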
+
+## Acceptance Criteria
+
+- [ ] Body size limit enforced on all proxied requests
+- [ ] Default: 100 MB (104,857,600 bytes)
+- [ ] 413 Payload Too Large response when limit exceeded
+- [ ] Limit is configurable via `DynamicConfig`
+- [ ] Limit can be changed at runtime via config reload
+- [ ] Config value is read from ArcSwap on each request (not cached)
+- [ ] Integration test: request with body > limit receives 413
+- [ ] Integration test: request with body < limit succeeds
+
+## References
+
+- docs/architecture/proxy.md — body size limit section
+- docs/architecture/config.md — DynamicConfig, body_limit_bytes
+- docs/architecture/decisions/018-body-size-limit.md — 100 MB default rationale
+
+## Notes
+
+> The implementation agent should investigate whether axum's `DefaultBodyLimit` can be dynamically updated, or if a custom middleware reading from ArcSwap is needed. The important thing is that config reload changes the limit without restarting.
+
+## Summary
+
+> To be filled on completion
--- a/tasks/ops/health-check.md
+++ b/tasks/ops/health-check.md
@@ -0,0 +1,57 @@
+---
+id: ops/health-check
+name: Implement health check endpoint on separate local port and HTTPS fallback
+status: pending
+depends_on: [config/static-config]
+scope: narrow
+risk: low
+impact: component
+level: implementation
+---
+
+## Description
+
+Implement the health check endpoint on a separate local port (default: 9900, bound to `127.0.0.1` only) and as a fallback on the HTTPS listener.
+
+### Local Health Check Port
+
+- Binds to `127.0.0.1:{health_check_port}`
+- `GET /health` returns `200 OK` with empty body
+- `health_check_port = 0` disables the separate listener
+- Port must not conflict with any listener's `http_port` or `https_port` on `127.0.0.1` (validated in config validation)
+
+### HTTPS Health Check Fallback
+
+When the local health check port is enabled, `/health` is also available on the HTTPS listener(s) for TLS-level health verification. External monitoring should prefer the local health check for liveness and can use the HTTPS endpoint for TLS verification.
+
+### What Health Check Verifies
+
+- Process is running and tokio runtime is responsive
+- TLS listener is accepting connections (HTTPS endpoint only)
+- Config is loaded (StaticConfig and DynamicConfig are initialized)
+
+It does **NOT** check upstream reachability. The health check answers "is the proxy process healthy?", not "is the upstream reachable?"
+
+## Acceptance Criteria
+
+- [ ] Local health check binds to `127.0.0.1:{health_check_port}` only
+- [ ] `GET /health` returns `200 OK` with empty body
+- [ ] `health_check_port = 0` disables the listener
+- [ ] Port conflict detection in config validation
+- [ ] `/health` available on HTTPS listener(s) as fallback
+- [ ] Health check does not verify upstream reachability
+- [ ] Integration test: local health check responds 200
+- [ ] Integration test: HTTPS health check responds 200
+
+## References
+
+- docs/architecture/operations.md — health check section
+- docs/architecture/decisions/013-health-check-port.md — separate local port rationale
+
+## Notes
+
+> To be filled by implementation agent
+
+## Summary
+
+> To be filled on completion
--- a/tasks/ops/logging.md
+++ b/tasks/ops/logging.md
@@ -0,0 +1,89 @@
+---
+id: ops/logging
+name: Implement structured logging with tracing, file output, and fail2ban-compatible format
+status: pending
+depends_on: [setup/project-init]
+scope: moderate
+risk: low
+impact: component
+level: implementation
+---
+
+## Description
+
+Implement structured logging using `tracing` and `tracing-subscriber` with dual output (file + stdout) and fail2ban-compatible log format.
+
+### Log Types
+
+1. **Access logs** (every proxied request, `info` level):
+   ```
+   REQUEST client_ip=203.0.113.50 host=git.alk.dev method=GET path=/user/repo status=200 upstream=127.0.0.1:3000 duration_ms=45
+   ```
+
+2. **Event logs** (rate limits, TLS errors, upstream failures, config reloads):
+   ```
+   RATE_LIMIT client_ip=203.0.113.50 host=git.alk.dev path=/login status=429
+   UPSTREAM_ERROR host=git.alk.dev upstream=127.0.0.1:3000 error="connection refused"
+   CONFIG_RELOAD status=success sites=1
+   ```
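+
+The `PREFIX key=value ...` shape shown above could be produced by a small helper like the following sketch. `format_event` is a hypothetical name, and the quoting rule (quote a value only when it is empty or contains whitespace or a double quote) is an assumption inferred from the `error="connection refused"` example.
+
+```rust
+/// Render one log line as `PREFIX key=value key=value ...`.
+fn format_event(prefix: &str, fields: &[(&str, &str)]) -> String {
+    let mut line = String::from(prefix);
+    for (key, value) in fields {
+        line.push(' ');
+        line.push_str(key);
+        line.push('=');
+        // Assumed rule: quote values that a whitespace-splitting parser would break on.
+        let needs_quotes =
+            value.is_empty() || value.contains(|c: char| c.is_whitespace() || c == '"');
+        if needs_quotes {
+            line.push('"');
+            line.push_str(&value.replace('\\', "\\\\").replace('"', "\\\""));
+            line.push('"');
+        } else {
+            line.push_str(value);
+        }
+    }
+    line
+}
+```
+
+Keeping the field order fixed per event type makes the matching fail2ban regular expression simpler to write.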
+
+### Dual Output
+
+- **File** (primary): Written to `log_file_path` when configured. This is the authoritative source for fail2ban.
+- **stdout/stderr** (always-on): For `docker logs`, `journalctl`, and development.
+
+Use `tracing-subscriber` `Layer` composition to write to both simultaneously.
+
+### Log Levels
+
+| Level | Use |
+|-------|-----|
+| `error` | Unrecoverable failures (TLS handshake failure, config validation) |
+| `warn` | Rate limit exceeded, upstream unreachable, upstream timeout |
+| `info` | Access logs, config reloads, ACME events, startup/shutdown |
+| `debug` | Request/response headers, connection details |
+| `trace` | Detailed protocol-level information |
+
+Configurable via `logging.level` in StaticConfig.
+
+### Configuration
+
+- `logging.level`: Log verbosity (default: `"info"`)
+- `logging.format`: `"text"` or `"json"` (default: `"text"`)
+- `logging.log_file_path`: Optional file path; when set, logs are written to this file in addition to stdout
+
+### File Logging and fail2ban
+
+File logging is the primary integration point for fail2ban. In container deployments, the log directory is volume-mounted so fail2ban on the host can read it directly.
+
+A corresponding fail2ban filter definition and jail configuration will be provided in the deployment task.
+
+## Acceptance Criteria
+
+- [ ] `tracing` and `tracing-subscriber` initialized with dual output (file + stdout)
+- [ ] File output enabled when `log_file_path` is configured
+- [ ] Stdout output always enabled
+- [ ] Custom event format with `key=value` pairs
+- [ ] `REQUEST` prefix for access logs
+- [ ] `RATE_LIMIT` prefix for rate limit events
+- [ ] `UPSTREAM_ERROR` prefix for upstream failures
+- [ ] `CONFIG_RELOAD` prefix for config reload events
+- [ ] Log level configurable via `logging.level`
+- [ ] JSON format output when `logging.format = "json"`
+- [ ] Text format output when `logging.format = "text"` (default)
+- [ ] `duration_ms` field in access logs for response time
+- [ ] Unit tests for log format output
+
+## References
+
+- docs/architecture/operations.md — logging section, log format
+- docs/architecture/decisions/007-custom-log-format.md — custom log format rationale
+- docs/architecture/decisions/020-container-deployment.md — file-primary logging rationale
+
+## Notes
+
+> The fail2ban filter and jail configuration are a separate deployment task. This task focuses on producing the correct log format.
+
+## Summary
+
+> To be filled on completion
--- a/tasks/ops/rate-limiting.md
+++ b/tasks/ops/rate-limiting.md
@@ -0,0 +1,89 @@
+---
+id: ops/rate-limiting
+name: Implement token bucket rate limiting with IPv6 /64 normalization and background eviction
+status: pending
+depends_on: [config/dynamic-config]
+scope: moderate
+risk: medium
+impact: component
+level: implementation
+---
+
+## Description
+
+Implement per-IP token bucket rate limiting as axum middleware. This runs before the proxy handler and rejects requests that exceed the rate limit with 429 Too Many Requests.
+
+### Token Bucket Algorithm
+
+- **Nodelay** semantics matching nginx's `limit_req burst nodelay`
+- When bucket is empty, request is immediately rejected with 429 — no queuing
+- Tokens added at rate of `requests_per_second` (1 token every `1000ms / requests_per_second`)
+- Bucket capacity is `burst` value
+- Per-IP in Phase 1 (not per-site)
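+
+The nodelay behaviour described above can be sketched as follows. The field names follow the struct listed under Rate Limit State; starting with a full bucket and passing `now` explicitly (to keep the logic testable) are assumptions, not part of the spec.
+
+```rust
+use std::time::Instant;
+
+struct TokenBucket {
+    tokens: f64,
+    last_refill: Instant,
+    rate: f64, // tokens per second (`requests_per_second`)
+    max: u32,  // bucket capacity (`burst`)
+}
+
+impl TokenBucket {
+    /// Assumption: a new client starts with a full bucket.
+    fn new(rate: f64, max: u32, now: Instant) -> Self {
+        Self { tokens: max as f64, last_refill: now, rate, max }
+    }
+
+    /// Refill for the elapsed time, then try to take one token.
+    /// Returns `false` when the bucket is empty; the caller answers 429
+    /// immediately, without queuing the request.
+    fn try_acquire(&mut self, now: Instant) -> bool {
+        let elapsed = now.saturating_duration_since(self.last_refill).as_secs_f64();
+        self.tokens = (self.tokens + elapsed * self.rate).min(self.max as f64);
+        self.last_refill = now;
+        if self.tokens >= 1.0 {
+            self.tokens -= 1.0;
+            true
+        } else {
+            false
+        }
+    }
+}
+```
+
+With `rate = 2.0` and `max = 3`, three back-to-back requests pass, a fourth is rejected, and one more passes after roughly half a second.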
+
+### IPv6 Normalization
+
+- **IPv4**: Rate limited per individual address (`/32`)
+- **IPv6**: Rate limited per `/64` prefix. All addresses in the same `/64` share a token bucket
+- Normalize IPv6 addresses to their `/64` prefix before bucket lookup
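+
+One way to express the normalization, as a sketch: `rate_limit_key` is a hypothetical name. IPv4-mapped IPv6 addresses (`::ffff:a.b.c.d`) are not given special treatment here, which is an assumption the implementation may want to revisit.
+
+```rust
+use std::net::{IpAddr, Ipv6Addr};
+
+/// Map a client address to the key used for bucket lookup.
+fn rate_limit_key(ip: IpAddr) -> IpAddr {
+    match ip {
+        // IPv4: one bucket per address (/32).
+        IpAddr::V4(_) => ip,
+        // IPv6: keep the first four 16-bit segments (the /64 prefix), zero the rest.
+        IpAddr::V6(v6) => {
+            let s = v6.segments();
+            IpAddr::V6(Ipv6Addr::new(s[0], s[1], s[2], s[3], 0, 0, 0, 0))
+        }
+    }
+}
+```
+
+For example, `2001:db8:1:2:aaaa:bbbb:cccc:dddd` and `2001:db8:1:2::1` both map to `2001:db8:1:2::` and therefore share a bucket.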
+
+### Rate Limit State
+
+- `Arc<Mutex<HashMap<IpAddr, TokenBucket>>>` shared via axum State
+- Token bucket struct with: `tokens: f64`, `last_refill: Instant`, `rate: f64`, `max: u32`
+
+### Background Eviction Task
+
+- Runs every 60 seconds (configurable)
+- Removes entries whose last access timestamp is older than 300 seconds (5 minutes default)
+- Prevents unbounded memory growth
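+
+A simplified sketch of a single eviction pass. To stay self-contained it stores only a last-access `Instant` per key rather than the full `TokenBucket`; `evict_stale` is a hypothetical name, and the periodic scheduling (for example a `tokio` interval) is left out.
+
+```rust
+use std::collections::HashMap;
+use std::net::IpAddr;
+use std::time::{Duration, Instant};
+
+/// Remove entries that have been idle for longer than `max_idle`.
+/// Returns the number of entries removed.
+fn evict_stale(
+    last_seen: &mut HashMap<IpAddr, Instant>,
+    now: Instant,
+    max_idle: Duration,
+) -> usize {
+    let before = last_seen.len();
+    // Keep only entries accessed within the idle window.
+    last_seen.retain(|_, seen| now.saturating_duration_since(*seen) <= max_idle);
+    before - last_seen.len()
+}
+```
+
+With the defaults above, the background task would call this every 60 seconds with `max_idle = Duration::from_secs(300)` while holding the map lock.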
+
+### Config Reload Behavior
+
+When rate limit parameters change:
+1. New `DynamicConfig` swapped in via ArcSwap
+2. On next request from an existing IP, rate limiter reads current DynamicConfig
+3. Token bucket refills using new rate, capacity set to new burst
+4. If current token count exceeds new burst max, cap to new burst max
+5. HashMap is NOT cleared — avoids rate-limiting gap
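+
+Steps 3 and 4 could look like the sketch below for a single existing bucket. `apply_config` is a hypothetical method name, and the struct is reduced to the fields that change on reload.
+
+```rust
+struct TokenBucket {
+    tokens: f64,
+    rate: f64,
+    max: u32,
+}
+
+impl TokenBucket {
+    /// Adopt new limits on an existing bucket without resetting it.
+    fn apply_config(&mut self, requests_per_second: f64, burst: u32) {
+        self.rate = requests_per_second;
+        self.max = burst;
+        // Cap the balance if the burst shrank; a larger burst does not
+        // grant tokens immediately, they accrue at the new rate.
+        self.tokens = self.tokens.min(burst as f64);
+    }
+}
+```
+
+A bucket holding 8 tokens drops to 4 when `burst` is lowered to 4, and stays at 4 if `burst` is later raised to 20.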
+
+### Logging
+
+Rate limit events logged with `RATE_LIMIT` prefix:
+```
+RATE_LIMIT client_ip=203.0.113.50 host=git.alk.dev path=/login status=429
+```
+
+### Middleware Integration
+
+Rate limiting runs as tower middleware before the proxy handler in the axum router.
+
+## Acceptance Criteria
+
+- [ ] Token bucket implementation with nodelay semantics
+- [ ] Per-IP rate limiting with configurable rate and burst
+- [ ] IPv6 addresses normalized to `/64` prefix before bucket lookup
+- [ ] IPv4 addresses used as-is (`/32`)
+- [ ] Background eviction task removes stale entries every 60 seconds
+- [ ] Config reload: new rate/burst parameters adopted on next request from existing IP
+- [ ] Token count capped to new burst max when burst decreases
+- [ ] HashMap not cleared on config reload (no rate-limiting gap)
+- [ ] `429 Too Many Requests` response with `Too Many Requests` body
+- [ ] `RATE_LIMIT` prefixed log event with `client_ip`, `host`, `path`, `status`
+- [ ] Rate limiter state shared via `Arc<Mutex<HashMap<IpAddr, TokenBucket>>>`
+- [ ] Unit tests for token bucket algorithm (fill, drain, reject)
+- [ ] Unit tests for IPv6 `/64` normalization
+- [ ] Integration test: requests above rate limit receive 429
+
+## References
+
+- docs/architecture/operations.md — rate limiting section
+- docs/architecture/decisions/006-rate-limiting-approach.md — token bucket rationale
+
+## Notes
+
+> The rate limiter sits on the hot path, so avoid serializing every request behind one global lock. Each request both reads and updates its bucket, so there is no read-only path; a sharded concurrent map such as `DashMap` reduces contention compared with a single `Mutex<HashMap>`. The spec says `Mutex<HashMap>`, but an implementation agent may choose a more performant concurrent data structure.
+
+## Summary
+
+> To be filled on completion
--- a/tasks/ops/signals-and-shutdown.md
+++ b/tasks/ops/signals-and-shutdown.md
@@ -0,0 +1,76 @@
+---
+id: ops/signals-and-shutdown
+name: Implement signal handling (SIGTERM/SIGINT/SIGHUP) and graceful shutdown sequence
+status: pending
+depends_on: [config/dynamic-config, ops/admin-socket]
+scope: moderate
+risk: medium
+impact: component
+level: implementation
+---
+
+## Description
+
+Implement signal handling for SIGTERM, SIGINT, and SIGHUP, plus the graceful shutdown sequence.
+
+### Signal Handling
+
+Using the `signal-hook` crate (per ADR-009):
+
+- **SIGTERM / SIGINT**: Graceful shutdown
+- **SIGHUP**: Config reload (same code path as admin socket `reload` command)
+
+### Graceful Shutdown Sequence
+
+On SIGTERM or SIGINT:
+
+1. **Stop accepting new connections** — Close all TCP listening sockets
+2. **Close idle keep-alive connections** — Close idle connections immediately; responses to requests still in flight carry `Connection: close` so those connections are not reused
+3. **Wait for in-flight requests** — Up to `shutdown_timeout_secs` (default: 30)
+4. **Force-close remaining connections** — After timeout, TCP RST
+5. **Cancel background tasks** — ACME renewal, rate limiter eviction, admin socket
+6. **Exit with code 0**
+
+### SIGHUP for Config Reload
+
+SIGHUP triggers the same config reload as the admin socket `reload` command:
+
+1. Re-read the config file from disk
+2. Deserialize into full config (static + dynamic)
+3. Validate the full config
+4. If valid: swap DynamicConfig, log warnings for any static changes
+5. If invalid: reject reload, log error, keep old DynamicConfig
+
+SIGHUP provides no feedback on success or failure — it just logs. The admin socket is the programmatic alternative with structured responses.
+
+### Shutdown Timeout
+
+Configurable via `shutdown_timeout_secs` in StaticConfig (default: 30 seconds).
+
+## Acceptance Criteria
+
+- [ ] `signal-hook` handles SIGTERM, SIGINT, SIGHUP
+- [ ] SIGTERM/SIGINT triggers graceful shutdown sequence
+- [ ] SIGHUP triggers config reload (same code path as admin socket)
+- [ ] Graceful shutdown: close listening sockets first
+- [ ] Graceful shutdown: close idle keep-alive connections
+- [ ] Graceful shutdown: wait for in-flight requests up to timeout
+- [ ] Graceful shutdown: force-close remaining connections after timeout
+- [ ] Cancel background tasks (ACME, eviction, admin socket) on shutdown
+- [ ] Exit code 0 on graceful shutdown
+- [ ] `shutdown_timeout_secs` configurable in StaticConfig
+- [ ] SIGHUP reload converges on same code path as admin socket reload
+- [ ] Integration test: send SIGTERM, verify graceful shutdown sequence
+
+## References
+
+- docs/architecture/operations.md — signal handling, shutdown sequence
+- docs/architecture/decisions/009-signal-handling.md — signal handling strategy
+
+## Notes
+
+> The shutdown sequence must be carefully ordered. Closing listening sockets before waiting for in-flight requests ensures no new connections arrive while existing ones drain.
+
+## Summary
+
+> To be filled on completion