Decompose implementation review fixes into 14 atomic tasks with post-fix review
Break down findings from review #002 into dependency-ordered fix tasks:

Critical/High:
- fix/acme-contact-and-challenge (C1+C2): Add acme_contact field, wire to ACME, remove unused challenge_config, add validation rule 19
- fix/remove-health-and-hardcode-https (W5+W14+ADR-022): Remove /health from main listener, hardcode X-Forwarded-Proto to https
- fix/config-reload-static-drift (C4): Use ArcSwap<StaticConfig> so reload diffs against last config, not startup config
- fix/access-logging (W13): Wire up log_request! macro for every proxied request with client_ip, host, method, path, status, upstream, duration_ms

Medium:
- fix/graceful-shutdown (W1+W7): Join HTTPS tasks with timeout instead of abort, add shutdown signal to admin socket and eviction task
- fix/connect-timeout (W4): Wire upstream_connect_timeout_secs to enforce separate connect timeout

Low/Independent:
- fix/token-bucket-nanosecond (W6): Use as_nanos() instead of as_millis()
- fix/normalize-host-ipv6 (S3): Handle IPv6 bracket notation in normalize_host
- fix/http-port-validation (S1): Validate http_port in range 0 or 1-65535
- fix/integration-test-toml (S10): Fix double-nested listeners.listeners.sites
- fix/logging-test-global-subscriber (W9): Use try_init() to avoid test conflicts
- fix/fragile-error-detection (W3): Add typed error matching or documented string match
- fix/add-code-comments (C3,W8,W10,W11,S9): Document correct-but-non-obvious behaviors
- fix/request-timeout-scope (S8): Document full-request timeout scope
- fix/clean-dead-code (S4+S2): Remove dead_code annotations, add #[non_exhaustive]

Review gate:
- review/post-fix-review: Verify all fixes against architecture spec
97
tasks/fix/graceful-shutdown.md
Normal file
@@ -0,0 +1,97 @@
---
id: fix/graceful-shutdown
name: Fix shutdown to drain listeners and stop background tasks cleanly
status: pending
depends_on: []
scope: moderate
risk: medium
impact: component
level: implementation
review_findings: [W1, W7]
---

## Description

Two related shutdown issues:

1. **W1**: On shutdown, `handle.abort()` is called on each HTTPS server task in `main.rs`. This cancels the tokio task at its next `.await` point, interrupting in-flight request processing. The `InFlightGuard` RAII type ensures `decrement` is called on drop, but after `abort()` the guard is dropped because the request was cut off, not because it completed. The architecture spec says tasks should be joined with a timeout and aborted only after the shutdown timeout expires.

   The good news: `serve_https_listener` already has a `shutdown_rx` that breaks the accept loop on the shutdown signal, so tasks stop accepting new connections. We only need to wait for them to drain in-flight requests instead of aborting them.

2. **W7**: `start_admin_socket` runs an infinite `loop` accepting connections with no way to break out. It doesn't accept a shutdown signal, so it can't be stopped gracefully. Similarly, the rate limiter eviction task runs an infinite loop with no cancellation mechanism.

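For readers unfamiliar with the RAII pattern mentioned in W1, here is a minimal, std-only sketch of a counter and guard pair. It is not the code in `src/server.rs`: the field layout, the `enter` and `current` method names, and the use of `AtomicUsize` are all assumptions made for illustration. It only shows why a guard keeps the in-flight count accurate on every exit path.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the in-flight counter; the real type may differ.
#[derive(Clone, Default)]
pub struct InFlightCounter {
    count: Arc<AtomicUsize>,
}

impl InFlightCounter {
    /// Registers one in-flight request and returns a guard for it.
    /// (`enter` is an illustrative name, not taken from the codebase.)
    pub fn enter(&self) -> InFlightGuard {
        self.count.fetch_add(1, Ordering::SeqCst);
        InFlightGuard {
            count: Arc::clone(&self.count),
        }
    }

    /// Number of requests currently being processed.
    pub fn current(&self) -> usize {
        self.count.load(Ordering::SeqCst)
    }
}

/// Decrements the shared counter when dropped, whether the request
/// finished normally, returned early, or was cancelled.
pub struct InFlightGuard {
    count: Arc<AtomicUsize>,
}

impl Drop for InFlightGuard {
    fn drop(&mut self) {
        self.count.fetch_sub(1, Ordering::SeqCst);
    }
}
```

A drain step can then poll `current()` until it reaches zero or the shutdown timeout expires.
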
### Changes Required

**`src/main.rs`**:
- Replace `handle.abort()` loop with timeout-based join:
```rust
// One deadline shared by all listeners, so the total wait is bounded
// by the shutdown timeout rather than by timeout * number of listeners.
let deadline = tokio::time::Instant::now() + shutdown.shutdown_timeout();
for mut handle in https_server_handles {
    // Await by mutable reference so the handle is not moved and can
    // still be aborted if the deadline passes.
    match tokio::time::timeout_at(deadline, &mut handle).await {
        Ok(Ok(_)) => {}
        Ok(Err(e)) => warn!("listener task failed during shutdown: {e}"),
        Err(_) => {
            warn!("shutdown timeout expired, aborting listener task");
            handle.abort();
        }
    }
}
```
- After draining, signal cancellation to the admin socket and eviction task

**`src/admin/socket.rs`**:
- Add a `shutdown_rx: tokio::sync::watch::Receiver<bool>` parameter to `start_admin_socket`
- Replace the infinite `loop { listener.accept().await }` with a loop around `tokio::select!`:
```rust
loop {
    tokio::select! {
        result = listener.accept() => { /* handle connection */ },
        // Resolves when the shutdown value changes or the sender is dropped.
        _ = shutdown_rx.changed() => {
            info!("admin socket shutting down");
            break;
        }
    }
}
```
- Clean up the socket file on exit (remove the Unix domain socket file)
- Update callers in `main.rs` to pass the shutdown channel

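One way to make the socket-file cleanup hard to forget is a small drop guard held by the admin socket task. The sketch below is std-only and entirely hypothetical: `SocketFileGuard` and `remove_socket_file` are names invented here, not existing items in `src/admin/socket.rs`. It treats an already-missing file as success so that a repeated cleanup is harmless.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Removes `path`, treating "file not found" as success.
/// (Hypothetical helper, not part of the existing codebase.)
pub fn remove_socket_file(path: &Path) -> io::Result<()> {
    match fs::remove_file(path) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}

/// Hypothetical guard that removes the socket file when it goes out of
/// scope, including on early returns from the accept loop.
pub struct SocketFileGuard {
    path: PathBuf,
}

impl SocketFileGuard {
    pub fn new(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into() }
    }

    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for SocketFileGuard {
    fn drop(&mut self) {
        // Drop cannot return an error, so a failed removal is ignored here;
        // a real implementation might log it instead.
        let _ = remove_socket_file(&self.path);
    }
}
```

Creating the guard right after binding the listener means the file is removed when the task function returns after the `break`. If the task is aborted instead, the guard is dropped along with the task's future.
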
**`src/rate_limit/mod.rs`**:
- Add a `shutdown_rx: tokio::sync::watch::Receiver<bool>` parameter to `start_eviction_task`
- Replace the infinite loop with a loop around `tokio::select!`:
```rust
loop {
    tokio::select! {
        _ = interval_timer.tick() => { limiter.evict_stale(max_age); },
        // Resolves when the shutdown value changes or the sender is dropped.
        _ = shutdown_rx.changed() => {
            info!("rate limiter eviction task shutting down");
            break;
        }
    }
}
```
- Update caller in `main.rs`

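To make the eviction step concrete, here is a std-only sketch of what a call like `evict_stale(max_age)` might do. The `Limiter` struct, its `last_seen` map, and the `touch` and `evict_stale_at` methods are assumptions for illustration only; the real limiter in `src/rate_limit/mod.rs` is likely shared across tasks and may store token-bucket state rather than a bare timestamp. Passing `now` explicitly keeps the logic deterministic and easy to test.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical limiter state: when each client key was last seen.
#[derive(Default)]
pub struct Limiter {
    last_seen: HashMap<String, Instant>,
}

impl Limiter {
    /// Records activity for `key` at time `now`.
    pub fn touch(&mut self, key: &str, now: Instant) {
        self.last_seen.insert(key.to_owned(), now);
    }

    /// Removes entries idle for longer than `max_age` as of `now`
    /// and returns how many were removed.
    pub fn evict_stale_at(&mut self, now: Instant, max_age: Duration) -> usize {
        let before = self.last_seen.len();
        self.last_seen
            .retain(|_, seen| now.saturating_duration_since(*seen) <= max_age);
        before - self.last_seen.len()
    }

    /// Number of tracked client keys.
    pub fn len(&self) -> usize {
        self.last_seen.len()
    }
}
```

Because each eviction pass is short and synchronous, the shutdown branch of the `select!` is only ever delayed by one pass.
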
## Acceptance Criteria

- [ ] HTTPS server tasks are joined with a timeout, not immediately aborted
- [ ] Tasks are only aborted if the shutdown timeout expires before they finish
- [ ] Admin socket listener breaks its accept loop on shutdown signal
- [ ] Admin socket file is cleaned up on shutdown
- [ ] Rate limiter eviction task breaks its loop on shutdown signal
- [ ] ACME state machine task is cancellable (it already exits on `None` from stream, but should also respond to cancellation)
- [ ] In-flight requests are allowed to drain before forceful shutdown
- [ ] All existing tests pass
- [ ] `cargo clippy` passes with no warnings

## References

- docs/architecture/operations.md: shutdown sequence
- docs/reviews/002-implementation-review.md: W1, W7 findings
- src/main.rs: current shutdown sequence
- src/admin/socket.rs: current infinite loop
- src/rate_limit/mod.rs: current infinite eviction loop
- src/server.rs: InFlightCounter and drain_in_flight

## Notes

> To be filled by implementation agent

## Summary

> To be filled on completion