reverse-proxy/tasks/fix/graceful-shutdown.md
glm-5.1 f9d7b8112b Decompose implementation review fixes into 14 atomic tasks with post-fix review
Break down findings from review #002 into dependency-ordered fix tasks:

Critical/High:
- fix/acme-contact-and-challenge (C1+C2): Add acme_contact field, wire to
  ACME, remove unused challenge_config, add validation rule 19
- fix/remove-health-and-hardcode-https (W5+W14+ADR-022): Remove /health
  from main listener, hardcode X-Forwarded-Proto to https
- fix/config-reload-static-drift (C4): Use ArcSwap<StaticConfig> so reload
  diffs against last config, not startup config
- fix/access-logging (W13): Wire up log_request! macro for every proxied
  request with client_ip, host, method, path, status, upstream, duration_ms

Medium:
- fix/graceful-shutdown (W1+W7): Join HTTPS tasks with timeout instead of
  abort, add shutdown signal to admin socket and eviction task
- fix/connect-timeout (W4): Wire upstream_connect_timeout_secs to enforce
  separate connect timeout

Low/Independent:
- fix/token-bucket-nanosecond (W6): Use as_nanos() instead of as_millis()
- fix/normalize-host-ipv6 (S3): Handle IPv6 bracket notation in normalize_host
- fix/http-port-validation (S1): Validate http_port in range 0 or 1-65535
- fix/integration-test-toml (S10): Fix double-nested listeners.listeners.sites
- fix/logging-test-global-subscriber (W9): Use try_init() to avoid test conflicts
- fix/fragile-error-detection (W3): Add typed error matching or documented string match
- fix/add-code-comments (C3,W8,W10,W11,S9): Document correct-but-non-obvious behaviors
- fix/request-timeout-scope (S8): Document full-request timeout scope
- fix/clean-dead-code (S4+S2): Remove dead_code annotations, add #[non_exhaustive]

Review gate:
- review/post-fix-review: Verify all fixes against architecture spec
2026-06-12 04:08:45 +00:00

id: fix/graceful-shutdown
name: Fix shutdown to drain listeners and stop background tasks cleanly
status: pending
depends_on:
scope: moderate
risk: medium
impact: component
level: implementation
review_findings: W1, W7

Description

Two related shutdown issues:

  1. W1: On shutdown, main.rs calls handle.abort() on each HTTPS server task. This cancels the tokio task at its next await point, so in-flight request processing is cut off before responses complete. Because tokio drops an aborted task's future, values the task owns (including any InFlightGuard) still run their Drop impls, so the counter is not leaked; the problem is the interrupted requests. The architecture spec says tasks should be joined with a timeout and aborted only after the shutdown timeout expires.

    serve_https_listener already takes a shutdown_rx that breaks the accept loop on the shutdown signal, so the tasks stop accepting new connections on their own. The remaining work is to wait for in-flight requests to drain instead of aborting the tasks.

  2. W7: start_admin_socket runs an infinite loop accepting connections with no way to break out. It doesn't accept a shutdown signal, so it can't be gracefully stopped. Similarly, the rate limiter eviction task runs an infinite loop with no cancellation mechanism.

Changes Required

src/main.rs:

  • Replace handle.abort() loop with timeout-based join:
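    // One deadline shared by all listeners, so the total wait is bounded
    // by shutdown_timeout rather than N x shutdown_timeout.
    let deadline = tokio::time::Instant::now() + shutdown.shutdown_timeout();
    for mut handle in https_server_handles {
        // Await by mutable reference so the handle is still usable for abort().
        match tokio::time::timeout_at(deadline, &mut handle).await {
            Ok(_) => {}
            Err(_) => {
                warn!("shutdown timeout expired, aborting listener task");
                handle.abort();
            }
        }
    }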
    
  • After draining, signal cancellation to admin socket and eviction task
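
The shared-deadline idea behind the join loop can be modelled without tokio. The sketch below is a std-only stand-in, not the project's code: threads play the role of the HTTPS listener tasks, an mpsc channel reports completion, and the names drain_with_deadline, spawn_workers and DrainReport are hypothetical. It only illustrates that every wait is measured against one deadline, so a slow worker cannot extend the total shutdown time.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

/// Result of waiting for workers to drain (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct DrainReport {
    /// Workers that reported completion before the deadline.
    pub finished: usize,
    /// Workers still running (or gone without reporting) at the deadline.
    pub unfinished: usize,
}

/// Waits for `expected` completion messages against ONE shared deadline.
pub fn drain_with_deadline(
    done_rx: &mpsc::Receiver<usize>,
    expected: usize,
    shutdown_timeout: Duration,
) -> DrainReport {
    let deadline = Instant::now() + shutdown_timeout;
    let mut finished = 0;
    while finished < expected {
        // Time left until the shared deadline; zero once it has passed.
        let remaining = deadline.saturating_duration_since(Instant::now());
        match done_rx.recv_timeout(remaining) {
            Ok(_worker_id) => finished += 1,
            // Deadline reached, or every sender is gone: stop waiting.
            Err(RecvTimeoutError::Timeout) | Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    DrainReport { finished, unfinished: expected - finished }
}

/// Spawns one thread per entry; each sleeps for its "drain time" and then
/// reports its index on the returned channel.
pub fn spawn_workers(drain_times: &[Duration]) -> mpsc::Receiver<usize> {
    let (done_tx, done_rx) = mpsc::channel();
    for (id, &drain_time) in drain_times.iter().enumerate() {
        let done_tx = done_tx.clone();
        thread::spawn(move || {
            thread::sleep(drain_time);
            // The receiver may already be gone; ignoring the error is fine here.
            let _ = done_tx.send(id);
        });
    }
    done_rx
}
```

In the tokio version the unfinished workers are the ones that would be aborted once the deadline passes.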

src/admin/socket.rs:

  • Add a shutdown_rx: tokio::sync::watch::Receiver<bool> parameter to start_admin_socket
  • Replace the infinite loop { listener.accept().await } with tokio::select!:
    tokio::select! {
        result = listener.accept() => { /* handle connection */ },
        _ = shutdown_rx.changed() => {
            info!("admin socket shutting down");
            break;
        }
    }
    
  • Clean up the socket file on exit (remove the Unix domain socket file)
  • Update callers in main.rs to pass the shutdown channel
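
One way to cover the socket file cleanup is a small RAII guard, sketched below with only the standard library. SocketFileGuard is a hypothetical name, not an existing type in the codebase, and where the guard is created (for example right after binding the listener inside start_admin_socket) is an assumption. The guard removes the file when it goes out of scope, so cleanup happens on every exit path from the accept loop, including early returns on error.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Removes the Unix domain socket file when dropped (hypothetical helper).
pub struct SocketFileGuard {
    path: PathBuf,
}

impl SocketFileGuard {
    /// Takes ownership of the path to clean up; does not create the file.
    pub fn new(path: impl Into<PathBuf>) -> Self {
        Self { path: path.into() }
    }

    /// Path of the socket file this guard will remove.
    pub fn path(&self) -> &Path {
        &self.path
    }
}

impl Drop for SocketFileGuard {
    fn drop(&mut self) {
        match fs::remove_file(&self.path) {
            Ok(()) => {}
            // Already gone (for example removed by an operator): nothing to do.
            Err(e) if e.kind() == io::ErrorKind::NotFound => {}
            // Drop cannot return an error, so report it and move on.
            Err(e) => eprintln!(
                "failed to remove admin socket {}: {e}",
                self.path.display()
            ),
        }
    }
}
```

Holding the guard in a local variable for the lifetime of the accept loop is enough; no explicit cleanup call is needed after the break.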

src/rate_limit/mod.rs:

  • Add a shutdown_rx: tokio::sync::watch::Receiver<bool> parameter to start_eviction_task
  • Replace infinite loop with tokio::select!:
    tokio::select! {
        _ = interval_timer.tick() => { limiter.evict_stale(max_age); },
        _ = shutdown_rx.changed() => {
            info!("rate limiter eviction task shutting down");
            break;
        }
    }
    
  • Update caller in main.rs
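
The tick-or-shutdown structure of the eviction loop can be modelled with the standard library alone. In this sketch a thread replaces the tokio task and recv_timeout on an mpsc channel replaces tokio::select! over the interval and shutdown_rx: a timeout is a tick, while a message or a dropped sender means shutdown. Limiter, touch, tracked_keys and start_eviction_thread are hypothetical stand-ins; only evict_stale(max_age) mirrors a name from the task description, and its body here is an assumption.

```rust
use std::collections::HashMap;
use std::sync::mpsc::{self, RecvTimeoutError};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the rate limiter: key -> last time it was seen.
#[derive(Default)]
pub struct Limiter {
    last_seen: Mutex<HashMap<String, Instant>>,
}

impl Limiter {
    /// Records activity for a key.
    pub fn touch(&self, key: &str) {
        self.last_seen.lock().unwrap().insert(key.to_owned(), Instant::now());
    }

    /// Drops every key not seen within `max_age`.
    pub fn evict_stale(&self, max_age: Duration) {
        let now = Instant::now();
        self.last_seen
            .lock()
            .unwrap()
            .retain(|_, seen| now.saturating_duration_since(*seen) <= max_age);
    }

    /// Number of keys currently tracked.
    pub fn tracked_keys(&self) -> usize {
        self.last_seen.lock().unwrap().len()
    }
}

/// Runs eviction every `interval` until a shutdown message arrives
/// or the sending side of `shutdown_rx` is dropped.
pub fn start_eviction_thread(
    limiter: Arc<Limiter>,
    interval: Duration,
    max_age: Duration,
    shutdown_rx: mpsc::Receiver<()>,
) -> JoinHandle<()> {
    thread::spawn(move || loop {
        match shutdown_rx.recv_timeout(interval) {
            // No shutdown within one interval: this is a tick.
            Err(RecvTimeoutError::Timeout) => limiter.evict_stale(max_age),
            // Explicit signal or sender dropped: leave the loop.
            Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
        }
    })
}
```

Treating a dropped sender as shutdown matches the tokio snippet above, where shutdown_rx.changed() also resolves (with an error) once the watch sender is gone.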

Acceptance Criteria

  • HTTPS server tasks are joined with a timeout, not immediately aborted
  • Tasks are only aborted if the shutdown timeout expires before they finish
  • Admin socket listener breaks its accept loop on shutdown signal
  • Admin socket file is cleaned up on shutdown
  • Rate limiter eviction task breaks its loop on shutdown signal
  • ACME state machine task is cancellable (it already exits on None from stream, but should also respond to cancellation)
  • In-flight requests are allowed to drain before forceful shutdown
  • All existing tests pass
  • cargo clippy passes with no warnings

References

  • docs/architecture/operations.md — shutdown sequence
  • docs/reviews/002-implementation-review.md — W1, W7 findings
  • src/main.rs — current shutdown sequence
  • src/admin/socket.rs — current infinite loop
  • src/rate_limit/mod.rs — current infinite eviction loop
  • src/server.rs — InFlightCounter and drain_in_flight

Notes

To be filled by implementation agent

Summary

To be filled on completion