hub/docs/architecture/hub-startup.md

status: draft
last_updated: 2026-05-18

Hub Startup Sequence

Overview

The hub startup is an ordered process that resolves configuration, connects to infrastructure services, initializes subsystems, and begins serving requests. This document specifies the sequence, failure modes, and readiness contract. The config system it depends on is defined in hub-config.md.

Design Principles

  1. Fail fast on missing prerequisites — If the master key, config file, Postgres, or Redis is unavailable, the hub MUST NOT start in a degraded state. Partial availability is worse than no availability.
  2. Config before connections — All configuration is resolved and validated before any network connections are made. This prevents partial-initialization states where some subsystems are connected and others aren't.
  3. Ordered, not parallel — Startup steps are sequential. Each step confirms success before the next begins. This makes startup deterministic and debuggable. Parallel initialization can be added later if startup latency becomes a problem, but correctness trumps speed.
  4. Single entry point — One function (startHub) owns the sequence. No scattered initialization across module scopes or top-level side effects.

Startup Sequence

Step 1: Resolve Config Paths
  │   Determine config file path and master key path.
  │   Defaults: /etc/alkhub/config.json, /run/secrets/hub_master_key
  │   Override: ALKHUB_CONFIG_PATH and ALKHUB_MASTER_KEY_PATH env vars (paths only, non-sensitive, acceptable).
  │   Fail if files don't exist.
  │
Step 2: Load and Decrypt Config
  │   loadConfig(configPath, masterKeyPath) → HubConfig
  │   Reads master key, decrypts _encrypted fields, validates with TypeBox.
  │   Fail if master key is missing, config is invalid JSON, decryption
  │   fails, or TypeBox validation fails.
  │
Step 3: Initialize Logger
  │   Configure logtape with HubConfig.logLevel.
  │   Sink: stdout, structured JSON in production, pretty-print in development.
  │   Production vs. dev determined by HubConfig.development flag (see hub-config.md).
  │   From this point, structured logging is available for all subsequent steps.
  │
Step 4: Connect to Postgres
  │   Create connection pool using HubConfig.postgres.
  │   Verify connectivity: SELECT 1.
  │   Fail if connection is refused or authentication fails.
  │
Step 5: Run Migrations
  │   Run pending Drizzle migrations against Postgres
  │   using drizzle-orm's programmatic migrator (not drizzle-kit CLI).
  │   Migrations are SQL files from the ./migrations directory.
  │   Fail if migrations fail (schema mismatch, SQL errors).
  │   If the hub crashes mid-migration, the Drizzle migration table
  │   tracks which migrations completed. On next startup, migrations
  │   resume from the last completed step. Partial migrations require
  │   manual operator attention only if a SQL statement fails mid-transaction.
  │
Step 6: Connect to Redis
  │   Create Redis client using HubConfig.redis.
  │   Verify connectivity: PING.
  │   Fail if connection is refused or authentication fails.
  │
Step 7: Initialize Encryption Key Ring
  │   resolveEncryptionKeys(HubConfig.encryptionKeys) → EncryptionKeyRing
  │   Validates that at least one key exists, key versions are sequential.
  │   The key ring is used by client_secrets operations and key rotation.
  │
Step 8: Initialize Drizzle Client
  │   Create Drizzle ORM client wrapping the Postgres pool + schema.
  │   Schema namespace loaded from src/storage/schema.ts.
  │
Step 9: Initialize Subsystems
  │   Each subsystem has its own architecture doc for details.
  │   Initialization here creates and wires the runtime objects.
  │   ├── Operation Registry: scan hub operation directories
  │   ├── Keypal: initialize with HubKeyStorage (Drizzle adapter)
  │   │   └── apiKeyCacheTtl from HubConfig.auth configures RedisCache TTL
  │   ├── PubSub: create with RedisEventTarget (see pubsub-redis.md, from `@alkdev/pubsub` with `prefix` option)
  │   ├── Call Protocol: PendingRequestMap + CallHandler (from `@alkdev/operations`, see call-graph.md)
  │   └── Session System: AI SDK configuration (see agent-sessions.md)
  │       └── LLM provider keys are resolved from client_secrets at runtime
  │
Step 10: Start Hono HTTP Server + WebSocket Listener
  │   Listen on HubConfig.http.host:HubConfig.http.port.
  │   Register all HTTP routes and middleware.
  │   Register the /ws WebSocket upgrade route.
  │   On WS upgrade: authenticate spoke, create WebSocketEventTarget,
  │   register in RunnerPool. (This is a single Hono route, not a
  │   separate server — the WS handler rides on the same HTTP listener.)
  │
Step 11: Signal Ready
      Health check endpoint (/health) starts returning 200.
      Startup is complete. The hub is serving.
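
The ordered, stop-at-first-failure behavior of the sequence above can be sketched as a small step runner. This is a minimal illustration, not the hub's implementation: `StartupStep`, `runSteps`, and the `onStep` callback are hypothetical names, and the step names are borrowed from the Readiness Contract.

```typescript
// Hypothetical sketch: run startup steps strictly in order and stop at the
// first failure. None of these names come from the hub codebase.
interface StartupStep {
  name: string;              // e.g. "connect-postgres"
  run: () => Promise<void>;  // rejects on failure
}

async function runSteps(
  steps: StartupStep[],
  onStep: (name: string) => void = () => {},
): Promise<void> {
  for (const step of steps) {
    onStep(step.name); // lets other code observe which step is in progress
    try {
      await step.run(); // the next step starts only after this one succeeds
    } catch (err) {
      // Name the failing step so the operator sees where startup stopped.
      throw new Error(`startup failed at step "${step.name}": ${String(err)}`);
    }
  }
}
```

Because each step is awaited before the next begins, the same failure always surfaces at the same step name.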

Failure Modes

Step 1-2: Config Resolution Failures

| Failure | Behavior |
| --- | --- |
| Config file not found | Exit with error message including expected path |
| Master key file not found | Exit with error message including expected path |
| Master key is empty or whitespace | Exit. Key must be non-empty |
| Config file is invalid JSON | Exit with parse error details |
| Decryption of _encrypted field fails | Exit. Wrong master key or corrupted config |
| TypeBox validation fails | Exit with field-level validation errors |
| encryptionKeys field missing from HubConfig | Exit. Hub cannot start without data encryption keys |

All config failures are fatal. The hub cannot operate without valid config. No fallback, no defaults for sensitive values.

Step 4: Postgres Unreachable

| Failure | Behavior |
| --- | --- |
| Connection refused | Exit with error. Do NOT retry indefinitely. |
| Authentication failed | Exit. Wrong credentials in config |
| Database doesn't exist | Exit. The alkdev database must be created before first startup |

No retry loop at startup. If Postgres isn't available, the operator needs to fix it, not wait. Container orchestration (Docker restart policy, systemd) handles restarts. The hub should fail quickly and let the orchestrator retry.

Exception: development convenience. A --wait-for-postgres CLI flag (dev only) can poll with a timeout. This is NOT the default and NOT for production.
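
A dev-only wait of this kind could look roughly like the following. This is a sketch under assumptions: `waitFor` and its parameters are hypothetical, and `probe` stands in for a real connectivity check such as SELECT 1 against the pool.

```typescript
// Hypothetical dev-only helper: poll a connectivity probe until it succeeds
// or the deadline passes. Not intended for production startup.
async function waitFor(
  probe: () => Promise<void>,
  timeoutMs: number,
  intervalMs = 500,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  let lastError: unknown;
  while (true) {
    try {
      await probe();
      return; // dependency is reachable
    } catch (err) {
      lastError = err;
    }
    // Give up if another sleep would run past the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`dependency not reachable within ${timeoutMs}ms: ${String(lastError)}`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

The deadline bounds the polling loop only. A probe that hangs is not interrupted, so the real check would still need its own connection timeout.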

Step 5: Migration Failures

| Failure | Behavior |
| --- | --- |
| Migration SQL error | Exit with error details |
| Schema version conflict | Exit. Manual intervention required |

Migrations are forward-only. No automatic rollback. If a migration fails, the database may be left in an inconsistent state and needs operator attention before the hub is restarted.

Step 6: Redis Unreachable

| Failure | Behavior |
| --- | --- |
| Connection refused | Exit with error |
| Authentication failed | Exit. Wrong password |

Same principle as Postgres — fail fast, let the orchestrator retry.

Step 7: Encryption Key Ring Invalid

| Failure | Behavior |
| --- | --- |
| encryptionKeys field missing from config | Exit. Hub cannot operate without data encryption keys |
| Empty or whitespace-only after decryption | Exit |
| Malformed format (e.g., v1: with empty key) | Exit. Each version must have a valid base64 key |
| Duplicate versions (e.g., v1:abc,v1:def) | Exit. Versions must be unique |
| Non-sequential versions (e.g., v1:abc,v3:def) | Exit. Versions must be consecutive integers starting from 1 |
| Invalid base64 in key value | Exit. Keys must be valid base64-encoded 32-byte values |

These validations run in resolveEncryptionKeys (see hub-config.md § Interfaces).
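
The checks in the table can be sketched as a single parsing pass. This is illustrative only: the real resolveEncryptionKeys is specified in hub-config.md and may differ. `parseKeyRing` is a hypothetical name, and the comma-separated `v<N>:<base64>` input format is an assumption inferred from the examples in the table.

```typescript
// Hypothetical sketch of the key ring validations, not the real implementation.
// Assumed input format: "v1:<base64>,v2:<base64>".
// 32 bytes encode to 43 base64 characters plus one "=" of padding.
const BASE64_32_BYTES = /^[A-Za-z0-9+/]{43}=$/;

function parseKeyRing(raw: string | undefined): Map<number, string> {
  if (raw === undefined) throw new Error("encryptionKeys is missing");
  if (raw.trim() === "") throw new Error("encryptionKeys is empty");

  const ring = new Map<number, string>();
  const entries = raw.split(",");
  for (let i = 0; i < entries.length; i++) {
    // Error messages cite the entry position, never the key material itself.
    const match = /^v(\d+):(.*)$/.exec(entries[i].trim());
    if (!match) throw new Error(`entry ${i + 1} is malformed`);
    const version = Number(match[1]);
    const key = match[2] ?? "";
    if (key === "") throw new Error(`v${version} has an empty key`);
    if (!BASE64_32_BYTES.test(key)) throw new Error(`v${version} is not a base64-encoded 32-byte key`);
    if (ring.has(version)) throw new Error(`duplicate version v${version}`);
    // Versions must be consecutive and start at 1: v1, v2, v3, ...
    if (version !== ring.size + 1) throw new Error(`non-sequential version v${version}`);
    ring.set(version, key);
  }
  return ring;
}
```

The regular expression only checks that a key has the shape of a 32-byte base64 value. It does not decode it.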

Step 9: Subsystem Failures

Subsystem initialization failures (e.g., keypal can't initialize, operation scan fails) should log the error and exit. Partial initialization is not acceptable — if the operation registry can't scan, the hub can't serve requests.

Readiness Contract

Health Check Endpoint

GET /health returns:

  • 200 OK with { "status": "ok" } only after all startup steps complete
  • 503 Service Unavailable with { "status": "starting", "step": "<current-step>" } during startup
  • 503 Service Unavailable with { "status": "shutting_down" } during graceful shutdown
  • 503 Service Unavailable with { "status": "degraded", "issues": [...] } if a post-startup subsystem fails

Step names (used in the step field during startup): resolve-config, load-config, init-logger, connect-postgres, run-migrations, connect-redis, init-keyring, init-drizzle, init-subsystems, start-server, ready

Runtime liveness: After startup completes, /health also performs lightweight liveness checks:

  • Postgres: SELECT 1 (timeout: 2s)
  • Redis: PING (timeout: 1s)
  • If either fails, return 503 { "status": "degraded", "issues": ["postgres: unreachable"] }
  • Liveness checks run on each /health request (not cached, not background-polled)
  • If the hub is in degraded state and the subsystem recovers, the next /health request returns 200
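
The response rules above can be sketched as one decision function with the probes injected. This is a minimal sketch, not the hub's handler: `checkHealth`, `withTimeout`, and the `phase` shape are hypothetical, and the "redis: unreachable" issue string is assumed by analogy with the Postgres example.

```typescript
// Hypothetical sketch of the /health decision logic. Probes are injected so the
// function does not depend on any particular Postgres or Redis client.
type Probe = () => Promise<unknown>;

interface HealthResult {
  status: number;
  body: { status: string; step?: string; issues?: string[] };
}

// Reject if the probe does not settle within `ms` milliseconds.
function withTimeout(probe: Probe, ms: number): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error("timeout")), ms);
    probe().then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

async function checkHealth(
  phase: { kind: "starting"; step: string } | { kind: "ready" } | { kind: "shutting_down" },
  probes: { postgres: Probe; redis: Probe },
): Promise<HealthResult> {
  if (phase.kind === "starting") {
    return { status: 503, body: { status: "starting", step: phase.step } };
  }
  if (phase.kind === "shutting_down") {
    return { status: 503, body: { status: "shutting_down" } };
  }
  // Checks run on every request, so a recovered dependency yields 200 again.
  const issues: string[] = [];
  await withTimeout(probes.postgres, 2000).catch(() => issues.push("postgres: unreachable"));
  await withTimeout(probes.redis, 1000).catch(() => issues.push("redis: unreachable"));
  return issues.length === 0
    ? { status: 200, body: { status: "ok" } }
    : { status: 503, body: { status: "degraded", issues } };
}
```

In a Hono route, the result would map onto the JSON body and status code of the response.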

Docker health check configuration:

HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1

Dependency Wait Pattern

Other services (spokes, MCP clients) should NOT connect until /health returns 200. Docker Compose depends_on with condition: service_healthy handles this:

services:
  hub:
    # ... hub config ...
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 5s
      retries: 30

  spoke:
    depends_on:
      hub:
        condition: service_healthy

Graceful Shutdown

main.ts should register signal handlers (SIGTERM, SIGINT) and use the Hub returned by startHub to run the graceful shutdown sequence:

1. Set /health to return 503 { "status": "shutting_down" }
2. Stop accepting new HTTP connections
3. Stop accepting new WebSocket connections
4. Abort in-flight calls dispatched to spokes (call protocol cascading)
5. Drain in-flight HTTP requests (timeout: 10s)
6. Close WebSocket connections to spokes (send close frames)
7. Shut down AI SDK session system (cancel in-flight streams)
8. Shut down Keypal (flush any pending audit log writes)
9. Close Redis connection
10. Close Postgres connection pool (wait for active queries, timeout: 10s)
11. Flush and close logtape sinks (final log entries)
12. Exit with 0

The shutdown sequence mirrors the startup sequence in reverse order — resources initialized last are closed first (HTTP/WebSocket before DB connections), and resources that depend on others are closed before their dependencies.

Timeout: If graceful shutdown doesn't complete in 30 seconds, force exit with 1. This prevents zombie processes.
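
An ordered shutdown under one overall deadline could be sketched as follows. The names `ShutdownStep`, `runAll`, and `shutdown` are hypothetical. Continuing past a failed step is an assumption of this sketch, not something the sequence above specifies.

```typescript
// Hypothetical sketch: run shutdown steps in order under one overall deadline.
// The caller maps the returned value to the process exit code.
interface ShutdownStep {
  name: string;
  run: () => Promise<void>;
}

async function runAll(steps: ShutdownStep[]): Promise<void> {
  for (const step of steps) {
    try {
      await step.run();
    } catch (err) {
      // Assumption: a failed step is logged and the remaining resources are still closed.
      console.error(`shutdown step "${step.name}" failed:`, err);
    }
  }
}

async function shutdown(steps: ShutdownStep[], deadlineMs = 30000): Promise<0 | 1> {
  const outcome = await new Promise<"done" | "timeout">((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), deadlineMs);
    runAll(steps).then(() => {
      clearTimeout(timer);
      resolve("done");
    });
  });
  // 0 when every step finished in time, 1 when the deadline forced the exit.
  return outcome === "done" ? 0 : 1;
}
```

In main.ts this might be wired up with Deno.addSignalListener for SIGTERM and SIGINT, passing the result to Deno.exit.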

The startHub Function

The architecturally significant interface:

interface HubStartOptions {
  configPath: string;       // /etc/alkhub/config.json
  masterKeyPath: string;    // /run/secrets/hub_master_key
}

interface Hub {
  config: HubConfig;               // Fully-resolved, validated config
  db: DrizzleClient;               // Drizzle + Postgres
  redis: RedisClient;              // Redis connection
  keyRing: EncryptionKeyRing;      // Data encryption key ring
  operations: OperationRegistry;   // Scanned hub operations
  keypal: KeypalClient;            // API key management
  pubsub: PubSubClient;            // Redis-backed pub/sub
  server: HonoServer;              // HTTP + WebSocket server
}

async function startHub(options: HubStartOptions): Promise<Hub> {
  // Steps 1-11 in sequence (see Startup Sequence)
  // Steps happen sequentially, but subsystems are constructed inside startHub
  // and wired via closure/DI to each other.
  // The returned Hub object provides access to all initialized subsystems.
  // startHub does NOT register signal handlers — the caller (main.ts) does,
  // using the returned Hub to orchestrate graceful shutdown.
}

main.ts resolves defaults before calling startHub:

const options: HubStartOptions = {
  configPath: Deno.env.get("ALKHUB_CONFIG_PATH") || "/etc/alkhub/config.json",
  masterKeyPath: Deno.env.get("ALKHUB_MASTER_KEY_PATH") || "/run/secrets/hub_master_key",
};
const hub = await startHub(options);
// Register signal handlers using hub for graceful shutdown

The ALKHUB_CONFIG_PATH and ALKHUB_MASTER_KEY_PATH env vars are resolved by main.ts, not by startHub. The startup function takes explicit paths and has no env var dependency.

Design Decisions

D1: Fail-fast, no retry loops

Context: Some services implement exponential backoff retry during startup (e.g., wait for Postgres to become available).

Decision: No retry loops. Fail immediately and let the container orchestrator restart.

Rationale: In Docker, the orchestrator already handles restart timing and backoff. Adding retry logic inside the application duplicates this and makes startup behavior harder to reason about. Quick failures give the operator clear signal — "Postgres is not running, go fix it" vs. "waiting... waiting... waiting..." with no visibility.

D2: Sequential initialization, not parallel

Context: Steps 4 (Postgres) and 6 (Redis) are independent and could run in parallel.

Decision: Start with sequential initialization. Parallel is a future optimization.

Rationale: Sequential startup is deterministic — the same failure always appears at the same step. Parallel initialization introduces race conditions in error handling (what if Postgres fails and Redis succeeds?). The startup cost is dominated by network round-trips (< 100ms for local connections), so the latency savings from parallelism are negligible.

D3: No module-scope side effects

Context: Some frameworks initialize database connections at module import time (e.g., export const db = drizzle(pool) at module top level).

Decision: All initialization happens inside startHub. Modules export factories or constructors, not singletons.

Rationale: Module-scope side effects make startup order implicit (import order matters), prevent testing with different configs, and complicate graceful shutdown (no single owner knows which connections were opened at import time, or in what order to close them). The startHub function makes the sequence explicit and testable.
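
The difference can be illustrated with a module that exports a factory rather than a live connection. This is a sketch with hypothetical names (`Pool`, `DbConfig`, `createDb`). The `connect` parameter stands in for whatever driver call actually opens the pool.

```typescript
// Hypothetical sketch: the module exports a factory, so nothing connects at import time.
interface Pool {
  end(): Promise<void>;
}

interface DbConfig {
  url: string;
}

// Avoided pattern: `export const db = connect(url)` at module top level,
// which opens a connection as a side effect of importing the module.
function createDb(
  config: DbConfig,
  connect: (url: string) => Pool,
): { pool: Pool; close: () => Promise<void> } {
  const pool = connect(config.url); // the connection is opened only when the caller asks
  return {
    pool,
    close: () => pool.end(), // the caller that opened the pool also owns closing it
  };
}
```

A test can call createDb with a fake `connect` and a throwaway config, which a module-level singleton would not allow.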

D4: Health check reflects startup progress

Context: The health endpoint could either return 503 until fully ready, or return 200 once the HTTP server is listening.

Decision: Return 503 with progress information until all startup steps complete.

Rationale: A spoke or client connecting to a partially-initialized hub will get errors (can't decrypt secrets, can't query database). The 503 response with the current step gives clients and orchestrators clear information about when to retry. The step field uses the step names defined in the Readiness Contract section.

Open Questions

  1. Background migration vs. startup migration — Should migrations block startup, or should they run in the background while the hub serves with the old schema? Blocking is simpler and safer. Background migration requires schema version negotiation. Recommendation: Block for now; revisit if startup latency becomes a problem with large migrations.

  2. Config reload signal — Could SIGHUP trigger re-reading the config file for non-encrypted fields (logLevel, cache TTLs)? Encrypted fields would need the master key to remain in memory. This is a future enhancement; startup currently reads config once.

  3. Hot spare / zero-downtime restart — For production deployments, can we start a new hub process before shutting down the old one? This requires connection draining and session transfer. Deferred — the hub is a single-instance service for now (see infrastructure.md).

  4. Startup observability — Should the startup sequence emit events (pub/sub) so monitoring systems can track startup progress? Or is the /health endpoint sufficient? Recommendation: /health endpoint for now; structured log messages at each step for debugging.

References