---
status: draft
last_updated: 2026-05-18
---

# Hub Startup Sequence

## Overview

The hub startup is an ordered process that resolves configuration, connects to infrastructure services, initializes subsystems, and begins serving requests. This document specifies the sequence, failure modes, and readiness contract. The config system it depends on is defined in [hub-config.md](hub-config.md).

## Design Principles

1. **Fail fast on missing prerequisites** — If the master key, config file, Postgres, or Redis is unavailable, the hub MUST NOT start in a degraded state. Partial availability is worse than no availability.
2. **Config before connections** — All configuration is resolved and validated before any network connections are made. This prevents partial-initialization states where some subsystems are connected and others aren't.
3. **Ordered, not parallel** — Startup steps are sequential. Each step confirms success before the next begins. This makes startup deterministic and debuggable. Parallel initialization can be added later if startup latency becomes a problem, but correctness trumps speed.
4. **Single entry point** — One function (`startHub`) owns the sequence. No scattered initialization across module scopes or top-level side effects.

## Startup Sequence

```
Step 1: Resolve Config Paths
  │   Determine config file path and master key path.
  │   Defaults: /etc/alkhub/config.json, /run/secrets/hub_master_key
  │   Overrides: ALKHUB_CONFIG_PATH and ALKHUB_MASTER_KEY_PATH env vars
  │   (non-sensitive paths, so env vars are acceptable).
  │   Fail if files don't exist.
  │
Step 2: Load and Decrypt Config
  │   loadConfig(configPath, masterKeyPath) → HubConfig
  │   Reads master key, decrypts _encrypted fields, validates with TypeBox.
  │   Fail if master key is missing, config is invalid JSON, decryption
  │   fails, or TypeBox validation fails.
  │
Step 3: Initialize Logger
  │   Configure logtape with HubConfig.logLevel.
  │   Sink: stdout, structured JSON in production, pretty-print in development.
  │   Production vs. dev determined by HubConfig.development flag (see hub-config.md).
  │   From this point, structured logging is available for all subsequent steps.
  │
Step 4: Connect to Postgres
  │   Create connection pool using HubConfig.postgres.
  │   Verify connectivity: SELECT 1.
  │   Fail if connection is refused or authentication fails.
  │
Step 5: Run Migrations
  │   Run pending Drizzle migrations against Postgres
  │   using drizzle-orm's programmatic migrator (not drizzle-kit CLI).
  │   Migrations are SQL files from the ./migrations directory.
  │   Fail if migrations fail (schema mismatch, SQL errors).
  │   If the hub crashes mid-migration, the Drizzle migration table
  │   tracks which migrations completed. On next startup, migrations
  │   resume from the last completed step. Partial migrations require
  │   manual operator attention only if a SQL statement fails mid-transaction.
  │
Step 6: Connect to Redis
  │   Create Redis client using HubConfig.redis.
  │   Verify connectivity: PING.
  │   Fail if connection is refused or authentication fails.
  │
Step 7: Initialize Encryption Key Ring
  │   resolveEncryptionKeys(HubConfig.encryptionKeys) → EncryptionKeyRing
  │   Validates that at least one key exists and that key versions are sequential.
  │   The key ring is used by client_secrets operations and key rotation.
  │
Step 8: Initialize Drizzle Client
  │   Create Drizzle ORM client wrapping the Postgres pool + schema.
  │   Schema namespace loaded from src/storage/schema.ts.
  │
Step 9: Initialize Subsystems
  │   Each subsystem has its own architecture doc for details.
  │   Initialization here creates and wires the runtime objects.
  │   ├── Operation Registry: scan hub operation directories
  │   ├── Keypal: initialize with HubKeyStorage (Drizzle adapter)
  │   │   └── apiKeyCacheTtl from HubConfig.auth configures RedisCache TTL
  │   ├── PubSub: create with RedisEventTarget from `@alkdev/pubsub`, using the `prefix` option (see pubsub-redis.md)
  │   ├── Call Protocol: PendingRequestMap + CallHandler (from `@alkdev/operations`, see call-graph.md)
  │   └── Session System: AI SDK configuration (see agent-sessions.md)
  │       └── LLM provider keys are resolved from client_secrets at runtime
  │
Step 10: Start Hono HTTP Server + WebSocket Listener
  │   Listen on HubConfig.http.host:HubConfig.http.port.
  │   Register all HTTP routes and middleware.
  │   Register the /ws WebSocket upgrade route.
  │   On WS upgrade: authenticate spoke, create WebSocketEventTarget,
  │   register in RunnerPool. (This is a single Hono route, not a
  │   separate server — the WS handler rides on the same HTTP listener.)
  │
Step 11: Signal Ready
      Health check endpoint (/health) starts returning 200.
      Startup is complete. The hub is serving.
```
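
The ordered, fail-fast shape of this sequence can be sketched as a small step runner. This is a minimal illustration, not the hub's implementation: `StartupStep`, `StartupError`, and `runStartupSteps` are hypothetical names, and the `onStep` callback stands in for whatever updates the `step` field reported by `/health`.

```ts
// Hypothetical sketch: the names below are illustrative, not part of the hub codebase.
interface StartupStep {
  name: string;              // e.g. "connect-postgres"
  run: () => Promise<void>;  // resolves on success, rejects on failure
}

class StartupError extends Error {
  step: string;
  constructor(step: string, cause: unknown) {
    const reason = cause instanceof Error ? cause.message : String(cause);
    super(`startup failed at step "${step}": ${reason}`);
    this.name = "StartupError";
    this.step = step;
  }
}

// Runs steps strictly in order and stops at the first failure.
async function runStartupSteps(
  steps: StartupStep[],
  onStep: (name: string) => void = () => {},
): Promise<void> {
  for (const step of steps) {
    onStep(step.name); // record the step currently in progress
    try {
      await step.run();
    } catch (err) {
      // Fail fast: later steps never run.
      throw new StartupError(step.name, err);
    }
  }
}
```

A caller such as `main.ts` would catch the `StartupError`, log it, and exit with a non-zero code so the orchestrator can restart the process.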

## Failure Modes

### Steps 1-2: Config Resolution Failures

| Failure | Behavior |
|---------|----------|
| Config file not found | Exit with error message including expected path |
| Master key file not found | Exit with error message including expected path |
| Master key is empty or whitespace | Exit — key must be non-empty |
| Config file is invalid JSON | Exit with parse error details |
| Decryption of `_encrypted` field fails | Exit — wrong master key or corrupted config |
| TypeBox validation fails | Exit with field-level validation errors |
| `encryptionKeys` field missing from HubConfig | Exit — hub cannot start without data encryption keys |

**All config failures are fatal.** The hub cannot operate without valid config. No fallback, no defaults for sensitive values.
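
As an illustration of how these failures might map to distinct, path-bearing error messages, the sketch below validates raw file contents that have already been read from disk. `ConfigError`, `normalizeMasterKey`, and `parseConfigJson` are hypothetical helpers. The real `loadConfig` also decrypts `_encrypted` fields and validates with TypeBox, which this sketch does not attempt; the `encryptionKeys` presence check is only a stand-in for schema validation.

```ts
// Hypothetical helpers; not the actual loadConfig implementation.
class ConfigError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "ConfigError";
  }
}

// Rejects an empty or whitespace-only master key.
// Returning the trimmed value is an assumption of this sketch.
function normalizeMasterKey(raw: string, path: string): string {
  const key = raw.trim();
  if (key.length === 0) {
    throw new ConfigError(`master key at ${path} is empty`);
  }
  return key;
}

// Parses config text and performs a minimal shape check.
function parseConfigJson(text: string, path: string): Record<string, unknown> {
  let parsed: unknown;
  try {
    parsed = JSON.parse(text);
  } catch (err) {
    const detail = err instanceof Error ? err.message : String(err);
    throw new ConfigError(`config at ${path} is not valid JSON: ${detail}`);
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new ConfigError(`config at ${path} must be a JSON object`);
  }
  const config = parsed as Record<string, unknown>;
  // Stand-in for TypeBox validation of a required field.
  if (!("encryptionKeys" in config)) {
    throw new ConfigError(`config at ${path} is missing "encryptionKeys"`);
  }
  return config;
}
```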

### Step 4: Postgres Unreachable

| Failure | Behavior |
|---------|----------|
| Connection refused | Exit with error. Do NOT retry indefinitely. |
| Authentication failed | Exit — wrong credentials in config |
| Database doesn't exist | Exit — the `alkdev` database must be created before first startup |

**No retry loop at startup.** If Postgres isn't available, the operator needs to fix it, not wait. Container orchestration (Docker restart policy, systemd) handles restarts. The hub should fail quickly and let the orchestrator retry.

**Exception: development convenience.** A `--wait-for-postgres` CLI flag (dev only) can poll with a timeout. This is NOT the default and NOT for production.
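
A dev-only wait could be a bounded poll around the same connectivity probe used in step 4. The helper below is a hypothetical sketch: `waitFor`, its parameters, and the probe signature are assumptions, and this document does not specify the flag's timeout or polling interval.

```ts
// Hypothetical dev-only helper: polls a probe until it succeeds or the timeout elapses.
async function waitFor(
  probe: () => Promise<void>, // e.g. a function that runs SELECT 1
  timeoutMs: number,
  intervalMs: number,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  let lastError: unknown;
  while (true) {
    try {
      await probe();
      return; // dependency is reachable
    } catch (err) {
      lastError = err;
    }
    // Give up if another sleep would pass the deadline.
    if (Date.now() + intervalMs > deadline) {
      throw new Error(
        `dependency not reachable after ${timeoutMs}ms: ${String(lastError)}`,
      );
    }
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}
```

Because the poll is bounded, a dev run still fails with a clear error instead of hanging when the database never comes up.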

### Step 5: Migration Failures

| Failure | Behavior |
|---------|----------|
| Migration SQL error | Exit with error details |
| Schema version conflict | Exit — manual intervention required |

Migrations are forward-only, with no automatic rollback. If a migration fails, the hub exits and the database may be left partially migrated, so an operator must inspect it before the next startup.

### Step 6: Redis Unreachable

| Failure | Behavior |
|---------|----------|
| Connection refused | Exit with error |
| Authentication failed | Exit — wrong password |

Same principle as Postgres — fail fast, let the orchestrator retry.

### Step 7: Encryption Key Ring Invalid

| Failure | Behavior |
|---------|----------|
| `encryptionKeys` field missing from config | Exit — hub cannot operate without data encryption keys |
| Empty or whitespace-only after decryption | Exit |
| Malformed format (e.g., `v1:` with empty key) | Exit — each version must have a valid base64 key |
| Duplicate versions (e.g., `v1:abc,v1:def`) | Exit — versions must be unique |
| Non-sequential versions (e.g., `v1:abc,v3:def`) | Exit — versions must be consecutive integers starting from 1, in ascending order |
| Invalid base64 in key value | Exit — keys must be valid base64-encoded 32-byte values |

These validations run in `resolveEncryptionKeys` (see [hub-config.md](hub-config.md) § Interfaces).
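
The rules in the table can be illustrated with a small parser. This is a sketch under assumptions, not the actual `resolveEncryptionKeys`: the comma-separated `v<N>:<base64>` format is inferred from the examples in the table, `parseKeyRing` and `ParsedKeyRing` are hypothetical names, and treating the highest version as the current key is a guess. Error messages name only the version or entry position so that key material never reaches the logs.

```ts
// Hypothetical sketch of the key ring validations; not the real resolveEncryptionKeys.
interface ParsedKeyRing {
  current: number;               // highest version (assumed to be the active key)
  keys: Map<number, Uint8Array>; // version -> 32-byte key
}

function parseKeyRing(raw: string): ParsedKeyRing {
  if (raw.trim().length === 0) {
    throw new Error("encryptionKeys is empty");
  }
  const keys = new Map<number, Uint8Array>();
  raw.split(",").forEach((entry, index) => {
    const match = /^v(\d+):(.+)$/.exec(entry.trim());
    if (!match) {
      throw new Error(`entry ${index + 1} is malformed (expected v<N>:<base64>)`);
    }
    const version = Number(match[1]);
    if (keys.has(version)) {
      throw new Error(`duplicate version v${version}`);
    }
    let bytes: Uint8Array;
    try {
      // atob is a global in Deno, browsers, and recent Node versions.
      bytes = Uint8Array.from(atob(match[2]), (c) => c.charCodeAt(0));
    } catch {
      throw new Error(`v${version}: key is not valid base64`);
    }
    if (bytes.length !== 32) {
      throw new Error(`v${version}: key must decode to 32 bytes, got ${bytes.length}`);
    }
    keys.set(version, bytes);
  });
  // Versions must be 1, 2, 3, ... in the order listed.
  Array.from(keys.keys()).forEach((version, index) => {
    if (version !== index + 1) {
      throw new Error(`non-sequential versions: expected v${index + 1}, got v${version}`);
    }
  });
  return { current: keys.size, keys };
}
```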

### Step 9: Subsystem Failures

Subsystem initialization failures (e.g., keypal can't initialize, operation scan fails) should log the error and exit. Partial initialization is not acceptable — if the operation registry can't scan, the hub can't serve requests.

## Readiness Contract

### Health Check Endpoint

`GET /health` returns:
- `200 OK` with `{ "status": "ok" }` **only after** all startup steps complete
- `503 Service Unavailable` with `{ "status": "starting", "step": "<current-step>" }` during startup
- `503 Service Unavailable` with `{ "status": "shutting_down" }` during graceful shutdown
- `503 Service Unavailable` with `{ "status": "degraded", "issues": [...] }` if a post-startup subsystem fails

**Step names** (used in the `step` field during startup):
`resolve-config`, `load-config`, `init-logger`, `connect-postgres`, `run-migrations`, `connect-redis`, `init-keyring`, `init-drizzle`, `init-subsystems`, `start-server`, `ready`

**Runtime liveness**: After startup completes, `/health` also performs lightweight liveness checks:
- Postgres: `SELECT 1` (timeout: 2s)
- Redis: `PING` (timeout: 1s)
- If either fails, return `503 { "status": "degraded", "issues": ["postgres: unreachable"] }`
- Liveness checks run on each `/health` request (not cached, not background-polled)
- If the hub is in degraded state and the subsystem recovers, the next `/health` request returns 200
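
The response rules above can be sketched as a function from hub state and two injected probes to a status code and body. `HealthState`, `checkHealth`, and `withTimeout` are hypothetical names, and running the two probes concurrently is an assumption, since this document specifies only the per-probe timeouts.

```ts
// Hypothetical sketch of the /health decision logic.
type HealthBody =
  | { status: "ok" }
  | { status: "starting"; step: string }
  | { status: "shutting_down" }
  | { status: "degraded"; issues: string[] };

interface HealthState {
  phase: "starting" | "ready" | "shutting_down";
  step: string; // current startup step name, e.g. "run-migrations"
}

interface Probes {
  postgres: () => Promise<unknown>; // e.g. SELECT 1
  redis: () => Promise<unknown>;    // e.g. PING
}

// Rejects if the promise does not settle within ms. The probe itself is not cancelled.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

async function checkHealth(
  state: HealthState,
  probes: Probes,
): Promise<{ code: 200 | 503; body: HealthBody }> {
  if (state.phase === "starting") {
    return { code: 503, body: { status: "starting", step: state.step } };
  }
  if (state.phase === "shutting_down") {
    return { code: 503, body: { status: "shutting_down" } };
  }
  // Ready: run both liveness probes on every request (no caching).
  const names = ["postgres", "redis"];
  const results = await Promise.allSettled([
    withTimeout(probes.postgres(), 2000),
    withTimeout(probes.redis(), 1000),
  ]);
  const issues = results.flatMap((result, i) =>
    result.status === "rejected" ? [`${names[i]}: unreachable`] : []
  );
  return issues.length === 0
    ? { code: 200, body: { status: "ok" } }
    : { code: 503, body: { status: "degraded", issues } };
}
```

Because nothing is cached, a recovered dependency is reflected on the very next request, which matches the recovery rule in the list above.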

Docker health check configuration:

```dockerfile
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl -f http://localhost:3000/health || exit 1
```

### Dependency Wait Pattern

Other services (spokes, MCP clients) should NOT connect until `/health` returns 200. Docker Compose `depends_on` with `condition: service_healthy` handles this:

```yaml
services:
  hub:
    # ... hub config ...
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 5s
      retries: 30

  spoke:
    depends_on:
      hub:
        condition: service_healthy
```

## Graceful Shutdown

The caller of `startHub` (`main.ts`) registers signal handlers (SIGTERM, SIGINT) that run the graceful shutdown sequence:

```
1. Set /health to return 503 { "status": "shutting_down" }
2. Stop accepting new HTTP connections
3. Stop accepting new WebSocket connections
4. Abort in-flight calls dispatched to spokes (call protocol cascading)
5. Drain in-flight HTTP requests (timeout: 10s)
6. Close WebSocket connections to spokes (send close frames)
7. Shut down AI SDK session system (cancel in-flight streams)
8. Shut down Keypal (flush any pending audit log writes)
9. Close Redis connection
10. Close Postgres connection pool (wait for active queries, timeout: 10s)
11. Flush and close logtape sinks (final log entries)
12. Exit with 0
```

The shutdown sequence roughly mirrors the startup sequence in reverse: resources initialized last are closed first (HTTP/WebSocket before the Redis and Postgres connections), and resources that depend on others are closed before their dependencies.

**Timeout**: If graceful shutdown doesn't complete within 30 seconds, force exit with code 1. This prevents a hung shutdown step from keeping the process alive indefinitely.
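
One way to enforce the overall deadline is a watchdog timer around the ordered shutdown steps. The sketch below is illustrative only: `ShutdownStep` and `runShutdown` are hypothetical names, the `exit` function is injected so the logic can be tested (in `main.ts` it would be something like `Deno.exit`), and continuing past a failed step so that later resources still get closed is an assumption this document does not state.

```ts
// Hypothetical sketch of a deadline-bounded shutdown runner.
interface ShutdownStep {
  name: string;
  run: () => Promise<void>;
}

async function runShutdown(
  steps: ShutdownStep[],
  deadlineMs: number,
  exit: (code: number) => void,
): Promise<void> {
  let exited = false;
  const finish = (code: number) => {
    if (exited) return; // only the first exit request counts
    exited = true;
    exit(code);
  };

  // Watchdog: force a non-zero exit if the sequence overruns the deadline.
  const watchdog = setTimeout(() => finish(1), deadlineMs);

  for (const step of steps) {
    if (exited) break; // deadline already fired (only observable with a fake exit)
    try {
      await step.run();
    } catch (err) {
      // Assumption: log and continue so remaining resources are still released.
      console.error(`shutdown step "${step.name}" failed:`, err);
    }
  }

  clearTimeout(watchdog);
  finish(0);
}
```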

## The `startHub` Function

The architecturally significant interface:

```ts
interface HubStartOptions {
  configPath: string;       // /etc/alkhub/config.json
  masterKeyPath: string;    // /run/secrets/hub_master_key
}

interface Hub {
  config: HubConfig;               // Fully-resolved, validated config
  db: DrizzleClient;               // Drizzle + Postgres
  redis: RedisClient;              // Redis connection
  keyRing: EncryptionKeyRing;      // Data encryption key ring
  operations: OperationRegistry;   // Scanned hub operations
  keypal: KeypalClient;            // API key management
  pubsub: PubSubClient;            // Redis-backed pub/sub
  server: HonoServer;              // HTTP + WebSocket server
}

async function startHub(options: HubStartOptions): Promise<Hub> {
  // Runs the startup steps in order; each step must succeed before the next begins.
  // Subsystems are constructed inside startHub and wired to each other
  // via closures/DI rather than module-level singletons.
  // The returned Hub object provides access to all initialized subsystems.
  // startHub does NOT register signal handlers: the caller (main.ts) does,
  // using the returned Hub to orchestrate graceful shutdown.
}
```

`main.ts` resolves defaults before calling `startHub`:

```ts
const options: HubStartOptions = {
  configPath: Deno.env.get("ALKHUB_CONFIG_PATH") || "/etc/alkhub/config.json",
  masterKeyPath: Deno.env.get("ALKHUB_MASTER_KEY_PATH") || "/run/secrets/hub_master_key",
};
const hub = await startHub(options);
// Register signal handlers using hub for graceful shutdown
```

The `ALKHUB_CONFIG_PATH` and `ALKHUB_MASTER_KEY_PATH` env vars are resolved by `main.ts`, not by `startHub`: the startup function takes explicit paths and has no env var dependency.

## Design Decisions

### D1: Fail-fast, no retry loops

**Context**: Some services implement exponential backoff retry during startup (e.g., wait for Postgres to become available).

**Decision**: No retry loops. Fail immediately and let the container orchestrator restart.

**Rationale**: In Docker, the orchestrator already handles restart timing and backoff. Adding retry logic inside the application duplicates this and makes startup behavior harder to reason about. Quick failures give the operator clear signal — "Postgres is not running, go fix it" vs. "waiting... waiting... waiting..." with no visibility.

### D2: Sequential initialization, not parallel

**Context**: Steps 4 (Postgres) and 6 (Redis) are independent and could run in parallel.

**Decision**: Start with sequential initialization. Parallel is a future optimization.

**Rationale**: Sequential startup is deterministic — the same failure always appears at the same step. Parallel initialization introduces race conditions in error handling (what if Postgres fails and Redis succeeds?). The startup cost is dominated by network round-trips (< 100ms for local connections), so the latency savings from parallelism are negligible.

### D3: No module-scope side effects

**Context**: Some frameworks initialize database connections at module import time (e.g., `export const db = drizzle(pool)` at module top level).

**Decision**: All initialization happens inside `startHub`. Modules export factories or constructors, not singletons.

**Rationale**: Module-scope side effects make startup order implicit (import order matters), prevent testing with different configs, and complicate graceful shutdown (no single owner holds the handle for a connection opened at import time). The `startHub` function makes the sequence explicit and testable.
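
The difference can be illustrated with a storage module that exports a factory instead of a connected singleton. The `Pool` and `Storage` interfaces and the `createStorage` function are hypothetical stand-ins, not the hub's storage API.

```ts
// Minimal stand-in for a connection pool; the real pool type differs.
interface Pool {
  query(sql: string): Promise<unknown>;
  end(): Promise<void>;
}

interface Storage {
  ping(): Promise<void>;
  close(): Promise<void>;
}

// Factory: importing this module opens no connections.
// The caller (startHub) creates the pool and therefore owns its lifetime.
function createStorage(pool: Pool): Storage {
  return {
    async ping() {
      await pool.query("SELECT 1");
    },
    async close() {
      await pool.end();
    },
  };
}
```

With this shape, a test can pass a fake `Pool`, and shutdown code can call `close()` on the exact instance that startup created.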

### D4: Health check reflects startup progress

**Context**: The health endpoint could either return 503 until fully ready, or return 200 once the HTTP server is listening.

**Decision**: Return 503 with progress information until all startup steps complete.

**Rationale**: A spoke or client connecting to a partially-initialized hub will get errors (can't decrypt secrets, can't query database). The 503 response with the current step gives clients and orchestrators clear information about when to retry. The `step` field uses the step names defined in the Readiness Contract section.

## Open Questions

1. **Background migration vs. startup migration** — Should migrations block startup, or should they run in the background while the hub serves with the old schema? Blocking is simpler and safer. Background migration requires schema version negotiation. **Recommendation**: Block for now; revisit if startup latency becomes a problem with large migrations.

2. **Config reload signal** — Could SIGHUP trigger re-reading the config file for non-encrypted fields (logLevel, cache TTLs)? Encrypted fields would need the master key to remain in memory. This is a future enhancement; startup currently reads config once.

3. **Hot spare / zero-downtime restart** — For production deployments, can we start a new hub process before shutting down the old one? This requires connection draining and session transfer. Deferred — the hub is a single-instance service for now (see infrastructure.md).

4. **Startup observability** — Should the startup sequence emit events (pub/sub) so monitoring systems can track startup progress? Or is the `/health` endpoint sufficient? **Recommendation**: `/health` endpoint for now; structured log messages at each step for debugging.

## References

- [hub-config.md](hub-config.md) — Config system that startup consumes
- [infrastructure.md](infrastructure.md) — Server layout, Docker deployment
- [storage/README.md](storage/README.md) — Drizzle setup, migration strategy
- [spoke-runner.md](spoke-runner.md) — Spoke registration, WebSocket auth
- [pubsub-redis.md](pubsub-redis.md) — Redis EventTarget initialization
- `src/crypto.ts` — Encryption utilities used in config loading